[Wien] 'SECLR4' - POTRF error (again)

Fri Oct 6 10:23:37 CEST 2006

Dear all

Sorry for coming back to this kind of error message again. I have searched the mailing list, and it seems
that this error is usually attributed to a wrong structure file. However, I have never seen a reply
confirming that this was in fact the reason and that a new structure file cured the problem. And my case is
not that clear. I have been running 64-atom supercell calculations, both on a single CPU and in parallel on
20 CPUs (given the k-point grid, this should be more or less fine). Both started without problems; the
single-CPU job is in fact still running. When the parallel run had converged, I saw that there were
non-negligible forces in my scf file, so I tried to run a mini job. The same kind of job works perfectly
with smaller systems, say, 16-atom supercells. This second job started OK, see below

 > PBS Job Id: 215051.nid00003
 > Job Name:   Fe64Cr1m
 > Execution terminated
 > Exit_status=9
 > resources_used.cpupercent=50
 > resources_used.cput=02:01:21
 > resources_used.mem=174356kb
 > resources_used.ncpus=1
 > resources_used.vmem=535764kb
 > resources_used.walltime=06:52:58

and successfully reached the end of the first cycle it had to complete; that is, I had a case_1.scf file in
my running directory.

However, when the second cycle started, the SECLR4 error appeared again; see the tail of the stderr file:

 >>>  (mini) arrived at end -> exit (this stands for the first mini cycle)
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 > SECLR4 - Error
 > SECLR4 - Error
 > SECLR4 - Error
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 > === BEGIN EPILOGUE ===
 > Fri Oct  6 06:37:30 CEST 2006
 > JID: 215051.nid00003
 > EUID: iglesias
 > EGID: psi
 > Job Name: Fe64Cr1m
 > SID: 17809
 > Requested limits: admin_cookie=170226163,ncpus=1,partition_id=221261,size=20,walltime=08:00:00
 > Resources used: cpupercent=50,cput=02:01:21,mem=174356kb,ncpus=1,vmem=535764kb,walltime=06:52:58
 > Queue: normal
 > Accounting string: DEFAULT
 > === END EPILOGUE ===

And when I try to restart it, I get the same error in lapw1 every time, and always with the same walltime of
exactly 6 minutes (00:06:00):

 > PBS Job Id: 216025.nid00003
 > Job Name:   Fe63Cr1m2
 > Execution terminated
 > Exit_status=9
 > resources_used.cpupercent=39
 > resources_used.cput=00:00:56
 > resources_used.mem=174400kb
 > resources_used.ncpus=1
 > resources_used.vmem=535664kb
 > resources_used.walltime=00:06:00

and stderr:

 > === END PROLOGUE ===
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 >  LAPW0 END
 > SECLR4 - Error
 > SECLR4 - Error
 > SECLR4 - Error
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 >  LAPW1 END
 > === BEGIN EPILOGUE ===
 > Fri Oct  6 09:23:58 CEST 2006
 > JID: 216114.nid00003
 > EUID: iglesias
 > EGID: psi
 > Job Name: Fe63Cr1m2
 > SID: 5834
 > Requested limits: admin_cookie=475063425,ncpus=1,partition_id=222162,size=20,walltime=08:00:00
 > Resources used: cpupercent=12,cput=00:00:55,mem=174320kb,ncpus=1,vmem=535664kb,walltime=00:06:00
 > Queue: normal
 > Accounting string: DEFAULT
 > === END EPILOGUE ===

The uplapw1.err file looks like:

**  Error in Parallel LAPW1
**  LAPW1 STOPPED at Fri Oct 6 09:23:58 CEST 2006
**  check ERROR FILES!
  Cholesky INFO =          4985
  'SECLR4' - POTRF (Scalapack/LAPACK) failed.
  Cholesky INFO =          4997
  'SECLR4' - POTRF (Scalapack/LAPACK) failed.
  Cholesky INFO =          4999
  'SECLR4' - POTRF (Scalapack/LAPACK) failed.
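
For reference, a note on how the INFO value can be read. In LAPACK's POTRF, a positive INFO = i means that
the leading minor of order i is not positive definite, so the Cholesky factorization could not be completed.
Assuming that the matrix factorized in SECLR4 is the overlap matrix (which is my understanding, not something
shown in the output above), INFO = 4985 would mean the factorization broke down on reaching row 4985. The
following is only a minimal pure-Python sketch of that convention on a tiny matrix; it is not WIEN2k or
LAPACK code, and cholesky_info and the example matrices are hypothetical names introduced for illustration.

```python
def cholesky_info(a):
    """Lower Cholesky factorization of a symmetric matrix given as nested lists.

    Hypothetical helper that mimics the LAPACK POTRF return convention:
    info == 0     -> factorization completed
    info == i > 0 -> leading minor of order i is not positive definite
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Pivot for column j; it must be strictly positive to continue.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            return L, j + 1  # 1-based index, as reported by POTRF
        L[j][j] = d ** 0.5
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L, 0


# A well-conditioned symmetric positive definite matrix.
good = [[4.0, 2.0],
        [2.0, 3.0]]

# Toy "overlap" matrix in which basis functions 2 and 3 are identical,
# i.e. the basis is linearly dependent and the matrix is singular.
dependent = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 1.0],
             [0.0, 1.0, 1.0]]

_, info_good = cholesky_info(good)
_, info_dependent = cholesky_info(dependent)
print("INFO (good)      =", info_good)
print("INFO (dependent) =", info_dependent)
```

In this toy case the failure is reported at the row where the duplicated function enters, which is why the
size of INFO relative to the matrix dimension may give a hint about where a linear dependence shows up.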

This is indeed strange! The calculation was proceeding correctly, and now it cannot continue. So I think
there is nothing wrong with my system itself. By the way, with a 128-atom supercell this problem appears
right from the start.

Any suggestions?

Cheers

Roberto

-- 
------------------------------------------
Roberto Iglesias
High Temperature Materials Project
Laboratory for Materials Behaviour
Nuclear Energy and Safety Department
OHLD/013
PAUL SCHERRER INSTITUT
CH-5232 Villigen PSI
phone: +41 (0)56 310 54 81
fax:   +41 (0)56 310 35 65
e-mail: roberto.iglesias at psi.ch
Internet: www.psi.ch
-----------------------------------------