[Wien] errors in lapw

Bin Shao binshao1118 at gmail.com
Fri Feb 3 13:56:45 CET 2012


Dear all,

I am running WIEN2k 11.1 on a cluster with CentOS 6 under a PBS queuing
system. The job is submitted in k-point parallel mode, and the 36 k-points
in total are distributed over 16 CPUs. However, errors occur in lapw2, and
the dnlapw2_18/19/20.error files are not empty. At the same time, the job
in the PBS system appears to be dead and cannot be killed with the PBS
commands. The administrator checked the compute node, and top shows that
the node is under a very heavy load, above 40. Furthermore, ps aux shows
that there are 16 lapw2 processes, but they are not running (that is, they
are suspended). The jobs caused a heavy load and triggered the
self-protection mechanism of the OS, which automatically suspends any
running process, including ssh logins, except those of the root account.
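
For reference, a minimal sketch of how the non-empty error files can be listed in the case directory. It assumes only that the error files end in ".error", as dnlapw2_18.error does; the function name nonempty_error_files is hypothetical and not part of WIEN2k.

```python
from pathlib import Path


def nonempty_error_files(case_dir="."):
    """Return the sorted names of *.error files in case_dir that are not empty.

    Hypothetical helper, not a WIEN2k tool: it only inspects file sizes.
    """
    return sorted(
        p.name
        for p in Path(case_dir).glob("*.error")
        if p.is_file() and p.stat().st_size > 0
    )


if __name__ == "__main__":
    # Run from inside the case directory to print the error files with content.
    for name in nonempty_error_files():
        print(name)
```

An empty result would mean that no program has written an error message so far.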

Any comments will be appreciated. Thanks in advance.

Below are the error files, case.output2dn_19, case.dayfile, and :log.
--------------------dnlapw2_18/19/20.error------------------
Error in LAPW2
------------------------------------------------------------------------

---------------------case.output2dn_19------------------------
...
       KVEC(     73563) =   -19   -5    9    9.1046    1
       KVEC(     73564) =   -19   24   -9    9.1046    1
       KVEC(     73565) =   -19   24    9    9.1046    1
       KVEC(     73566) =    19  -24   -9    9.1046    1
       KVEC(     73567) =    19  -24    9    9.1046    1
       KVEC(     73568) =    19    5   -9    9.1046    1
       KVEC(     73569) =    19    5    9    9.1046    1
       KVE
------------------------------------------------------------------------

--------------------case.dayfile-----------------------------------
...
[14]   Done                          ( ( $remote $machine[$p] "cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
.lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >>
.time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
[9]    Done                          ( ( $remote $machine[$p] "cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
.lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >>
.time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
[4]    Done                          ( ( $remote $machine[$p] "cd $PWD;$t
$exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f
.lock_$lockfile[$p] ) >& .stdout2_$loop; if ( -f .stdout2_$loop )
bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >>
.time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
[4] 18809
-----------------------------------------------------------------------------

-----------------------------:log--------------------------------------------
...
Thu Feb  2 17:58:03 CST 2012> (x) lapw1 -c -dn -p -orb
Thu Feb  2 19:46:53 CST 2012> (x) lapw2 -c -up -p
Thu Feb  2 19:51:36 CST 2012> (x) sumpara -up -d
Thu Feb  2 19:52:07 CST 2012> (x) lapw2 -c -dn -p
--------------------------------------------------------------------------------

(If more information is needed, I will provide it.)

Best,

-- 
Bin Shao, Ph.D. Candidate
College of Information Technical Science, Nankai University
94 Weijin Rd. Nankai Dist. Tianjin 300071, China
Email: bshao at mail.nankai.edu.cn