[Wien] errors in lapw

Peter Blaha pblaha at theochem.tuwien.ac.at
Fri Feb 3 14:53:03 CET 2012



Clearly you should write your job script such that it divides the 36 k-points in a
"meaningful" way.
In principle you can use 36, 18, 12, 9, 6, 4, or 3 parallel jobs, but 16 is not meaningful,
because 36 k-points cannot be divided evenly among 16 jobs.
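
To illustrate why 16 jobs are a poor match for 36 k-points, here is a minimal Python sketch. It assumes an idealized model in which k-points are handed out as evenly as possible and every k-point costs the same time; this is not necessarily how the WIEN2k parallel scripts distribute them, and the function names are made up for this example.

```python
def kpoint_split(n_kpoints, n_jobs):
    """Return how many k-points each of n_jobs parallel jobs receives."""
    base, extra = divmod(n_kpoints, n_jobs)
    # The first `extra` jobs get one additional k-point.
    return [base + 1] * extra + [base] * (n_jobs - extra)


def even_job_counts(n_kpoints, max_jobs):
    """Job counts up to max_jobs that divide n_kpoints without remainder."""
    return [n for n in range(1, max_jobs + 1) if n_kpoints % n == 0]


if __name__ == "__main__":
    n_kpoints = 36
    for n_jobs in (16, 12, 9):
        shares = kpoint_split(n_kpoints, n_jobs)
        # Wall time is set by the busiest job; the other jobs idle meanwhile.
        efficiency = n_kpoints / (n_jobs * max(shares))
        print(n_jobs, "jobs:", max(shares), "k-points on the busiest job,",
              f"efficiency {efficiency:.2f}")
    print("even splits on a 16-core node:", even_job_counts(n_kpoints, 16))
```

In this model, 16 jobs finish no earlier than 12 jobs would (the busiest job handles 3 k-points in both cases), while occupying four more cores and adding to the I/O load.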

Furthermore, it seems that your cluster has problems with heavy I/O (NFS), and this is
most likely the reason for the observed high load and the crash. Thus I would
i) not use too many cores. Does one node of your cluster really have 16 cores, or is this just due
to "multithreading", so that in fact it has only 8? Do you have enough memory per node?
ii) try to use a (local) $SCRATCH directory, which reduces the NFS load. But this works only
     if your k-list and .machines file are "compatible" as mentioned above.
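
As a rough sketch of what a "compatible" setup could look like, the following Python snippet builds the text of a .machines file for k-point parallel mode, with one `1:host` line per parallel job, and checks that the number of jobs divides the number of k-points. The host name `node01` and the function names are hypothetical, and the exact set of lines your installation needs may differ, so please compare with the WIEN2k user's guide.

```python
def machines_text(hosts, jobs_per_host):
    """Build .machines content for k-point parallel runs (one line per job)."""
    lines = []
    for host in hosts:
        # "1:host" starts one k-point parallel job on that host;
        # the leading 1 is its relative weight.
        lines.extend(f"1:{host}" for _ in range(jobs_per_host))
    lines.append("granularity:1")
    return "\n".join(lines) + "\n"


def check_compatible(n_kpoints, n_jobs):
    """True if every parallel job gets the same number of k-points."""
    return n_kpoints % n_jobs == 0


if __name__ == "__main__":
    hosts = ["node01"]       # hypothetical host name
    jobs_per_host = 12       # 36 k-points -> 3 per job
    n_jobs = len(hosts) * jobs_per_host
    if not check_compatible(36, n_jobs):
        raise SystemExit("job count does not divide the number of k-points")
    print(machines_text(hosts, jobs_per_host), end="")
```

In a PBS job script the host list would normally come from the file named in $PBS_NODEFILE rather than being hard-coded, and $SCRATCH would be set to a directory on the node's local disk before the calculation starts.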

It also seems to be a fairly large calculation (lapw1 took nearly 2 h), so you may either need MPI
or you should not use all cores on one node of your cluster because of memory restrictions.
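
To check on a Linux node whether the 16 cores are physical or only logical, and how much memory each job could get, something like the following Python sketch may help. It assumes the usual x86 layout of /proc/cpuinfo (with `physical id` and `core id` fields) and of /proc/meminfo; on other platforms these fields can be missing, in which case the physical count comes out as 0. The function names are made up for this example.

```python
from pathlib import Path


def count_cores(cpuinfo_text):
    """Return (logical, physical) core counts from /proc/cpuinfo-style text."""
    logical = 0
    physical = set()
    phys_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            phys_id = value
        elif key == "core id":
            # A (socket, core) pair identifies one physical core.
            physical.add((phys_id, value))
    return logical, len(physical)


def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / 1024**2  # value is given in kB
    raise ValueError("MemTotal not found")


if __name__ == "__main__":
    cpuinfo, meminfo = Path("/proc/cpuinfo"), Path("/proc/meminfo")
    if cpuinfo.exists() and meminfo.exists():
        logical, physical = count_cores(cpuinfo.read_text())
        total = mem_total_gib(meminfo.read_text())
        print(f"{logical} logical / {physical} physical cores, {total:.1f} GiB RAM")
        if physical:
            print(f"about {total / physical:.1f} GiB per job "
                  "at one job per physical core")
```

If the logical count is twice the physical one, the node most likely shows 16 "cores" only because of multithreading, and running 16 lapw jobs on it would oversubscribe it.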


On 03.02.2012 13:56, Bin Shao wrote:
> Dear all,
>
> I am running WIEN2k 11.1 on a cluster with CentOS 6 under a PBS queuing system. The job is submitted in k-point parallel mode and the total of 36 k-points is divided among 16 CPUs.
> But some errors occur in lapw2, and the dnlapw2_18/19/20.error files are not empty. At the same time, the job in the PBS system seems dead and cannot be killed by the PBS
> command. The administrator checked the computing node, and the command top shows that the node is experiencing a very heavy load, above 40. Further, ps aux shows that there are 16 lapw2
> processes, but they are not running, or rather are suspended. The jobs caused a heavy load and triggered the self-protection mechanism of the OS, which automatically suspends any running process,
> including ssh logins, except for the root account.
>
> Any comments will be appreciated, and thanks in advance.
>
> The following are the error files and case.dayfile.
> --------------------dnlapw2_18/19/20.error------------------
> Error in LAPW2
> ------------------------------------------------------------------------
>
> ---------------------case.output2dn_19------------------------
> ...
>         KVEC(     73563) =   -19   -5    9    9.1046    1
>         KVEC(     73564) =   -19   24   -9    9.1046    1
>         KVEC(     73565) =   -19   24    9    9.1046    1
>         KVEC(     73566) =    19  -24   -9    9.1046    1
>         KVEC(     73567) =    19  -24    9    9.1046    1
>         KVEC(     73568) =    19    5   -9    9.1046    1
>         KVEC(     73569) =    19    5    9    9.1046    1
>         KVE
> ------------------------------------------------------------------------
>
> --------------------case.dayfile-----------------------------------
> ...
> [14]   Done                          ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop;
> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
> [9]    Done                          ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop;
> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
> [4]    Done                          ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout2_$loop;
> if ( -f .stdout2_$loop ) bashtime2csh.pl_lapw .stdout2_$loop > .temp2_$loop; grep \% .temp2_$loop >> .time2_$loop; grep -v \% .temp2_$loop | perl -e "print stderr <STDIN>" )
> [4] 18809
> -----------------------------------------------------------------------------
>
> -----------------------------:log--------------------------------------------
> ...
> Thu Feb  2 17:58:03 CST 2012> (x) lapw1 -c -dn -p -orb
> Thu Feb  2 19:46:53 CST 2012> (x) lapw2 -c -up -p
> Thu Feb  2 19:51:36 CST 2012> (x) sumpara -up -d
> Thu Feb  2 19:52:07 CST 2012> (x) lapw2 -c -dn -p
> --------------------------------------------------------------------------------
>
> (If more information is needed, I will provide.)
>
> Best,
>
> --
> Bin Shao, Ph.D. Candidate
> College of Information Technical Science, Nankai University
> 94 Weijin Rd. Nankai Dist. Tianjin 300071, China
> Email: bshao at mail.nankai.edu.cn

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------

