[Wien] problem in parallel mode calculation
Gavin Abo
gsabo at crimson.ua.edu
Tue Mar 14 04:25:05 CET 2017
The .machines file looks fine to me, but one of the others might see
something that I didn't notice (besides the WIEN2k command not being
there at the bottom of the file - likely missed in the copy and paste).
The main problem seems to be the "bash: lapw1: command not found", unless
something happened earlier that is not shown. Tracking down parallel
error messages is more complicated than for a serial run. A serial
calculation can write its standard output and error straight to the
terminal on a desktop, whereas a parallel calculation on a cluster with a
queue system can put them in a standard output file (-o) and a standard
error file (-e), or in a combined output/error file (-j), with user
specified name(s) [1,2]. They can also be written to hidden dot files
like .time* or .stdout*, as mentioned before [3,4,5].
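As a rough sketch of the queue options mentioned above, assuming a Torque/PBS system like the one in the quoted job file, the top of the job script might contain something like the following. The file names are placeholders I made up, not something taken from your setup.

```shell
# Standard output of the job goes to this file (placeholder name)
#PBS -o wien2k_job.out
# Standard error of the job goes to this file (placeholder name)
#PBS -e wien2k_job.err
# Alternative to the two directives above: merge standard error into
# standard output (left disabled here)
# #PBS -j oe

# After the job ends, list any hidden dot files left in the case directory
ls -la .time* .stdout* 2>/dev/null
```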
The "lapw1: command not found" might be because $WIENROOT didn't get
added to the PATH on one of the nodes
[http://www.supercluster.org/pipermail/torqueusers/2010-March/010143.html].
Did you try checking whether the path to WIEN2k is in the PATH, for
example in PBS_O_PATH as shown by qstat -f jobid
[http://stackoverflow.com/questions/21248406/sleep-command-not-found-in-torque-pbs-but-works-in-shell]?
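One way to see what the job itself gets is to add a few lines to the job script after "module load wien2k". This is only a sketch: path_contains is a helper name I made up for illustration, and it assumes the module sets WIENROOT.

```shell
# path_contains DIR PATHLIST: succeed if DIR is one of the colon-separated
# entries of PATHLIST (hypothetical helper, not part of WIEN2k).
path_contains() {
    [ -n "$1" ] || return 1          # an empty directory name never matches
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report what this shell sees.
if path_contains "${WIENROOT:-}" "$PATH"; then
    echo "WIENROOT is in PATH: $WIENROOT"
else
    echo "WIENROOT (${WIENROOT:-unset}) is NOT in PATH"
fi

# Show where lapw1 would be found, or say that it is missing.
command -v lapw1 || echo "lapw1 not found in PATH on $(uname -n)"
```

The output ends up in the job's standard output file, so it can be compared with what you get in an interactive shell on the login node.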
Did you try to ssh into all 8 nodes and see if you can see lapw1 on each
node? For example,
ssh n024
ls -l $WIENROOT/lapw1
ssh n225
ls -l $WIENROOT/lapw1
...
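The same check can be run on all nodes in one go. Below is a minimal sketch; check_nodes and the REMOTE variable are names I made up, and it assumes passwordless ssh between the nodes. Note that a command passed to ssh runs in a non-interactive shell, which may see a different environment than an interactive login, and that is presumably closer to how the parallel scripts start lapw1.

```shell
# check_nodes NODE...: run the same lapw1 check on every node given.
# REMOTE is the command used to reach a node (default: ssh); setting
# REMOTE=echo prints the commands instead of running them (dry run).
check_nodes() {
    remote=${REMOTE:-ssh}
    for node in "$@"; do
        echo "== $node =="
        # Single quotes: $WIENROOT is expanded on the remote node, not locally.
        $remote "$node" 'ls -l "$WIENROOT/lapw1"; command -v lapw1' \
            || echo "check failed on $node"
    done
}
```

With the node names from your .machines file, the call would be: check_nodes n024 n225 n220 n218 n045 n044 n043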
Above, I'm just guessing about the commands/configuration for your
system, but the administrator or helpdesk for your cluster should know
everything about your system and be able to help you much better with
resolving the command not found error.
[1] http://beige.ucs.indiana.edu/I590/node39.html
[2] https://wikis.nyu.edu/display/NYUHPC/Tutorial+-+Submitting+a+job+using+qsub
[3] http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg13598.html
[4] http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg14148.html
[5] http://zeus.theochem.tuwien.ac.at/pipermail/wien/2017-March/026109.html
On 3/13/2017 1:25 PM, shaymlal dayananda wrote:
> Dear developers and users
>
> I was trying to do a volume optimization and scf calculation with spin
> polarization in parallel mode. But my both the jobs crashes and I got
> the following error file. However both cases run correctly when
> parallel mode is removed.
> ............................................................................
> 'LAPW2' - can't open unit: 30
> 'LAPW2' - filename: case.energyup_1
> ** testerror: Error in Parallel LAPW2
> .................................................................................
> Also in STDOUT , I see the following particular errors. (
>
> .......................................................................
> bash: lapw1: command not found
> ...
> ....
> .....
> FERMI - Error
> grep: *scf1dn*: No such file or directory
> 0.381u 0.507s 1:12.66 1.2% 0+0k 128+1736io 1pf+0w
> Test-TiC-VOl-parallel.scf1dn_1: No such file or directory.
> .............................................................................
>
>
> I copied my machine file and the job file here. But I think this is
> not correct and I am not sure whether I needs to have lines for lapw2
> and lapwsp separately. Any help to get corrected this is highly
> appreciated.
>
> ".machnes" file
> .............................
> #
> lapw0:n024 n225 n220 n218 n045 n044 n043 n043
> 1:n024
> 1:n225
> 1:n220
> 1:n218
> 1:n045
> 1:n044
> 1:n043
> 1:n043
> granularity:1
> extrafine:1
>
> ......................................................
>
> job file is copied below.
>
>
> # example for 8 nodes
> #PBS -l procs=8
> #PBS -l pmem=2048mb
> #PBS -l walltime=4:00:00
>
> module load wien2k
>
> # change into your working directory
> cd $PBS_O_WORKDIR
> #start creating .machines
> cat $PBS_NODEFILE |cut -c1-6 >.machines_current
> aa=`cat .machines_current | wc -l`
> echo '#' > .machines
>
> # example for an MPI parallel lapw0
> echo -n 'lapw0:' >> .machines
> i=1
> while [ $i -lt $aa ]
> do
> echo -n `cat $PBS_NODEFILE |head -$i | tail -1` ' ' >>.machines
> i=$((i+1))
> done
> echo `cat $PBS_NODEFILE |head -$i|tail -1` ' ' >>.machines
>
> #example for k-point parallel lapw1/2
> i=1
> while [ $i -le $aa ]
> do
> echo -n '1:' >>.machines
> head -$i .machines_current |tail -1 >> .machines
> i=$((i+1))
> done
>
> echo 'granularity:1' >>.machines
> echo 'extrafine:1' >>.machines
>
> #define here your WIEN2k command
>
>
> ....................................................................
>
>
> Thank you
>
> Chami