[Wien] problem in k-point parallel job at distributed file system
Peter Blaha
pblaha at theochem.tuwien.ac.at
Sat Aug 19 09:08:32 CEST 2006
Your script cannot work!
As stated in the UG, you need a common NFS-mounted directory on all nodes,
and your "working directory" must be on this NFS drive (the files must be
accessible on all nodes under the same path name). Most of the
reads/writes of WIEN2k are very short, with the exception of a few large
files like case.vector* or case.help*.
If you define a SCRATCH variable, then these large files will be
redirected to the path given in $SCRATCH. Of course, such a path must
exist on all nodes (but it should be a local (and different) disk on each
of these nodes).
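As a quick, purely illustrative check, using the same dsh and $SGE_O_WORKDIR
that appear in your script below, you could run from within the job

   dsh ls -d $SGE_O_WORKDIR
   dsh ls -d $SCRATCH

The first must list the same NFS path on every node; the second must succeed
on every node (the path name is the same, but the disk behind it is local to
each node).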
Thus your script cannot work, because $SCRATCH/ravi1 will not exist on
all nodes; and of course you MUST NOT change into $RUN:
# making scratch directory
dsh mkdir -p $SCRATCH/ravi1
RUN=$SCRATCH/ravi1
# Goto scratch
cd $RUN
In addition, the number of k-points and the number of nodes must fit
together (#k-points / #nodes = integer); otherwise the $SCRATCH trick does
not work.
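If you want to verify this inside the job script, here is a small sketch; it
assumes the generic case.klist has one k-point per line followed by a final
END line, and that every "1:host" line in .machines starts one parallel
lapw1 job:

   nk=$(( $(wc -l < case.klist) - 1 ))   # k-points; assumes the last line is END
   np=$(grep -c '^1:' .machines)         # number of parallel lapw1 jobs
   if [ $(( nk % np )) -ne 0 ]; then
      echo "WARNING: $nk k-points do not divide evenly over $np jobs"
   fi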
So in essence: ask your sysadmin for the name of a scratch or temp
directory that is available on all nodes, define in your job script
something like
export SCRATCH=/tmp
and cd into your working directory (not $SCRATCH!).
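Applied to the script quoted below, a minimal sketch could look like the
following. It is only a sketch: it keeps your SGE directives and file names,
assumes /tmp really is a node-local disk on your cluster (ask your sysadmin
for the proper name), and copies the starting densities (105k.clm*) to the
case name ravi1.* only once, in the working directory:

   #!/bin/sh
   #$ -pe mpi 5
   #$ -l s_rt=50:0:0
   #$ -P kjemi
   #$ -l s_vmem=2000M
   #$ -N ravi1
   . /site/bin/jobsetup

   PATH=$PATH:$HOME/lib/wien2k:.

   # node-local scratch for the large files (case.vector*, case.help*);
   # /tmp is only a placeholder for whatever local disk exists on every node
   export SCRATCH=/tmp

   # stay in the NFS-mounted working directory; do NOT cd into $SCRATCH
   cd $SGE_O_WORKDIR

   # starting densities are copied once, here in the common directory
   cp 105k.clmsum ravi1.clmsum
   cp 105k.clmup  ravi1.clmup
   cp 105k.clmdn  ravi1.clmdn

   # build .machines from the hosts SGE assigned to this job
   echo "granularity:1"  > .machines
   echo "extrafine:1"   >> .machines
   sed 's/com/1:com/g' $TMPDIR/machines >> .machines

   runsp_lapw -so -p -cc 0.0001
   save_lapw test1
   # no mv back needed: everything is already in $SGE_O_WORKDIR

With this layout lapw1/lapw2 write the big case.vector* files to the local
/tmp of each node, while all other (small) files live in the common NFS
directory under the same path on every node.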
>
> Hello,
>
> I have a problem with job crashes in a k-point parallel job on a distributed file
> system when I assign a local scratch disk on each node. My job script is
>
> xxxxxxx
> #!/bin/sh
> #$ -pe mpi 5
> #$ -l s_rt=50:0:0
> #$ -P kjemi
> #$ -l s_vmem=2000M
> #$ -N ravi1
> # Setting up your job-environment
> . /site/bin/jobsetup
>
> # Setting some variables.
> PATH=$PATH:$HOME/lib/wien2k:.
> WORK=$SGE_O_WORKDIR
>
> # making scratch directory
> dsh mkdir -p $SCRATCH/ravi1
> RUN=$SCRATCH/ravi1
>
> # Goto scratch
> cd $RUN
>
> #Create .machines file
> echo "granularity:1" > .machines
> echo "extrafine:1" >> .machines
> sed 's/com/1:com/g' $TMPDIR/machines >> .machines
> # Copy input files to common scratch
> dsh cp $WORK/*.in* $RUN
> dsh cp $WORK/*.struct $RUN
> dsh cp $WORK/105k.clmsum $RUN/ravi1.clmsum
> dsh cp $WORK/105k.clmup $RUN/ravi1.clmup
> dsh cp $WORK/105k.clmdn $RUN/ravi1.clmdn
> dsh cp $WORK/*.dmat* $RUN/
> dsh cp $WORK/*.kgen* $RUN
> dsh cp $WORK/*.klist* $RUN
> dsh cp $WORK/*.rsp* $RUN
> dsh cp $WORK/*.clm* $RUN
> ###
> runsp_lapw -so -p -cc 0.0001
> save_lapw test1
> mv test1* $WORK
> ###############################
> #END OF SCRIPT
>
> It created the following .machines file:
>
> xxxxxx
> granularity:1
> extrafine:1
> 1:compute-1-34
> 1:compute-1-34
> 1:compute-1-16
> 1:compute-1-29
> 1:compute-1-29
> xxxxx
>
> This job created $SCRATCH/ravi1 at
> /work/43659.undefined.compute-1-16.kjemi.d/ravi1/
>
> in all three nodes (i.e. 34, 16, 29). But the job ran on only two CPUs
> of compute-1-34. Both compute-1-16 and compute-1-29 were idle. So
> the error output shows
>
> ###
> LAPW0 END
> LAPW1 - Error
> LAPW1 - Error
> LAPW1 - Error
> LAPW1 END
> LAPW1 END
> forrtl: severe (24): end-of-file during read, unit 50, file
> /work/43659.undefined.compute-1-34.kjemi.d/ravi1/ravi1.energysoup_1
> Image              PC                Routine      Line      Source
> lapwdmc            00000000004BC990  Unknown      Unknown   Unknown
> ...
> ..
>
> #####
>
> I am unable to solve the problem. I hope somebody can modify the above script
> so that it works correctly.
>
> Best regards
> Ravi
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------