[Wien] problem in k-point parallel job at distributed file system
Peter Blaha
pblaha at theochem.tuwien.ac.at
Sat Aug 19 09:08:32 CEST 2006
Your script cannot work!
As stated in the UG, you need a common NFS-mounted directory on all nodes,
and your "working directory" must be on this NFS drive (the files must be
accessible on all nodes under the same path name). Most of the
reads/writes of WIEN2k are very short, with the exception of a few large
files like case.vector* or case.help*.
If you define a SCRATCH variable, then these large files will be
redirected to the path given in $SCRATCH. Of course, such a path must
exist on all nodes (but it should be a local (and different) disk on each
of these nodes).
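As a quick, purely illustrative check, using the same dsh and $SGE_O_WORKDIR
that appear in your script below, you could run from within the job

   dsh ls -d $SGE_O_WORKDIR
   dsh ls -d $SCRATCH

The first must list the same NFS path on every node; the second must succeed
on every node (the path name is the same, but the disk behind it is local to
each node).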
Thus your script cannot work, because $SCRATCH/ravi1 will not exist on
all nodes; and of course you MUST NOT change into $RUN:
# making scratch directory
dsh mkdir -p $SCRATCH/ravi1
RUN=$SCRATCH/ravi1
# Goto scratch
cd $RUN
In addition, the number of k-points and the number of nodes must fit
together (#k-points / #nodes = integer); otherwise the $SCRATCH trick does
not work.
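If you want to verify this inside the job script, here is a small sketch; it
assumes the generic case.klist has one k-point per line followed by a final
END line, and that every "1:host" line in .machines starts one parallel
lapw1 job:

   nk=$(( $(wc -l < case.klist) - 1 ))   # k-points; assumes the last line is END
   np=$(grep -c '^1:' .machines)         # number of parallel lapw1 jobs
   if [ $(( nk % np )) -ne 0 ]; then
      echo "WARNING: $nk k-points do not divide evenly over $np jobs"
   fi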
So in essence: ask your sysadmin for the name of a scratch or temp
directory that is available on all nodes, define in your job script
something like
export SCRATCH=/tmp
and cd into your working directory (not $SCRATCH!).
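Applied to the script quoted below, a minimal sketch could look like the
following. It is only a sketch: it keeps your SGE directives and file names,
assumes /tmp really is a node-local disk on your cluster (ask your sysadmin
for the proper name), and copies the starting densities (105k.clm*) to the
case name ravi1.* only once, in the working directory:

   #!/bin/sh
   #$ -pe mpi 5
   #$ -l s_rt=50:0:0
   #$ -P kjemi
   #$ -l s_vmem=2000M
   #$ -N ravi1
   . /site/bin/jobsetup

   PATH=$PATH:$HOME/lib/wien2k:.

   # node-local scratch for the large files (case.vector*, case.help*);
   # /tmp is only a placeholder for whatever local disk exists on every node
   export SCRATCH=/tmp

   # stay in the NFS-mounted working directory; do NOT cd into $SCRATCH
   cd $SGE_O_WORKDIR

   # starting densities are copied once, here in the common directory
   cp 105k.clmsum ravi1.clmsum
   cp 105k.clmup  ravi1.clmup
   cp 105k.clmdn  ravi1.clmdn

   # build .machines from the hosts SGE assigned to this job
   echo "granularity:1"  > .machines
   echo "extrafine:1"   >> .machines
   sed 's/com/1:com/g' $TMPDIR/machines >> .machines

   runsp_lapw -so -p -cc 0.0001
   save_lapw test1
   # no mv back needed: everything is already in $SGE_O_WORKDIR

With this layout lapw1/lapw2 write the big case.vector* files to the local
/tmp of each node, while all other (small) files live in the common NFS
directory under the same path on every node.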
>
> Hello,
>
> I have a problem with job crashes in a k-point parallel job on a distributed file
> system when I assign a local scratch disk on each node. My job script is
>
> xxxxxxx
> #!/bin/sh
> #$ -pe mpi 5
> #$ -l s_rt=50:0:0
> #$ -P kjemi
> #$ -l s_vmem=2000M
> #$ -N ravi1
> # Setting up your job-environment
> . /site/bin/jobsetup
>
> # Setting some variables.
> PATH=$PATH:$HOME/lib/wien2k:.
> WORK=$SGE_O_WORKDIR
>
> # making scratch directory
> dsh mkdir -p $SCRATCH/ravi1
> RUN=$SCRATCH/ravi1
>
> # Goto scratch
> cd $RUN
>
> #Create .machines file
> echo "granularity:1" > .machines
> echo "extrafine:1" >> .machines
> sed 's/com/1:com/g' $TMPDIR/machines >> .machines
> # Copy input files to common scratch
> dsh cp $WORK/*.in* $RUN
> dsh cp $WORK/*.struct $RUN
> dsh cp $WORK/105k.clmsum $RUN/ravi1.clmsum
> dsh cp $WORK/105k.clmup $RUN/ravi1.clmup
> dsh cp $WORK/105k.clmdn $RUN/ravi1.clmdn
> dsh cp $WORK/*.dmat* $RUN/
> dsh cp $WORK/*.kgen* $RUN
> dsh cp $WORK/*.klist* $RUN
> dsh cp $WORK/*.rsp* $RUN
> dsh cp $WORK/*.clm* $RUN
> ###
> runsp_lapw -so -p -cc 0.0001
> save_lapw test1
> mv test1* $WORK
> ###############################
> #END OF SCRIPT
>
> It created the following .machines file:
>
> xxxxxx
> granularity:1
> extrafine:1
> 1:compute-1-34
> 1:compute-1-34
> 1:compute-1-16
> 1:compute-1-29
> 1:compute-1-29
> xxxxx
>
> This job created $SCRATCH/ravi1 at
> /work/43659.undefined.compute-1-16.kjemi.d/ravi1/
>
> in all three nodes (i.e. 34, 16, 29). But the job ran on only two CPUs
> of compute-1-34. Both compute-1-16 and compute-1-29 were idle. So
> the error output shows
>
> ###
> LAPW0 END
> LAPW1 - Error
> LAPW1 - Error
> LAPW1 - Error
> LAPW1 END
> LAPW1 END
> forrtl: severe (24): end-of-file during read, unit 50, file
> /work/43659.undefined.compute-1-34.kjemi.d/ravi1/ravi1.energysoup_1
> Image              PC                Routine      Line      Source
> lapwdmc            00000000004BC990  Unknown      Unknown   Unknown
> ...
> ..
>
> #####
>
> I am unable to solve the problem. I hope somebody can modify the above script
> so that it works correctly.
>
> Best regards
> Ravi
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------