[Wien] problem with a k-point parallel job on a distributed file system

Ravindran Ponniah ravindran.ponniah at kjemi.uio.no
Fri Aug 18 18:44:15 CEST 2006


Hello,

My k-point parallel job crashes on a distributed file system when I assign 
the scratch disk locally on each node. My job script is:

xxxxxxx
#!/bin/sh
#$ -pe mpi 5
#$ -l s_rt=50:0:0
#$ -P kjemi
#$ -l s_vmem=2000M
#$ -N ravi1
# Setting up your job-environment
. /site/bin/jobsetup

# Setting some variables.
PATH=$PATH:$HOME/lib/wien2k:.
WORK=$SGE_O_WORKDIR

# making scratch directory
dsh mkdir -p $SCRATCH/ravi1
RUN=$SCRATCH/ravi1
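# (dsh runs the mkdir on every node of the job, so the same path is
#  created in each node's local scratch area)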

# Go to the scratch directory
cd $RUN

#Create .machines file
echo "granularity:1" > .machines
echo "extrafine:1" >> .machines
sed 's/com/1:com/g' $TMPDIR/machines >> .machines
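# (the sed command prefixes every hostname in the SGE machines file with
#  "1:", so each granted slot becomes one k-point parallel entry in
#  .machines; the resulting file is listed further below)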
# Copy input files to the local scratch on each node
dsh cp $WORK/*.in* $RUN
dsh cp $WORK/*.struct $RUN
dsh cp $WORK/105k.clmsum $RUN/ravi1.clmsum
dsh cp $WORK/105k.clmup $RUN/ravi1.clmup
dsh cp $WORK/105k.clmdn $RUN/ravi1.clmdn
dsh cp $WORK/*.dmat* $RUN/
dsh cp $WORK/*.kgen* $RUN
dsh cp $WORK/*.klist* $RUN
dsh cp $WORK/*.rsp* $RUN
dsh cp $WORK/*.clm* $RUN
###
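# run the spin-polarized SCF cycle in k-point parallel mode (-p), including
# spin-orbit coupling (-so), with charge convergence criterion 0.0001 (-cc)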
runsp_lapw -so -p -cc 0.0001
save_lapw test1
mv test1* $WORK
###############################
#END OF SCRIPT

It created the following .machines file:

xxxxxx
granularity:1
extrafine:1
1:compute-1-34
1:compute-1-34
1:compute-1-16
1:compute-1-29
1:compute-1-29
xxxxx

This job created $SCRATCH/ravi1 at
/work/43659.undefined.compute-1-16.kjemi.d/ravi1/
on all three nodes (i.e. 34, 16, 29). However, the job ran on only two CPUs, 
both on compute-1-34; compute-1-16 and compute-1-29 were idle. The error 
output shows:

###
  LAPW0 END
LAPW1 - Error
LAPW1 - Error
LAPW1 - Error
  LAPW1 END
  LAPW1 END
forrtl: severe (24): end-of-file during read, unit 50, file 
/work/43659.undefined.compute-1-34.kjemi.d/ravi1/ravi1.energysoup_1
Image              PC                Routine            Line        Source
lapwdmc            00000000004BC990  Unknown               Unknown     Unknown
  ...
..

#####

I have not been able to solve this problem. I hope somebody can modify the 
above script so that it works correctly.
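
In case it helps with the diagnosis, this is the kind of check I could run 
after the crash (reusing dsh the same way as in the script, and assuming 
$SCRATCH still points to the job's scratch area) to see what each node 
actually holds in its local scratch:

###
# assuming dsh and $SCRATCH behave as in the job script above
# list the job's scratch directory on every node
dsh ls -l $SCRATCH/ravi1
# look for the file that the failing read refers to
dsh ls -l $SCRATCH/ravi1/ravi1.energysoup_1
###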

Best regards
Ravi

