[Wien] A problem in a parallel execution of WIEN2k_8.1
oyama
oyama at murata.co.jp
Fri Jan 25 02:48:45 CET 2008
Dear Peter,
I ran a test with a new .machines file using the recommended
syntax (1:xps01 xps01 ...).
Unfortunately, the same error message appeared in STDOUT.
I also ran the -xf test, and its output to STDOUT is attached to
this mail.
However, I do not know what to do about this message.
I would appreciate further advice.
Thanks,
Takashi
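
Editorial aside: the two .machines notations that appear in this thread seem to differ only in how repeated hosts are written (`1:xps01:2 xps02:2` versus `1:xps01 xps01 xps02 xps02`). Below is a minimal Python sketch of that text rewrite. It assumes that only lines beginning with an integer weight are host lines and that everything else (for example `granularity:1`) should be left alone. The function name `expand_machines_line` is hypothetical and not part of WIEN2k.

```python
def expand_machines_line(line: str) -> str:
    """Rewrite 'w:host:n ...' as 'w:host host ...'; leave other lines untouched."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line
    head, sep, rest = stripped.partition(":")
    # Assumption: only lines that start with an integer weight list hosts;
    # keyword lines such as 'granularity:1' are returned unchanged.
    if not sep or not head.isdigit():
        return line
    hosts = []
    for token in rest.split():
        name, _, count = token.partition(":")
        # 'xps01:2' becomes two entries; a bare 'xps01' stays a single entry.
        hosts.extend([name] * (int(count) if count.isdigit() else 1))
    return f"{head}:{' '.join(hosts)}"


# The .machines file quoted in this thread, one string per line.
original = [
    "1:xps01:2 xps02:2",
    "1:xps03:2 xps04:2",
    "1:xps05:2 xps06:2",
    "1:xps07:2 xps08:2",
    "granularity:1",
]
for entry in original:
    print(expand_machines_line(entry))
```

This only rewrites the text of the file; whether both notations are handled identically by the WIEN2k scripts is exactly what the test above was meant to check.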
Peter Blaha wrote:
>
> I haven't used this syntax for a long time (although it is supposed to work and
> is a "correct" .machines file). Please test and change .machines to:
>
> 1:xps01 xps01 xps02 xps02
> ....
>
> Furthermore do the -xf test. The message about "remotemachine" is clearly
> from the lapw2para csh-script and has nothing to do with compilation.
>
> oyama schrieb:
> > Dear Peter,
> >
> > I am running both k-point parallel and mpi-parallel.
> > My .machines file is below:
> > --
> > 1:xps01:2 xps02:2
> > 1:xps03:2 xps04:2
> > 1:xps05:2 xps06:2
> > 1:xps07:2 xps08:2
> > granularity:1
> > --
> > I did not set "shared-memory machine" during siteconfig_lapw.
> > Could this be the cause of an incorrect compilation?
> >
> > Thanks
> >
> > Takashi
> >
> >
> >
> >
> >
> > Peter Blaha wrote:
> >> We need much more info.
> >>
> >> Are you running "k-point parallel", "mpi-parallel", or both together?
> >> Please post your .machines file.
> >> Did you set "shared-memory machine" during siteconfig?
> >>
> >> Change the first line of lapw2para_lapw to use the -xf switch instead of -f.
> >> This gives you lots of debugging info.
> >>
> >> oyama schrieb:
> >>> Dear all:
> >>>
> >>> I have a problem with the parallel execution of lapw2. It looks like lapw0
> >>> and lapw1 complete in the parallel environment without any problem,
> >>> but when the run proceeds to lapw2, it aborts with an error message:
> >>> --
> >>> LAPW2 - FERMI; weighs written
> >>> remotemachine: Undefined variable.
> >>> remotemachine: Undefined variable.
> >>> remotemachine: Undefined variable.
> >>> remotemachine: Undefined variable.
> >>> cp: cannot stat `.in.tmp': No such file or directory
> >>> rm: cannot remove `.in.tmp': No such file or directory
> >>> rm: cannot remove `.in.tmp1': No such file or directory
> >>> --
> >>>
> >>> I believe the hardware and mpich are properly configured; in fact, other
> >>> programs run in parallel mode without any problem. So I suppose the
> >>> problem originates from a failed compilation of wien2k or from a
> >>> parallel-environment setting specific to wien2k.
> >>>
> >>> Any suggestion or comment is appreciated. Of course, a comment that
> >>> points out what I need to examine further is also really appreciated.
> >>>
> >>> Thank you in advance.
> >>>
> >>> Yours sincerely,
> >>>
> >>> Takashi
> >>> _______________________________________________
> >>> Wien mailing list
> >>> Wien at zeus.theochem.tuwien.ac.at
> >>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> >> --
> >>
> >> P.Blaha
> >> --------------------------------------------------------------------------
> >> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> >> Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
> >> Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
> >> --------------------------------------------------------------------------
> >
>
--
-------------- next part --------------
LAPW0 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
set tmp = .tmp_lapw2para.8309
set tmp2 = .tmp_lapw2para.8309_2
touch .lock_
foreach i ( .lock_* )
rm .lock_
end
onintr exit
set name = /usr/local/WIEN2k_08.1/lapw2para
set bin = /usr/local/WIEN2k_08.1
if ! ( -d /usr/local/WIEN2k_08.1 ) set bin = .
unalias rm
alias testinput if (! -e !:1 || -z !:1) goto !:2
alias testerror if (! -z !:1.error) goto error
alias sortoutput if (-f .stdout!:1) bashtime2csh.pl_lapw .stdout!:1 > .temp!:1; grep \% .temp!:1 >> .time!:1; grep -v \% .temp!:1 | perl -e "print stderr <STDIN>"
set t = time
set log = :parallel
set defmach = `hostname`
hostname
set updn
set dnup = dn
set sc
set cmplx
set eece
set eecem
set vector_split
set EECE
set remote = rsh
set init = init:
set res = residue:
set useremote = 1
set delay = 1
set sleepy = 1
set debug = 0
if ( -e /usr/local/WIEN2k_08.1/parallel_options ) then
source /usr/local/WIEN2k_08.1/parallel_options
setenv USE_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
endif
if ( 1 ) then
set mpirun = mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
else
if ( 1 ) then
set useremote = 1
endif
if ( 1 < 1 ) then
while ( 1 )
switch ( lapw2.def )
set def = lapw2
shift
breaksw
end
while ( 0 )
set exe = /usr/local/WIEN2k_08.1/lapw2
set exe = lapw2
if ( ! -e .processes || -z .processes ) goto single
echo running LAPW2 in parallel mode
echo RUNNING
if ( -e lapw2.error ) rm *lapw2*.error
if ( -e uplapw2.error ) rm *lapw2*.error
if ( -e dnlapw2.error ) rm *lapw2*.error
if ( -e lapw2_1.error ) rm *lapw2_*.error
if ( -e uplapw2_1.error ) rm *lapw2_*.error
if ( -e dnlapw2_1.error ) rm *lapw2_*.error
if ( -e .time2_1 ) rm .time2_*
if ( -e .machines.help ) rm .machines.help
set vector_split=`grep lapw2_vector_split: .machines |grep -v '#'| cut -d: -f 2`
grep lapw2_vector_split: .machines
grep -v #
cut -d: -f 2
grep -v init: .processes
grep :
grep -v residue:
set mist = `wc $tmp2 `
wc .tmp_lapw2para.8309_2
set maxproc = 4
set machine = `grep $init .processes | cut -f2 -d: | xargs`
grep init: .processes
cut -f2 -d:
xargs
if ( 16 > 4 ) then
set machine = `grep $init .processes |head -$maxproc| cut -f2 -d: | xargs`
grep init: .processes
head -4
cut -f2 -d:
xargs
endif
set lockfile = `cut -f2 -d: $tmp2 | awk '{print $1 NR}'|xargs`
cut -f2 -d: .tmp_lapw2para.8309_2
awk {print $1 NR}
xargs
set residue = `grep $res .processes|cut -f2 -d:`
grep residue: .processes
cut -f2 -d:
if ( == ) unset residue
unset residue
set number_per_job2 = `cut -f4 -d: $tmp2`
cut -f4 -d: .tmp_lapw2para.8309_2
if ( != ) then
set mach = `cut -f5 -d: $tmp2`
cut -f5 -d: .tmp_lapw2para.8309_2
if ( 0 > 0 ) echo machines: xps01 xps01 xps02 xps02 xps03 xps03 xps04 xps04 xps05 xps04 xps06 xps06 xps07 xps07 xps08 xps08
echo ** Error in Parallel LAPW2
setenv PWD `pwd|sed "s/tmp_mnt\///"`
pwd
sed s/tmp_mnt\///
setenv PWD /home/oyama/wien2k/test2
set case = /home/oyama/wien2k/test2
set case = test2
if ( test2 == ) then
if ( 0 > 0 ) echo Setting up case test2 for parallel execution
if ( 0 > 0 ) echo of LAPW2
if ( 0 > 0 ) echo
set fermi = `head -1 $case.in2$cmplx$eece|cut -c-5`
head -1 test2.in2
cut -c-5
if ( TOT == QTL ) then
if ( TOT == EFG ) then
if ( TOT == FERMI ) then
cp test2.in2 .in.tmp
echo FERMI
set len = `wc .in.tmp`
wc .in.tmp
@ len --
tail -14 test2.in2
cp .in.tmp1 test2.in2
echo -> starting Fermi on xps01 at `date`
date
touch test2.weigh_ test2.clmval_1 test2.vrespval_1 test2.help_1 test2.scf2_1
rm test2.weigh_ test2.clmval_1 test2.vrespval_1 test2.help_1 test2.scf2_1
lapw2 lapw2.def 4
LAPW2 - FERMI; weighs written
cp .in.tmp test2.in2
rm .in.tmp .in.tmp1
if ( TOT == FERMI ) then
if ( ! -z lapw2.error ) goto error
if ( 0 > 0 ) echo
if ( 0 > 0 ) echo -n creating lapw2_*.def:
set i = 1
while ( 1 < = 4 )
if ( 0 > 0 ) echo -n 1
cp lapw2.def .tmp
cat
sed -f .script .tmp
sed s/vector_1dn_1/vectordn_1/ .tmp1
sed s/vector_1up_1/vectorup_1/ .tmp2
sed s/vector_1so_1/vectorso_1/ .tmp1
sed s/energy_1up_1/energyup_1/ .tmp2
sed s/energy_1dn_1/energydn_1/ .tmp1
sed s/energy_1so_1/energyso_1/ .tmp2
sed s/energyso_1dn_1/energysodn_1/ .tmp1
sed s/energy_1dum_1/energydum_1/ .tmp2
sed s/vector_1so_1dn_1/vectorsodn_1/ .tmp1
sed s/vector_1dum_1dn_1/vectordumdn_1/ .tmp2
@ i ++
end
while ( 2 < = 4 )
if ( 0 > 0 ) echo -n 2
cp lapw2.def .tmp
cat
sed -f .script .tmp
sed s/vector_2dn_2/vectordn_2/ .tmp1
sed s/vector_2up_2/vectorup_2/ .tmp2
sed s/vector_2so_2/vectorso_2/ .tmp1
sed s/energy_2up_2/energyup_2/ .tmp2
sed s/energy_2dn_2/energydn_2/ .tmp1
sed s/energy_2so_2/energyso_2/ .tmp2
sed s/energyso_2dn_2/energysodn_2/ .tmp1
sed s/energy_2dum_2/energydum_2/ .tmp2
sed s/vector_2so_2dn_2/vectorsodn_2/ .tmp1
sed s/vector_2dum_2dn_2/vectordumdn_2/ .tmp2
@ i ++
end
while ( 3 < = 4 )
if ( 0 > 0 ) echo -n 3
cp lapw2.def .tmp
cat
sed -f .script .tmp
sed s/vector_3dn_3/vectordn_3/ .tmp1
sed s/vector_3up_3/vectorup_3/ .tmp2
sed s/vector_3so_3/vectorso_3/ .tmp1
sed s/energy_3up_3/energyup_3/ .tmp2
sed s/energy_3dn_3/energydn_3/ .tmp1
sed s/energy_3so_3/energyso_3/ .tmp2
sed s/energyso_3dn_3/energysodn_3/ .tmp1
sed s/energy_3dum_3/energydum_3/ .tmp2
sed s/vector_3so_3dn_3/vectorsodn_3/ .tmp1
sed s/vector_3dum_3dn_3/vectordumdn_3/ .tmp2
@ i ++
end
while ( 4 < = 4 )
if ( 0 > 0 ) echo -n 4
cp lapw2.def .tmp
cat
sed -f .script .tmp
sed s/vector_4dn_4/vectordn_4/ .tmp1
sed s/vector_4up_4/vectorup_4/ .tmp2
sed s/vector_4so_4/vectorso_4/ .tmp1
sed s/energy_4up_4/energyup_4/ .tmp2
sed s/energy_4dn_4/energydn_4/ .tmp1
sed s/energy_4so_4/energyso_4/ .tmp2
sed s/energyso_4dn_4/energysodn_4/ .tmp1
sed s/energy_4dum_4/energydum_4/ .tmp2
sed s/vector_4so_4dn_4/vectorsodn_4/ .tmp1
sed s/vector_4dum_4dn_4/vectordumdn_4/ .tmp2
@ i ++
end
while ( 5 < = 4 )
if ( 0 > 0 ) echo
if ( 0 > 0 ) echo
if ( 0 > 0 ) echo starting process:
echo -> starting parallel lapw2 at `date`
date
set loop = 0
set endloop = 0
set runmach =
echo files:4
while ( 0 < 4 )
set p = 1
if ( 0 && 0 ) set p = 2
while ( 1 < = 16 )
if ( 0 < 4 ) then
if ! ( -e .lock_xps011 ) then
@ loop ++
echo 1:4
if ( 0 > 0 ) echo prepare 1 on xps01
set runmach = ( xps01 )
echo xps01
if ( 0 > 1 ) echo > lapw2 lapw2_1.def on xps01
if ( 0 > 1 ) echo > lapw2 lapw2_1.def on xps01
if ( 4 == 1 ) then
if ( 0 > 1 ) echo running parallel lapw2
touch .lock_xps011
echo -n xps01
set ttt= ( `echo $mpirun | sed -e "s^_NP_^$number_per_job2[$loop]^" -e "s^_EXEC_^$WIENROOT/${exe}_mpi ${def}_$loop.def $loop^" -e "s^_HOSTS_^.machine$mach[$loop]^"` )
echo mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
sed -e s^_NP_^4^ -e s^_EXEC_^/usr/local/WIEN2k_08.1/lapw2_mpi lapw2_1.def 1^ -e s^_HOSTS_^.machine1^
if ( 1 == 1 ) then
set remote = rsh
remotemachine: Undefined variable.
else
endif
endif
if ( 0 > 1 ) echo sleeping for 1 seconds
sleep 1
hostname
jobs -l
endif
@ p ++
end
while ( 2 < = 16 )
if ( 1 < 4 ) then
if ! ( -e .lock_xps012 ) then
@ loop ++
echo 2:4
if ( 0 > 0 ) echo prepare 2 on xps01
set runmach = ( xps01 xps01 )
echo xps01 xps01
if ( 0 > 1 ) echo > lapw2 lapw2_2.def on xps01
if ( 0 > 1 ) echo > lapw2 lapw2_2.def on xps01
if ( 4 == 1 ) then
if ( 0 > 1 ) echo running parallel lapw2
touch .lock_xps012
echo -n xps01
set ttt= ( `echo $mpirun | sed -e "s^_NP_^$number_per_job2[$loop]^" -e "s^_EXEC_^$WIENROOT/${exe}_mpi ${def}_$loop.def $loop^" -e "s^_HOSTS_^.machine$mach[$loop]^"` )
echo mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
sed -e s^_NP_^4^ -e s^_EXEC_^/usr/local/WIEN2k_08.1/lapw2_mpi lapw2_2.def 2^ -e s^_HOSTS_^.machine2^
if ( 1 == 1 ) then
set remote = rsh
remotemachine: Undefined variable.
else
endif
endif
if ( 0 > 1 ) echo sleeping for 1 seconds
sleep 1
hostname
jobs -l
endif
@ p ++
end
while ( 3 < = 16 )
if ( 2 < 4 ) then
if ! ( -e .lock_xps023 ) then
@ loop ++
echo 3:4
if ( 0 > 0 ) echo prepare 3 on xps02
set runmach = ( xps01 xps01 xps02 )
echo xps01 xps01 xps02
if ( 0 > 1 ) echo > lapw2 lapw2_3.def on xps02
if ( 0 > 1 ) echo > lapw2 lapw2_3.def on xps02
if ( 4 == 1 ) then
if ( 0 > 1 ) echo running parallel lapw2
touch .lock_xps023
echo -n xps02
set ttt= ( `echo $mpirun | sed -e "s^_NP_^$number_per_job2[$loop]^" -e "s^_EXEC_^$WIENROOT/${exe}_mpi ${def}_$loop.def $loop^" -e "s^_HOSTS_^.machine$mach[$loop]^"` )
echo mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
sed -e s^_NP_^4^ -e s^_EXEC_^/usr/local/WIEN2k_08.1/lapw2_mpi lapw2_3.def 3^ -e s^_HOSTS_^.machine3^
if ( 1 == 1 ) then
set remote = rsh
remotemachine: Undefined variable.
else
endif
endif
if ( 0 > 1 ) echo sleeping for 1 seconds
sleep 1
hostname
jobs -l
endif
@ p ++
end
while ( 4 < = 16 )
if ( 3 < 4 ) then
if ! ( -e .lock_xps024 ) then
@ loop ++
echo 4:4
if ( 0 > 0 ) echo prepare 4 on xps02
set runmach = ( xps01 xps01 xps02 xps02 )
echo xps01 xps01 xps02 xps02
if ( 0 > 1 ) echo > lapw2 lapw2_4.def on xps02
if ( 0 > 1 ) echo > lapw2 lapw2_4.def on xps02
if ( 4 == 1 ) then
if ( 0 > 1 ) echo running parallel lapw2
touch .lock_xps024
echo -n xps02
set ttt= ( `echo $mpirun | sed -e "s^_NP_^$number_per_job2[$loop]^" -e "s^_EXEC_^$WIENROOT/${exe}_mpi ${def}_$loop.def $loop^" -e "s^_HOSTS_^.machine$mach[$loop]^"` )
echo mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
sed -e s^_NP_^4^ -e s^_EXEC_^/usr/local/WIEN2k_08.1/lapw2_mpi lapw2_4.def 4^ -e s^_HOSTS_^.machine4^
if ( 1 == 1 ) then
set remote = rsh
remotemachine: Undefined variable.
else
endif
endif
if ( 0 > 1 ) echo sleeping for 1 seconds
sleep 1
hostname
jobs -l
endif
@ p ++
end
while ( 5 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 6 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 7 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 8 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 9 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 10 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 11 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 12 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 13 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 14 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 15 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 16 < = 16 )
if ( 4 < 4 ) then
@ p ++
end
while ( 17 < = 16 )
end
while ( 4 < 4 )
if ( 0 > 0 ) echo
if ( 0 > 0 ) echo waiting for processes:
wait
sleep 1
set i = 1
while ( 1 < = 4 )
if ( ! -z lapw2_1.error ) goto error
goto error
cp .in.tmp test2.in2
cp: cannot stat `.in.tmp': No such file or directory
rm .in.tmp .in.tmp1
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory
echo ** LAPW2 crashed!
echo ** LAPW2 crashed at `date`
date
echo ** check ERROR FILES!
echo -----------------------------------------------------------------
echo ** testerror: Error in Parallel LAPW2
rm .tmp_lapw2para.8309_2
hostname
rm .lapw2para.8309.xps01
echo ERROR
exit 1
> stop error
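
Editorial aside: in the attached trace, every `remotemachine: Undefined variable.` line comes directly after `set remote = rsh`. As far as I understand, csh -x echoes a command only once variable substitution has succeeded, so the failing statement is presumably the next one in lapw2para, which reads `$remotemachine` before anything has set it. Below is a small Python sketch that pairs each such error with the last traced command before it. The helper name `find_undefined_variables` is hypothetical.

```python
def find_undefined_variables(trace_lines):
    """Return (variable, preceding_command) pairs for csh 'Undefined variable.' errors."""
    suffix = ": Undefined variable."
    hits = []
    previous = None  # last traced line that was not itself an error message
    for raw in trace_lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(suffix):
            hits.append((line[: -len(suffix)], previous))
        else:
            previous = line
    return hits


# A short excerpt copied from the trace in this mail.
excerpt = [
    "if ( 1 == 1 ) then",
    "set remote = rsh",
    "remotemachine: Undefined variable.",
    "else",
    "endif",
]
for variable, command in find_undefined_variables(excerpt):
    print(f"{variable} is undefined; last traced command: {command}")
```

Run over the full trace, this locates where in the script to look; it does not say why the variable is unset.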