[Wien] MPI parallelization
Griselda Garcia
ggarcia at fis.puc.cl
Thu Apr 29 19:06:42 CEST 2004
Hello, Kevin!
> Some things you can do :
> *check that the definition file is okay (you know, with the _1 for
> some files; in fact, it should look
> exactly like for k-point parallelization)
I did that at first, and uplapw1.def and uplapw1_1.def differ just
in the names of some files: case.klist vs. case.klist_1, case.outputup
vs. case.output1up_1, case.vectorup vs. case.vectorup_1, case.energyup vs.
case.energyup_1, and case.scfup vs. case.scf1up_1 ... I think that is ok.
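A quick way to confirm that only the _1 suffixes differ is to diff the two files directly. The sample files created below are placeholders standing in for the real uplapw1.def / uplapw1_1.def contents (an assumption, for illustration only):

```shell
# Compare the serial and parallel definition files; only the _1 name
# suffixes should differ. The sample lines are NOT real .def contents.
printf '5,"case.klist",   "old","formatted"\n' > uplapw1.def.sample
printf '5,"case.klist_1", "old","formatted"\n' > uplapw1_1.def.sample
diff uplapw1.def.sample uplapw1_1.def.sample || true  # diff exits 1 when files differ
```

If diff reports anything beyond the _1 name changes, the def file generation is the place to look.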
> *since the klist_1 and def_1 are there, nothing stops you from
launching the job yourself. Actually,
> lapw1para executes
>     set ttt=(`echo $mpirun | sed -e "s^_NP_^$number_per_job[$p]^" -e "s^_EXEC_^${exe}_mpi ${def}_$loop.def^" -e "s^_HOSTS_^.machine[$p]^"`)
>     (cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]) >>.time1_$loop &
> maybe from this you can work out the necessary command by yourself.
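To work the command out by hand, the same sed expansion can be replayed with plain sh scalars in place of the csh arrays. The mpirun template and the values below (18 processors, .machine1, uplapw1_1.def) are assumptions taken from this thread; substitute your own:

```shell
# Replay lapw1para's sed expansion (quoted above) with plain sh scalars.
# The mpirun template string is an assumption; use your site's real one.
mpirun='mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_'
number_per_job=18   # processors for this job
exe=lapw1
def=uplapw1         # spin-up definition file prefix
loop=1
ttt=$(echo "$mpirun" | sed -e "s^_NP_^$number_per_job^" \
                           -e "s^_EXEC_^${exe}_mpi ${def}_$loop.def^" \
                           -e "s^_HOSTS_^.machine1^")
echo "$ttt"   # -> mpirun -np 18 -machinefile .machine1 lapw1_mpi uplapw1_1.def
```

Running the echoed command by hand in the case directory takes lapw1para out of the picture, so a crash then points at lapw1_mpi or the MPI setup itself.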
I tried these but the result is the same as before ... lapw1_mpi crashed!
> *To see exactly where things fail, edit lapw1para (vi
> $WIENROOT/lapw1para), and in the first line, change
> the /bin/csh -f to /bin/csh -xf. Now run the script again, but capture
> all output with nohup (nohup x lapw1 -p). You'll now see a large file
> nohup.out containing all instructions as executed by the program.
> Check out where it died, and which cat is letting you down.
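Once the -xf run finishes, the tail of nohup.out shows the last commands executed before the crash. A stand-in trace file is created below so the snippet is self-contained; the real nohup.out will contain whatever commands lapw1para actually executed:

```shell
# Stand-in for the nohup.out produced by the csh -xf run (assumption:
# these two lines only mimic what a real trace might end with).
printf 'cat uplapw1_1.def\nLAPW1 - Error\n** check ERROR FILES!\n' > nohup.out
# The tail shows the last commands before the crash; grep jumps to errors.
tail -n 5 nohup.out
grep -n -i 'error' nohup.out
```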
I am doing this right now.
> * There are other files you may consult. E.g., :parallel, .lapw1para, etc.
I looked at these files, and they do not give much information.
[griselda at clustersvr sd_v2]$ more .lapw1para
ERROR
[griselda at clustersvr sd_v2]$ more :parallel
starting parallel lapw1 at Thu Apr 29 12:39:13 CLT 2004
** LAPW1 STOPPED at Thu Apr 29 12:39:16 CLT 2004
** check ERROR FILES!
-----------------------------------------------------------------
The strange thing I saw is in the .processes file ... lapw1 is supposed to
be running with 18 processors, but
[griselda at clustersvr sd_v2]$ more .processes
init:fisnode1
1 : fisnode1 : 4 : 18 : 1
Or should there be a similar file on each node, with just the name of the
machine changed?
Sorry about all these questions, but I do not know what else I can do.
Regards, Griselda.