[Wien] MPI parallelization

Griselda Garcia ggarcia at fis.puc.cl
Thu Apr 29 19:06:42 CEST 2004


Hello Kevin,

 > Some things you can do:
 > * Check that the definition file is okay (you know, with the _1 for
 > some files; in fact, it should look exactly like the one for
 > k-point parallelization).

I did that first. uplapw1.def and uplapw1_1.def differ only in the
names of some files: case.klist or case.klist_1, case.outputup or
case.output1up_1, case.vectorup or case.vectorup_1, case.energyup or
case.energyup_1, and case.scfup or case.scf1up_1. I think that is OK.
 
 > * Since the klist_1 and def_1 are there, nothing stops you from
 > launching the job yourself. Actually, lapw1para executes
 >
 > set ttt=(`echo $mpirun | sed -e "s^_NP_^$number_per_job[$p]^" -e "s^_EXEC_^${exe}_mpi ${def}_$loop.def^" -e "s^_HOSTS_^.machine[$p]^"`)
 > (cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]) >>.time1_$loop &
 >
 > Maybe from this you can work out the necessary command yourself.

I tried this, but the result is the same as before: lapw1_mpi crashed!
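
For reference, the sed call in lapw1para only fills three placeholders (_NP_, _EXEC_ and _HOSTS_) in the mpirun template. Below is a minimal sh sketch of that expansion, which may help when building the command by hand. The template string, the processor count and the machine file name are assumptions chosen for illustration; the real values come from your own installation and from lapw1para at run time.

```shell
#!/bin/sh
# Sketch of the placeholder substitution that lapw1para performs with sed.
# ASSUMPTION: the template below is only an illustrative value of $mpirun.
mpirun_template='mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_'

np=18                    # processes for this job (assumed)
exe='lapw1'              # base executable name; "_mpi" is appended
def='uplapw1'            # base name of the definition file
loop=1                   # job index, gives uplapw1_1.def
machinefile='.machine1'  # hypothetical machine file name

# "^" is used as the sed delimiter, as in the quoted lapw1para line.
cmd=$(printf '%s\n' "$mpirun_template" | sed \
    -e "s^_NP_^${np}^" \
    -e "s^_EXEC_^${exe}_mpi ${def}_${loop}.def^" \
    -e "s^_HOSTS_^${machinefile}^")

echo "$cmd"
# prints: mpirun -np 18 -machinefile .machine1 lapw1_mpi uplapw1_1.def
```

Running the printed command from the case directory should reproduce what lapw1para launches, provided the assumed values match the ones on the cluster.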

 > * To see exactly where things fail, edit lapw1para (vi
 > $WIENROOT/lapw1para) and, in the first line, change /bin/csh -f to
 > /bin/csh -xf. Now run the script again, but capture all output with
 > nohup (nohup x lapw1 -p). You'll now see a large file nohup.out
 > containing all instructions as executed by the program.
 > Check where it died, and which cat is letting you down.

I am doing this right now.

 > * There are other files you may consult, e.g. :parallel, .lapw1para, etc.

I looked at these files, and they do not give much information.

[griselda@clustersvr sd_v2]$ more .lapw1para
ERROR
[griselda@clustersvr sd_v2]$ more :parallel
starting parallel lapw1 at Thu Apr 29 12:39:13 CLT 2004
**  LAPW1 STOPPED at Thu Apr 29 12:39:16 CLT 2004
**  check ERROR FILES!
-----------------------------------------------------------------

The strange thing I saw is in the .processes file. lapw1 is supposed to
be running on 18 processors, but the file contains only this:
[griselda@clustersvr sd_v2]$ more .processes
init:fisnode1
1 : fisnode1 :  4 : 18 : 1

Or should there be a similar file on each node, with only the machine
name changed?

Sorry about all these questions, but I do not know what else to do.

Regards, Griselda.



