[Wien] MPI parallelization

Griselda Garcia ggarcia at fis.puc.cl
Thu Apr 29 01:04:34 CEST 2004


Dear WIEN users,

I am trying to run an MPI-parallel calculation on a PC cluster (10 PCs 
with two processors each), but the program stops at the lapw1 stage and 
I cannot figure out why. Could you please help me?

I am using the latest version of the WIEN program (I downloaded and 
installed it on April 21). I compiled it with the ifc 7.1 compiler and 
the ScaLAPACK 1.7 libraries.

My .machines file is:
granularity:1
1:fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
lapw0:fisnode1:1 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2

I launch the program with "runsp_lapw -p".

The case.dayfile is:
Calculating sd_v2 in /home/griselda/WIEN/case/sd_v2 on clustersvr
                                                                                                                          
    start       (Wed Apr 28 18:38:57 CLT 2004) with lapw0 (20/20 to go)
 >   lapw0 -p    (18:38:57) starting parallel lapw0 at Wed Apr 28 18:38:57 CLT 2004
-------- .machine1 : 17 processors
fisnode1:1
fisnode2:2
fisnode3:2
fisnode4:2
fisnode5:2
fisnode6:2
fisnode7:2
fisnode8:2
fisnode9:2
--------
77.270u 5.390s 2:19.72 59.1%    0+0k 0+0io 34627pf+0w
 >   lapw1  -c -up -p    (18:41:17) starting parallel lapw1 at Wed Apr 28 18:41:17 CLT 2004
->  starting parallel LAPW1 jobs at Wed Apr 28 18:41:17 CLT 2004
Wed Apr 28 18:41:17 CLT 2004 -> Setting up case sd_v2 for parallel execution
Wed Apr 28 18:41:17 CLT 2004 -> of LAPW1
Wed Apr 28 18:41:17 CLT 2004 ->
running LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset
Wed Apr 28 18:41:17 CLT 2004 -> klist:       4
Wed Apr 28 18:41:17 CLT 2004 -> machines:    fisnode1
Wed Apr 28 18:41:17 CLT 2004 -> procs:       1
Wed Apr 28 18:41:17 CLT 2004 -> weigh(old):  1
Wed Apr 28 18:41:17 CLT 2004 -> sumw:        1
Wed Apr 28 18:41:17 CLT 2004 -> granularity: 1
Wed Apr 28 18:41:17 CLT 2004 -> weigh(new):  4
Wed Apr 28 18:41:17 CLT 2004 -> Splitting sd_v2.klist.tmp into junks
fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
.machinetmp222
1 number_of_parallel_jobs
prepare 1 on fisnode1
Wed Apr 28 18:41:17 CLT 2004 -> Creating klist 1
waiting for all processes to complete
Wed Apr 28 18:41:19 CLT 2004 -> all processes done.
**  LAPW1 crashed!
0.100u 0.180s 0:03.33 8.4%      0+0k 0+0io 15805pf+0w
                                                                                                                          
 >   stop error


These are the things that I do not understand:
1) Why is the program trying to run lapw1 as only 1 parallel job?
2) Why does lapw1 fail to run when lapw0 runs perfectly?
3) Is the .machines file OK?
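
As a cross-check for questions 1 and 3, the following minimal Python sketch tallies what the .machines file above requests. It assumes, based only on the file and the dayfile in this message, that each line starting with a number defines one k-point parallel job and that the lapw0: line lists the hosts used for lapw0. The names MACHINES, count_procs and summarize are made up for this illustration and are not part of WIEN.

```python
# Sketch only: the line format assumed here ("<weight>:<host>:<n> ..." for a
# k-point job, "lapw0:<host>:<n> ..." for lapw0) is inferred from the
# .machines file and dayfile in this message.

MACHINES = """\
granularity:1
1:fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
lapw0:fisnode1:1 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
"""


def count_procs(hosts):
    """Sum the processor counts in 'host:n host:n ...' (n defaults to 1)."""
    total = 0
    for entry in hosts.split():
        _name, _, n = entry.partition(":")
        total += int(n) if n else 1
    return total


def summarize(text):
    """Return (processor count of each k-point job, lapw0 processor count)."""
    kpoint_jobs = []
    lapw0_procs = 0
    for line in text.splitlines():
        key, _, rest = line.strip().partition(":")
        if key == "lapw0":
            lapw0_procs = count_procs(rest)
        elif key.isdigit():
            # One numeric-weight line is treated as one k-point parallel job.
            kpoint_jobs.append(count_procs(rest))
        # Other keys (granularity, ...) are ignored in this sketch.
    return kpoint_jobs, lapw0_procs


jobs, lapw0 = summarize(MACHINES)
print("k-point parallel jobs:", len(jobs))
print("processors per job:", jobs)
print("lapw0 processors:", lapw0)
```

With the file above, this reports a single k-point job spanning 18 processors and 17 processors for lapw0, which agrees with the "1 number_of_parallel_jobs" and ".machine1 : 17 processors" lines in the dayfile. It does not by itself explain why lapw1 crashed.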

Could you please suggest how to get past this difficulty?

Thanks in advance!

Griselda




