[Wien] MPI parallelization
Griselda Garcia
ggarcia at fis.puc.cl
Thu Apr 29 01:04:34 CEST 2004
Dear WIEN users,
I am trying to run MPI-parallel calculations on a PC cluster (10 PCs
with two processors each), but the program stops at the lapw1 stage and
I cannot figure out why. Could you help me, please?
I am using the latest version of the WIEN program (I downloaded and
installed it on April 21). I compiled the program with the ifc 7.1
compiler and the ScaLAPACK 1.7 libraries.
My .machines file is:
granularity:1
1:fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2
fisnode7:2 fisnode8:2 fisnode9:2
lapw0:fisnode1:1 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2
fisnode7:2 fisnode8:2 fisnode9:2
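As a sanity check on the file above, here is a minimal Python sketch (not part of WIEN) that counts the job lines and processors it seems to describe. It rests on several assumptions: each entry wrapped by the mail client is a single line in the real file, a line starting with an integer weight describes one lapw1 job, and the lapw0: line lists the hosts used for lapw0. The name parse_machines is a hypothetical helper, not a WIEN tool.

```python
# Sketch only: the .machines syntax handled here is inferred from the file
# shown in this post, not taken from the WIEN documentation.
MACHINES = """\
granularity:1
1:fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
lapw0:fisnode1:1 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
"""


def parse_machines(text):
    """Return (jobs, lapw0_procs, settings) for a .machines-style text."""
    jobs = []         # one (weight, [(host, nproc), ...]) per weighted line
    lapw0_procs = 0   # total processors listed on the lapw0: line
    settings = {}     # any other key:value line, e.g. granularity
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition(":")
        if not key.isdigit() and key != "lapw0":
            settings[key] = rest.strip()
            continue
        hosts = []
        for item in rest.split():
            host, _, count = item.partition(":")
            hosts.append((host, int(count) if count else 1))
        if key == "lapw0":
            lapw0_procs = sum(n for _, n in hosts)
        else:
            jobs.append((int(key), hosts))
    return jobs, lapw0_procs, settings


jobs, lapw0_procs, settings = parse_machines(MACHINES)
print(len(jobs))                        # 1
print(sum(n for _, n in jobs[0][1]))    # 18
print(lapw0_procs)                      # 17
```

Under these assumptions the file contains a single weighted line spanning 18 processors and 17 processors for lapw0, which agrees with the "1 number_of_parallel_jobs" and ".machine1 : 17 processors" entries in the dayfile below.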
I launch the program with "runsp_lapw -p".
The case.dayfile is:
Calculating sd_v2 in /home/griselda/WIEN/case/sd_v2 on clustersvr
start (Wed Apr 28 18:38:57 CLT 2004) with lapw0 (20/20 to go)
> lapw0 -p (18:38:57) starting parallel lapw0 at Wed Apr 28
18:38:57 CLT 2004
-------- .machine1 : 17 processors
fisnode1:1
fisnode2:2
fisnode3:2
fisnode4:2
fisnode5:2
fisnode6:2
fisnode7:2
fisnode8:2
fisnode9:2
--------
77.270u 5.390s 2:19.72 59.1% 0+0k 0+0io 34627pf+0w
> lapw1 -c -up -p (18:41:17) starting parallel lapw1 at Wed Apr
28 18:41:17 CLT 2004
-> starting parallel LAPW1 jobs at Wed Apr 28 18:41:17 CLT 2004
Wed Apr 28 18:41:17 CLT 2004 -> Setting up case sd_v2 for parallel execution
Wed Apr 28 18:41:17 CLT 2004 -> of LAPW1
Wed Apr 28 18:41:17 CLT 2004 ->
running LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset
Wed Apr 28 18:41:17 CLT 2004 -> klist: 4
Wed Apr 28 18:41:17 CLT 2004 -> machines: fisnode1
Wed Apr 28 18:41:17 CLT 2004 -> procs: 1
Wed Apr 28 18:41:17 CLT 2004 -> weigh(old): 1
Wed Apr 28 18:41:17 CLT 2004 -> sumw: 1
Wed Apr 28 18:41:17 CLT 2004 -> granularity: 1
Wed Apr 28 18:41:17 CLT 2004 -> weigh(new): 4
Wed Apr 28 18:41:17 CLT 2004 -> Splitting sd_v2.klist.tmp into junks
fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2
fisnode7:2 fisnode8:2 fisnode9:2
.machinetmp222
1 number_of_parallel_jobs
prepare 1 on fisnode1
Wed Apr 28 18:41:17 CLT 2004 -> Creating klist 1
waiting for all processes to complete
Wed Apr 28 18:41:19 CLT 2004 -> all processes done.
** LAPW1 crashed!
0.100u 0.180s 0:03.33 8.4% 0+0k 0+0io 15805pf+0w
> stop error
These are the things that I do not understand:
1) Why is the program trying to run lapw1 as 1 parallel job?
2) Why does lapw1 not run if lapw0 runs perfectly?
3) Is the .machines file OK?
Could you please suggest how to get past this problem?
Thanks in advance!
Griselda