[Wien] MPI parallelization
Jorissen Kevin
Kevin.Jorissen at ua.ac.be
Thu Apr 29 02:43:04 CEST 2004
your machines file is fine.
The 'number of parallel jobs' reported by WIEN refers to k-point parallelization. In your case only one job will start, shared by all machines, so you get 1. If you had made a machines file containing two lines, each listing half of the machines, the k-list would have been split into two parts, each executed by half of your machines via MPI, and WIEN would say that you have two jobs, etc.
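For illustration only (hostnames are placeholders, not from your cluster), a .machines file that would split the k-list into two MPI jobs of five dual-processor nodes each might look like this:

```
granularity:1
# job 1: first half of the nodes, 2 MPI processes per node
1:node1:2 node2:2 node3:2 node4:2 node5:2
# job 2: second half of the nodes
1:node6:2 node7:2 node8:2 node9:2 node10:2
```

Each line starting with a weight ("1:") defines one k-point-parallel job, and the host:n entries on that line are the machines that run it together via MPI; with two such lines WIEN would report 2 parallel jobs.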
As to why lapw1 would not run when lapw0 would ...
* Are you sure your input is correct? (i.e., does "x lapw1" work serially?)
* Has the program been compiled correctly? As many recent e-mails show, it's lapw1 which is tricky ...
* Maybe something in the setup of your cluster affects lapw1 but not lapw0 (can't think of anything, though).
Could you confirm that lapw1 HAS actually crashed? I.e., that the partial error files contain an error message and that the output is clearly incomplete ... It seems it takes the machine about 2 seconds to crash, which is not much but enough for a simple test case.
If it hasn't, then there's probably some subtle error somewhere fooling the main script into thinking there's a problem ...
I don't think I've helped you much, but anyway ... Good luck!
Kevin.
-----Original Message-----
From: Griselda Garcia [mailto:ggarcia at fis.puc.cl]
Sent: Thu 4/29/2004 1:04 AM
To: wien at zeus.theochem.tuwien.ac.at
Cc:
Subject: [Wien] MPI parallelization
Dear WIEN users,
I am trying to run an MPI parallel calculation on a PC cluster (10 PCs with two processors each), but the program stops at the lapw1 stage and I cannot figure out why. Could you help me, please?
The version of the WIEN program is the latest one (I downloaded and installed the program on April 21). I compiled the program with the ifc 7.1 compiler and scalapack 1.7 libraries.
My .machines file is:
granularity:1
1:fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
lapw0:fisnode1:1 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
I launch the program as "runsp_lapw -p".
The case.dayfile is:
Calculating sd_v2 in /home/griselda/WIEN/case/sd_v2 on clustersvr
start (Wed Apr 28 18:38:57 CLT 2004) with lapw0 (20/20 to go)
> lapw0 -p (18:38:57) starting parallel lapw0 at Wed Apr 28 18:38:57 CLT 2004
-------- .machine1 : 17 processors
fisnode1:1
fisnode2:2
fisnode3:2
fisnode4:2
fisnode5:2
fisnode6:2
fisnode7:2
fisnode8:2
fisnode9:2
--------
77.270u 5.390s 2:19.72 59.1% 0+0k 0+0io 34627pf+0w
> lapw1 -c -up -p (18:41:17) starting parallel lapw1 at Wed Apr 28 18:41:17 CLT 2004
-> starting parallel LAPW1 jobs at Wed Apr 28 18:41:17 CLT 2004
Wed Apr 28 18:41:17 CLT 2004 -> Setting up case sd_v2 for parallel execution
Wed Apr 28 18:41:17 CLT 2004 -> of LAPW1
Wed Apr 28 18:41:17 CLT 2004 ->
running LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset
Wed Apr 28 18:41:17 CLT 2004 -> klist: 4
Wed Apr 28 18:41:17 CLT 2004 -> machines: fisnode1
Wed Apr 28 18:41:17 CLT 2004 -> procs: 1
Wed Apr 28 18:41:17 CLT 2004 -> weigh(old): 1
Wed Apr 28 18:41:17 CLT 2004 -> sumw: 1
Wed Apr 28 18:41:17 CLT 2004 -> granularity: 1
Wed Apr 28 18:41:17 CLT 2004 -> weigh(new): 4
Wed Apr 28 18:41:17 CLT 2004 -> Splitting sd_v2.klist.tmp into junks
fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
.machinetmp222
1 number_of_parallel_jobs
prepare 1 on fisnode1
Wed Apr 28 18:41:17 CLT 2004 -> Creating klist 1
waiting for all processes to complete
Wed Apr 28 18:41:19 CLT 2004 -> all processes done.
** LAPW1 crashed!
0.100u 0.180s 0:03.33 8.4% 0+0k 0+0io 15805pf+0w
> stop error
These are the things that I do not understand:
1) Why is the program trying to run lapw1 as 1 parallel job?
2) Why does lapw1 not run when lapw0 runs perfectly?
3) Is the .machines file OK?
Please, could you suggest how to get over this difficulty?
Thanks in advance!
Griselda
_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien