[Wien] MPI parallelization

Jorissen Kevin Kevin.Jorissen at ua.ac.be
Thu Apr 29 02:43:04 CEST 2004


Your .machines file is fine.
The 'number of parallel jobs' reported by WIEN refers to k-point parallelization.  In your case only one job starts, shared by all machines under MPI, so you get 1.  Had you written a .machines file with two such lines, each listing half of the machines, the k-list would have been split into two parts, each run in MPI by half of your machines, and WIEN would report two jobs.  And so on for more lines.
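
To illustrate the two-job case, here is a rough sketch of such a layout.  It assumes an even split, so fisnode9 is left out of the k-point lines, and the file name machines.twojobs.example is made up so that your real .machines is not overwritten.  The lapw0 line is copied from your file.  I have not run this on your cluster, so treat it as a starting point only.

```shell
# Hypothetical two-job layout, written to a scratch file (NOT to .machines)
# so it can be inspected before use. Host names come from the original post;
# leaving fisnode9 out of the k-point lines is an assumption made here
# to get two equal halves.
cat > machines.twojobs.example <<'EOF'
granularity:1
1:fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2
1:fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2
lapw0:fisnode1:1 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2 fisnode7:2 fisnode8:2 fisnode9:2
EOF

# Count the k-point job lines, i.e. lines that start with a numeric weight
# followed by a colon. The granularity and lapw0 lines do not match.
grep -c '^[0-9][0-9]*:' machines.twojobs.example
```

The final grep counts the k-point job lines; with this layout it prints 2, which is the number of jobs WIEN should then report.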
 
As to why lapw1 would not run when lapw0 does ...
* Are you sure your input is correct?  (i.e., does a plain x lapw1 work?)
* Has the program been compiled correctly?  As many recent e-mails show, lapw1 is the tricky one ...
* Maybe something in the setup of your cluster affects lapw1 but not lapw0 (I can't think of anything, though).
 
Could you confirm that lapw1 HAS actually crashed?  That is, do the partial error files contain an error message, and is the output clearly incomplete?  According to your dayfile, it takes the machine about 2 seconds to crash, which is not much but could be enough for a simple test case.
If it has not crashed, then there is probably some subtle error somewhere that fools the main script into thinking there is a problem ...
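
One quick way to do that check is sketched below.  It assumes the error files match *.error in the case directory; the function name show_error_files is only something I made up for this example, so adjust the pattern if your files are named differently.

```shell
# Print every *.error file in a directory, marking which ones are empty.
# Assumption: the error files match *.error in the case directory.
# show_error_files is a hypothetical helper name, not part of WIEN.
show_error_files() {
    dir="${1:-.}"                      # directory to inspect, default: current
    for f in "$dir"/*.error; do
        [ -e "$f" ] || continue        # no match: the glob stays literal, skip it
        if [ -s "$f" ]; then
            echo "== $f (non-empty) =="
            cat "$f"                   # show the message that was written
        else
            echo "== $f (empty) =="
        fi
    done
}

# Run it in the current (case) directory.
show_error_files .
```

A non-empty error file for lapw1 would confirm a real crash; if all of them are empty, the problem is more likely in the script's bookkeeping.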
 
 
I don't think I've helped you much, but anyway ...  Good luck!
 
Kevin.
 

	-----Original Message----- 
	From: Griselda Garcia [mailto:ggarcia at fis.puc.cl] 
	Sent: Thu 4/29/2004 1:04 AM 
	To: wien at zeus.theochem.tuwien.ac.at 
	Cc: 
	Subject: [Wien] MPI parallelization
	
	

	Dear WIEN users,
	
	I am trying to run an MPI parallel calculation on a PC cluster (10 PCs
	with two processors each), but the program stops at the lapw1 stage and I
	cannot figure out why. Could you help me, please?
	
	The version of the WIEN program is the latest one (I downloaded and
	installed the program on April 21). I compiled the program using the ifc
	7.1 compiler and the scalapack 1.7 libraries.
	
	My .machines file is:
	granularity:1
	1:fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2
	fisnode7:2 fisnode8:2 fisnode9:2
	lapw0:fisnode1:1 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2
	fisnode7:2 fisnode8:2 fisnode9:2
	
	I launch the program as " runsp_lapw -p "
	
	The case.dayfile is:
	Calculating sd_v2 in /home/griselda/WIEN/case/sd_v2 on clustersvr
	                                                                                                                         
	    start       (Wed Apr 28 18:38:57 CLT 2004) with lapw0 (20/20 to go)
	 >   lapw0 -p    (18:38:57) starting parallel lapw0 at Wed Apr 28
	18:38:57 CLT 2004
	-------- .machine1 : 17 processors
	fisnode1:1
	fisnode2:2
	fisnode3:2
	fisnode4:2
	fisnode5:2
	fisnode6:2
	fisnode7:2
	fisnode8:2
	fisnode9:2
	--------
	77.270u 5.390s 2:19.72 59.1%    0+0k 0+0io 34627pf+0w
	 >   lapw1  -c -up -p    (18:41:17) starting parallel lapw1 at Wed Apr
	28 18:41:17 CLT 2004
	->  starting parallel LAPW1 jobs at Wed Apr 28 18:41:17 CLT 2004
	Wed Apr 28 18:41:17 CLT 2004 -> Setting up case sd_v2 for parallel execution
	Wed Apr 28 18:41:17 CLT 2004 -> of LAPW1
	Wed Apr 28 18:41:17 CLT 2004 ->
	running LAPW1 in parallel mode (using .machines)
	Granularity set to 1
	Extrafine unset
	Wed Apr 28 18:41:17 CLT 2004 -> klist:       4
	Wed Apr 28 18:41:17 CLT 2004 -> machines:    fisnode1
	Wed Apr 28 18:41:17 CLT 2004 -> procs:       1
	Wed Apr 28 18:41:17 CLT 2004 -> weigh(old):  1
	Wed Apr 28 18:41:17 CLT 2004 -> sumw:        1
	Wed Apr 28 18:41:17 CLT 2004 -> granularity: 1
	Wed Apr 28 18:41:17 CLT 2004 -> weigh(new):  4
	Wed Apr 28 18:41:17 CLT 2004 -> Splitting sd_v2.klist.tmp into junks
	fisnode1:2 fisnode2:2 fisnode3:2 fisnode4:2 fisnode5:2 fisnode6:2
	fisnode7:2 fisnode8:2 fisnode9:2
	.machinetmp222
	1 number_of_parallel_jobs
	prepare 1 on fisnode1
	Wed Apr 28 18:41:17 CLT 2004 -> Creating klist 1
	waiting for all processes to complete
	Wed Apr 28 18:41:19 CLT 2004 -> all processes done.
	**  LAPW1 crashed!
	0.100u 0.180s 0:03.33 8.4%      0+0k 0+0io 15805pf+0w
	                                                                                                                         
	 >   stop error
	
	
	These are the things that I do not understand:
	1) Why is the program trying to run lapw1 as 1 parallel job?
	2) Why does lapw1 not run if lapw0 runs perfectly?
	3) Is the .machines file OK?
	
	Please .. could you suggest how to get over this difficulty?
	
	Thanks in advance!
	
	Griselda
	
	
	_______________________________________________
	Wien mailing list
	Wien at zeus.theochem.tuwien.ac.at
	http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
	
	
