[Wien] MPI parallelization

Jorissen Kevin Kevin.Jorissen at ua.ac.be
Thu Apr 29 18:19:23 CEST 2004


Strange.  From your previous e-mail, we learn that lapw1para enters the main k-loop (that's where it says 'now creating klist 1'), but either something goes wrong when it tries to execute the MPI command, or it never gets there.
The cat error surprises me: the only cat commands in the script are for editing the definition file, and I don't see a reason for them to crash.
 
Some things you can do:
* Check that the definition file is okay (you know, with the _1 for some files; in fact, it should look exactly as it does for k-point parallelization).
* Since the klist_1 and def_1 are there, nothing stops you from launching the job yourself.  Actually, lapw1para executes
set ttt=(`echo $mpirun | sed -e "s^_NP_^$number_per_job[$p]^" -e "s^_EXEC_^${exe}_mpi ${def}_$loop.def^" -e "s^_HOSTS_^.machine[$p]^"`)
              (cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]) >>.time1_$loop &
Maybe from this you can work out the necessary command yourself.
* To see exactly where things fail, edit lapw1para (vi $WIENROOT/lapw1para) and, in the first line, change /bin/csh -f to /bin/csh -xf.  Now run the script again, but capture all output with nohup (nohup x lapw1 -p).  You'll then get a large file nohup.out containing all commands as executed by the script.
Check where it died, and which cat is letting you down.
* Wait for a user who knows more about MPI and can help you better.  I confess that I've never actually used MPI parallelization myself; I'm just very familiar with lapw1para.
* There are other files you may consult, e.g. :parallel, .lapw1para, etc.
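
A note on the second suggestion above: the sed call only fills three placeholders in the mpirun template. Below is a minimal sketch in plain sh (not the csh of lapw1para) that reproduces the substitution by hand. The template string, the processor count and the machine file name .machine1 are assumptions for illustration; the real values come from your own WIEN2k setup. lapw1c and uplapw1 are the names that appear in the quoted session.

```shell
#!/bin/sh
# Sketch only: rebuild the command line that lapw1para hands to MPI.
# All values below are assumed for illustration; replace them with your own.
mpirun_template='mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_'  # assumed template
np=2                # assumed number of processors for this job
exe=lapw1c          # executable name without the _mpi suffix
def=uplapw1         # base name of the definition file
loop=1              # index of the klist/def pair (klist_1, def_1)
hosts=.machine1     # assumed name of the machine file for this job

# Same three substitutions as in lapw1para, with ^ as the sed delimiter.
cmd=$(echo "$mpirun_template" | sed \
    -e "s^_NP_^$np^" \
    -e "s^_EXEC_^${exe}_mpi ${def}_$loop.def^" \
    -e "s^_HOSTS_^$hosts^")

# Print the command instead of running it, so it can be inspected first.
echo "$cmd"
```

With these assumed values the script prints "mpirun -np 2 -machinefile .machine1 lapw1c_mpi uplapw1_1.def". A line of that form is what you would try by hand in the case directory.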
 
Good luck, we'll work this out,
 
Kevin.
 

	-----Original Message----- 
	From: Griselda Garcia [mailto:ggarcia at fis.puc.cl] 
	Sent: Thu 4/29/2004 2:42 PM 
	To: wien at zeus.theochem.tuwien.ac.at 
	Cc: 
	Subject: Re: [Wien] MPI parallelization
	
	

	Hello Kevin,
	
	Thanks for your reply ...
	
	 > As to why lapw1 would not run when lapw0 would ...
	 > * you're sure your input is correct? (ie, x lapw1 works?)
	Yes, I am sure because I have done the same calculation in serial
	version and it finished ok.
	
	 > * has the program been compiled correctly? As many recent e-mails show, it's lapw1 which is tricky ...
	The compilation was correct; neither errors nor warnings were reported.
	
	 > * maybe something in the setup of your cluster affects lapw1 but not lapw0 (can't think of anything, though)
	I do not know ... I will talk again with the sys. adm.
	
	 > Could you confirm that lapw1 HAS actually crashed? ie, that the partial error files contain an error
	 > message, that the output is clearly not complete ... It seems it takes the machine about 2 seconds to
	 > crash, which is not much but enough for a simple test case.
	Yes, lapw1 has actually crashed ... I do not have partial error files
	... if I run just the MPI version of lapw1, even with the machines that
	I showed in my previous mail, I get:
	
	[griselda at clustersvr sd_v2]$ lapw1c_mpi uplapw1.def
	 Using            1  processors, My ID =            0
	
	If I run the script runsp_lapw -p and check the def and error files
	in the work dir, I have:
	[griselda at clustersvr sd_v2]$ runsp_lapw -p &
	[1] 2871
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	FORTRAN STOP  LAPW0 END
	cat: No match.
	
	[1]+  Exit 1                  runsp_lapw -p
	
	[griselda at clustersvr sd_v2]$ ls *.def *.error
	lapw0.def  lapw0.error  uplapw1_1.def  uplapw1.def  uplapw1.error
	
	The cluster is configured in such a way that each node mounts the server
	home directory using NFS. Is it right to do that?
	Which things should the sys adm verify in the setup of the cluster to
	get WIEN running?
	
	Thanks a lot!!
	
	Griselda.
	
	
	_______________________________________________
	Wien mailing list
	Wien at zeus.theochem.tuwien.ac.at
	http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
	
	
