[Wien] parallel under sge environment

Oleg Rubel rubelo at tbh.net
Tue Apr 20 23:28:10 CEST 2010


I have a few suggestions (hopefully constructive) based on my experience.

Before you run Wien2k over MPI:

1) make sure that you can run "mpirun/mpiexec hostname" from an SGE script. This should give a list of all allocated hosts. If the result makes sense,

2) extend your "mpirun/mpiexec" command so that it runs on a subset of hosts. The syntax can depend on the particular MPI implementation, but the command is similar to "mpiexec -machinefile hostsfile -n 4 hostname". The hostsfile contains a subset of the allocated hosts. Say you allocate 8 hosts, host[1-8]; then this file could contain host[2-5]. Make sure that mpirun/mpiexec returns the same list of hosts as in hostsfile. Sometimes mpirun/mpiexec tends to run the first process on the master node, ignoring your hostsfile. Find a way to force it to respect the hostsfile, since this is critical for k-parallel + MPI jobs.
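
The check in step 2 can be sketched in shell as follows. This is only a sketch under assumptions: the helper names subset_hosts and same_hosts are made up for illustration, and the input is assumed to be in the SGE hostfile format (one "hostname slots queue processors" line per host, as in $PE_HOSTFILE). The mpiexec call itself is left as a comment because its options depend on your MPI implementation.

```shell
#!/bin/sh
# Sketch only: subset_hosts and same_hosts are hypothetical helper names.
# Assumes an SGE-style hostfile: "hostname slots queue processors" per line.

# Print hosts number $2 through $3 (1-based) from the SGE hostfile $1.
subset_hosts() {
    awk -v first="$2" -v last="$3" \
        'NR >= first && NR <= last { print $1 }' "$1"
}

# Succeed if files $1 and $2 name the same set of hosts
# (order and repeated entries are ignored).
same_hosts() {
    a=$(sort -u "$1")
    b=$(sort -u "$2")
    [ "$a" = "$b" ]
}

# Intended use inside the SGE job script (not executed here):
#   subset_hosts "$PE_HOSTFILE" 2 5 > hostsfile
#   mpiexec -machinefile hostsfile -n 4 hostname > reported
#   same_hosts hostsfile reported || echo "mpiexec ignored hostsfile" >&2
```

Note that "hostname" may print short names while the SGE hostfile lists fully qualified ones; in that case the names have to be normalized before comparing.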

If this works, run Wien2k step-by-step:

3) take a simple case or MPI benchmark and try to run "mpirun/mpiexec $WIENROOT/lapw1(c)_mpi lapw1.def" using SGE. If successful,

4) find a case with 2 k-points and try to run "mpiexec -machinefile hostsfile1 -n 4 $WIENROOT/lapw1(c)_mpi lapw1_1.def" and "mpiexec -machinefile hostsfile2 -n 4 $WIENROOT/lapw1(c)_mpi lapw1_2.def", splitting the allocated hosts evenly between hostsfile1 and hostsfile2. (Prepare the *def files if they are not ready, and make sure that you went through DSTART and LAPW0.) While the job is running, it is good to log in to the hosts and make sure that all processes run on their dedicated hosts.
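
The host splitting in step 4 might look like the sketch below. The function name split_hosts and the file name allhosts are hypothetical; the input is assumed to be a plain list with one host per line. Launching the two mpiexec commands in the background at the same time is my assumption about how you would test this, and lapw1_mpi should be replaced by lapw1c_mpi for a complex case.

```shell
#!/bin/sh
# Sketch only: split_hosts is a hypothetical helper name.

# Split the host list $1 (one host per line) into two halves,
# written to ${2}1 and ${2}2. With an odd number of hosts the
# second file receives the extra one.
split_hosts() {
    total=$(wc -l < "$1")
    half=$((total / 2))
    head -n "$half" "$1" > "${2}1"
    tail -n +"$((half + 1))" "$1" > "${2}2"
}

# Intended use inside the SGE job script (not executed here):
#   split_hosts allhosts hostsfile
#   mpiexec -machinefile hostsfile1 -n 4 $WIENROOT/lapw1_mpi lapw1_1.def &
#   mpiexec -machinefile hostsfile2 -n 4 $WIENROOT/lapw1_mpi lapw1_2.def &
#   wait
```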

If this works, adjust WIEN_MPIRUN in the $WIENROOT/parallel_options file according to your findings in step 2, disable ssh/rsh as suggested by Prof. Marks, and you are ready to go :)


I hope this will help,

Oleg

--
Oleg Rubel, PhD
Scientist, Thunder Bay Regional Research Institute
Adjunct Professor, Dept Physics, Lakehead University
290 Munro St, Thunder Bay, P7A 7T1, Ontario, Canada
Phone: +1-807-7663350
Fax: +1-807-3441948
E-mail: rubelo at tbh.net
Homepage: http://www.tbrri.com/~orubel/
>>> zhaoyh <yhzhao.mail at gmail.com> 04/20/10 3:29 PM >>>
Hello Prof. Blaha and Marks,

The submitting script and the error message have been attached. 

The "host" and "hosts" pe are not usable right now. The only one I can
use is mpi.

Thanks for your help.

Regards,

yonghong
On Tue, 2010-04-20 at 16:33 +0200, Peter Blaha wrote:
> Still not clear:
> 
> > "I cannot use ssh" means that this supercomputer doesn't allow users to
> > log in to the compute node directly. I have consulted the admin already.
> > He just ask me to use sge script to submit job. The attachment is the
> 
> It is "normal" that you cannot ssh to the compute node FROM the login node.
> So you will never be able to type in
>       ssh nodexxx
> but this is NOT necessary anyway!
> 
> Have you tried to adapt one of the job scripts at the faq-page of www.wien2k.at
> and after creation of the machines file, put    run_lapw -p into the sge script ??
> 
> It is not helpful to show the PWSCF script, show the WIEN2k script you have tried.
> Anyway from your script I can see:
> 
> #$ -pe mpi 160      # 4 slots (allocated among the available hosts)
> ##$ -pe host 6         # 6 slots (allocated on a single host max=8)
> ##$ -pe hosts 16       # 8 slots per host. (numbers of cores should be a multiple of 8)
> 
> Most likely you need to uncomment the last line (and comment the first one), if you do not
> want to use mpi. At least it indicates that you have different "pe" environments available.
> 
> Then you need some lines, which generates   .machines  from the nodes assigned to you.
> (See templates mentioned above, or you said, that you already have that)
> 
> 
> mpirun -np 160  pwscf -npool 16 < input > out
> 
> Instead of that line, you put     run_lapw -p
> 
> 
> My experience says, that users who cannot handle k-parallelism, will not be
> able to run mpi-parallel, because this is much more difficult.
> 
> 
> 



