[Wien] lapw2 mpi parallelization limits

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue Mar 17 09:04:32 CET 2009


Do you have TOT or FOR in case.in2(c)?
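
For reference, the TOT/FOR switch sits on the first line of case.in2(c). A typical first line looks roughly like the following (the comment text may differ between WIEN2k versions; TOT computes only the charge density, FOR additionally requests forces):

   TOT          (TOT,FOR,QTL,EFG,ALM,CLM,FERMI)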

Does it work with a machines file containing    compute-0-13 compute-0-19 ...,
i.e. when you are using different nodes (but restricting the total number to 4)?
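
In .machines terms that would be something along these lines (a sketch mirroring the 4-cpu file shown further down, just split across two nodes; untested):

lapw0: compute-0-13 compute-0-13 compute-0-19 compute-0-19
1: compute-0-13 compute-0-13 compute-0-19 compute-0-19
granularity:1
extrafine:1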

Could it be that your system does not allow a file to be opened for reading more than 4 times?
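
A quick, generic way to probe that (nothing WIEN2k-specific, just shell on the compute node; wientest.vspup is taken from your example below):

   $ ulimit -n                                          # per-process open-file limit
   $ for i in $(seq 8); do cat wientest.vspup > /dev/null & done; wait

If the concurrent readers fail, or the limit is unusually small, that would point at filesystem or shell limits rather than at lapw2_mpi.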


Scott Beardsley wrote:
> Peter Blaha wrote:
>>> We run routinely on more cpus.
> 
> OK, so it is something with my specific setup. I'm using OpenMPI (in
> case I didn't mention it before). I compiled Wien with the Pathscale
> compiler. I have dual-socket quad-core AMD processors (so, 8 cpus per
> node). I have a QLogic DDR 20Gbps interconnect.
> 
> Here is an example that fails:
> 
> $ mpirun -np 4 -machinefile .machine1 /path/to/wien/lapw2_mpi uplapw2_1.def 1
> Daemon [0,0,1] checking in as pid 10183 on host compute-0-13
> [compute-0-13.local:10183] [0,0,1] orted: received launch callback
> [compute-0-13.local:10188] *** An error occurred in MPI_Comm_split
> [compute-0-13.local:10188] *** on communicator MPI_COMM_WORLD
> [compute-0-13.local:10188] *** MPI_ERR_ARG: invalid argument of some
> other kind
> [compute-0-13.local:10188] *** MPI_ERRORS_ARE_FATAL (goodbye)
> [compute-0-13.local:10183] [0,0,1] orted_recv_pls: received message from
> [0,0,0]
> [compute-0-13.local:10183] [0,0,1] orted_recv_pls: received kill_local_procs
> mpirun noticed that job rank 0 with PID 10184 on node compute-0-13
> exited on signal 15 (Terminated).
> 3 additional processes aborted (not shown)
> [compute-0-13.local:10183] [0,0,1] orted_recv_pls: received message from
> [0,0,0]
> [compute-0-13.local:10183] [0,0,1] orted_recv_pls: received exit
> $ echo $?
> 143
> $
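
Since the failure is inside MPI_Comm_split itself, a minimal stand-alone test built with the same OpenMPI/PathScale stack and launched with the same -machinefile may help separate an MPI-installation problem from a lapw2_mpi problem. A sketch (illustrative only, not part of WIEN2k):

   /* comm_split_test.c -- minimal MPI_Comm_split check */
   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank, size;
       MPI_Comm sub;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);

       /* split COMM_WORLD into two sub-communicators by rank parity */
       MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &sub);

       printf("rank %d of %d: MPI_Comm_split ok\n", rank, size);

       MPI_Comm_free(&sub);
       MPI_Finalize();
       return 0;
   }

Compiled with mpicc and run with "mpirun -np 5 -machinefile .machine1 ./comm_split_test", this either reproduces the MPI_ERR_ARG above 4 ranks (pointing at the MPI setup) or runs cleanly (pointing back at lapw2_mpi or its input).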
> 
>>> A possible test: use a .machines file with
>>> 1:compute-0-13 compute-0-13 compute-0-13 ....
>>> (I have not tested the :8 instruction, although it should work)
> 
> I changed my .machines to look like this:
> 
> lapw0: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
> compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
> compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
> compute-0-19
> 1: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
> compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
> compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
> compute-0-19
> granularity:1
> extrafine:1
> 
> It still crashes.
> 
>>> A possible patch: lapw2para uses the ".processes" file (generated by
>>> the lapw1 step). You may want to edit it so that lapw2 uses fewer cpus.
> 
> I'm not sure of the format of these files. Here is what the .processes
> and .processes2 files look like. What do the fields mean? Do they look correct?
> 
> $ cat .processes
> init: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
> compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
> compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
> compute-0-19
> 1 : compute-0-13 :  84 : 16 : 1
> $ cat .processes2
> 1:1
> compute-0-13
> $
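
If the fourth field (the 16) is indeed the number of MPI processes requested for lapw2 -- a guess from the numbers above, not a documented fact -- then the patch suggested earlier would amount to lowering it, e.g.

   1 : compute-0-13 :  84 : 4 : 1

This is untested; the exact field meanings should be checked against the lapw2para script itself.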
> 
>>> Does the straight command
>>> mpirun -np 16 -machinefile .machine1 /path/to/wien/lapw2_mpi uplapw2_1.def 1
>>> work?
> 
> No. It dies just like with 5 cpus. I used the "-bynode -np 4" options to
> make sure it is starting on the remote node (it is). I have 8 cpus per
> machine, and even with one machine I can't seem to use more than 4 cpus
> for the lapw2 stage. Very strange.
> 
> When I make my .machines file look as follows, everything works great
> (but, of course, on only 4 cpus):
> 
> lapw0: compute-0-13 compute-0-13 compute-0-13 compute-0-13
> 1: compute-0-13 compute-0-13 compute-0-13 compute-0-13
> granularity:1
> extrafine:1
> 
> I've run strace with 5 cpus (see attached) and I think the problem
> occurs during or right after reading the wientest.vspup file. See this
> strace[1] and input file[2].
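
For reference, a per-rank trace like [1] can be captured by wrapping the binary in strace on the mpirun line (assuming strace is installed on the compute nodes; paths as in the examples above):

   $ mpirun -np 5 -machinefile .machine1 \
       strace -ff -o /tmp/lapw2_trace /path/to/wien/lapw2_mpi uplapw2_1.def 1

With -ff each rank writes its own /tmp/lapw2_trace.<pid> file, which makes it easier to see which process dies first.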
> 
> BTW, sorry moderators about the large files last time.
> 
> Scott
> --------------
> [1] http://users.cse.ucdavis.edu/~sbeards/wien-strace-5cpus.log
> [2] http://users.cse.ucdavis.edu/~sbeards/wientest.vspup

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------

