[Wien] lapw2 mpi parallelization limits
Peter Blaha
pblaha at theochem.tuwien.ac.at
Tue Mar 17 09:04:32 CET 2009
Do you have TOT or FOR in case.in2(c) ??
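(The switch is the first word on the first line of case.in2 resp. case.in2c; just for orientation, it should look something like

    TOT        (TOT,FOR,QTL,EFG,FERMI,...)

and has to read FOR when you want forces.)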
Does it work with a machines file containing compute-0-13 compute-0-19 ...; i.e.
when you are using different nodes (but restricting the total number to 4)?
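For such a test a .machines file analogous to your working 4-cpu one, just spread over the two nodes, should do (host names taken from your mails, adapt as needed):

lapw0: compute-0-13 compute-0-13 compute-0-19 compute-0-19
1: compute-0-13 compute-0-13 compute-0-19 compute-0-19
granularity:1
extrafine:1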
Could it be that your system does not allow a file to be opened for reading more than 4 times ??
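A crude way to check this (just an idea; it does not reproduce the Fortran I/O of lapw2_mpi): on that node, hold several read handles on one of the case files at the same time and look at the descriptor limit, e.g.

$ ulimit -n                                    # per-process open-file limit
$ for i in 1 2 3 4 5 6 7 8; do tail -f wientest.vspup > /dev/null & done    # 8 readers at once
$ jobs                                         # all 8 tails should still be running
$ kill %1 %2 %3 %4 %5 %6 %7 %8                 # clean up

(run in the case directory; wientest.vspup is the file your strace points to)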
Scott Beardsley wrote:
> Peter Blaha wrote:
>>> We run routinely on more cpus.
>
> OK, so it is something with my specific setup. I'm using OpenMPI (in
> case I didn't mention it before). I compiled Wien with the Pathscale
> compiler. I have dual-socket quad-core AMD processors (so, 8 cpus per
> node). I have a QLogic DDR 20Gbps interconnect.
>
> Here is an example that fails:
>
> $ mpirun -np 4 -machinefile .machine1 /path/to/wien/lapw2_mpi
> uplapw2_1.def 1
> Daemon [0,0,1] checking in as pid 10183 on host compute-0-13
> [compute-0-13.local:10183] [0,0,1] orted: received launch callback
> [compute-0-13.local:10188] *** An error occurred in MPI_Comm_split
> [compute-0-13.local:10188] *** on communicator MPI_COMM_WORLD
> [compute-0-13.local:10188] *** MPI_ERR_ARG: invalid argument of some
> other kind
> [compute-0-13.local:10188] *** MPI_ERRORS_ARE_FATAL (goodbye)
> [compute-0-13.local:10183] [0,0,1] orted_recv_pls: received message from
> [0,0,0]
> [compute-0-13.local:10183] [0,0,1] orted_recv_pls: received kill_local_procs
> mpirun noticed that job rank 0 with PID 10184 on node compute-0-13
> exited on signal 15 (Terminated).
> 3 additional processes aborted (not shown)
> [compute-0-13.local:10183] [0,0,1] orted_recv_pls: received message from
> [0,0,0]
> [compute-0-13.local:10183] [0,0,1] orted_recv_pls: received exit
> $ echo $?
> 143
> $
>
>>> A possible test: use a .machines file with
>>> 1:compute-0-13 compute-0-13 compute-0-13 ....
>>> (I have not tested the :8 instruction, although it should work)
>
> I changed my .machines to look like this:
>
> lapw0: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
> compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
> compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
> compute-0-19
> 1: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
> compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
> compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
> compute-0-19
> granularity:1
> extrafine:1
>
> It still crashes.
>
>>> A possible patch: lapw2para uses the ".processes" file (generated by the
>>> lapw1 step). You may want to edit it so that lapw2 uses fewer cpus.
>
> I'm not sure of the format of these files. Here is what the .processes
> and .processes2 files look like. What do the fields mean? Do they look correct?
>
> $ cat .processes
> init: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
> compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
> compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
> compute-0-19
> 1 : compute-0-13 : 84 : 16 : 1
> $ cat .processes2
> 1:1
> compute-0-13
> $
>
>>> Does the straight command
>>>   mpirun -np 16 -machinefile .machine1 /path/to/wien/lapw2_mpi uplapw2_1.def 1
>>> work?
>
> No. It dies just like with 5 cpus. I used the "-bynode -np 4" options to
> make sure it is starting on the remote node (it is). I have 8 cpus per
> machine, and even with one machine I can't seem to use more than 4 cpus
> for the lapw2 stage. Very strange.
>
> When I make my .machines file look as follows, everything works great
> (but, of course, on only 4 cpus):
>
> lapw0: compute-0-13 compute-0-13 compute-0-13 compute-0-13
> 1: compute-0-13 compute-0-13 compute-0-13 compute-0-13
> granularity:1
> extrafine:1
>
> I've run strace using 5 cpus and I think the problem occurs during or
> right after reading the wientest.vspup file. See the strace log[1] and
> the input file[2] linked below.
>
> BTW, sorry moderators about the large files last time.
>
> Scott
> --------------
> [1] http://users.cse.ucdavis.edu/~sbeards/wien-strace-5cpus.log
> [2] http://users.cse.ucdavis.edu/~sbeards/wientest.vspup
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------