[Wien] lapw2 mpi parallelization limits

Scott Beardsley scott at cse.ucdavis.edu
Mon Mar 16 23:16:40 CET 2009


Peter Blaha wrote:
> > We run routinely on more cpus.

OK, so it is something with my specific setup. I'm using OpenMPI (in
case I didn't mention it before), and I compiled Wien with the PathScale
compiler. The nodes have dual-socket quad-core AMD processors (so, 8
CPUs per node) and a QLogic DDR 20 Gbps interconnect.

Here is an example that fails:

$ mpirun -np 4 -machinefile .machine1 /path/to/wien/lapw2_mpi uplapw2_1.def 1
Daemon [0,0,1] checking in as pid 10183 on host compute-0-13
[compute-0-13.local:10183] [0,0,1] orted: received launch callback
[compute-0-13.local:10188] *** An error occurred in MPI_Comm_split
[compute-0-13.local:10188] *** on communicator MPI_COMM_WORLD
[compute-0-13.local:10188] *** MPI_ERR_ARG: invalid argument of some
other kind
[compute-0-13.local:10188] *** MPI_ERRORS_ARE_FATAL (goodbye)
[compute-0-13.local:10183] [0,0,1] orted_recv_pls: received message from
[0,0,0]
[compute-0-13.local:10183] [0,0,1] orted_recv_pls: received kill_local_procs
mpirun noticed that job rank 0 with PID 10184 on node compute-0-13
exited on signal 15 (Terminated).
3 additional processes aborted (not shown)
[compute-0-13.local:10183] [0,0,1] orted_recv_pls: received message from
[0,0,0]
[compute-0-13.local:10183] [0,0,1] orted_recv_pls: received exit
$ echo $?
143
$

> > A possible test: use a .machines file with
> > 1:compute-0-13 compute-0-13 compute-0-13 ....
> > (I have not tested the :8 instruction, although it should work)

I changed my .machines to look like this:

lapw0: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
compute-0-19
1: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
compute-0-19
granularity:1
extrafine:1

It still crashes.
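
For reference, the compact ":8" form Peter mentions would presumably be
written like this (I have not tried this variant yet, so treat the exact
syntax as a guess on my part):

lapw0: compute-0-13:8 compute-0-19:8
1: compute-0-13:8 compute-0-19:8
granularity:1
extrafine:1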

> > A possible patch: lapw2para uses the ".processes" file (generated by
> > lapw1 step). You may want to edit it so that lapw2 uses less cpus.

I'm not sure of the format of these files. Here is what .processes and
.processes2 look like. What do the fields mean? Does this look correct?

$ cat .processes
init: compute-0-13 compute-0-13 compute-0-13 compute-0-13 compute-0-13
compute-0-13 compute-0-13 compute-0-13 compute-0-19 compute-0-19
compute-0-19 compute-0-19 compute-0-19 compute-0-19 compute-0-19
compute-0-19
1 : compute-0-13 :  84 : 16 : 1
$ cat .processes2
1:1
compute-0-13
$

> > Does the straight command
> > mpirun -np 16 -machinefile .machine1 /path/to/wien/lapw2_mpi uplapw2_1.def 1
> >
> > work?

No. It dies just like with 5 CPUs. I used the "-bynode -np 4" options to
make sure it is starting on the remote node (it is). I have 8 CPUs per
machine, and even with a single machine I can't seem to use more than 4
CPUs for the lapw2 stage. Very strange.
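
For the record, that spread-out test was launched roughly like this
(same def file as the failing example above; the exact option order may
have differed):

$ mpirun -bynode -np 4 -machinefile .machine1 /path/to/wien/lapw2_mpi uplapw2_1.def 1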

When I make my .machines file look as follows, everything works great
(but, of course, on only 4 CPUs):

lapw0: compute-0-13 compute-0-13 compute-0-13 compute-0-13
1: compute-0-13 compute-0-13 compute-0-13 compute-0-13
granularity:1
extrafine:1

I've run strace using 5 CPUs and I think the problem occurs during or
right after reading the wientest.vspup file. See the strace log [1] and
the input file [2].
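
In case anyone wants to reproduce the trace, something along these lines
should work (the exact strace options I used may have differed; -ff -o
writes one trace file per process, suffixed with the PID):

$ mpirun -np 5 -machinefile .machine1 strace -ff -o wien-strace /path/to/wien/lapw2_mpi uplapw2_1.def 1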

BTW, sorry moderators about the large files last time.

Scott
--------------
[1] http://users.cse.ucdavis.edu/~sbeards/wien-strace-5cpus.log
[2] http://users.cse.ucdavis.edu/~sbeards/wientest.vspup

