[Wien] Large memory consumption of MPI k-point parallel version

Oleg Rubel rubel at Physik.Uni-Marburg.de
Mon Apr 14 13:38:09 CEST 2008


I realized that my problem originates from an incorrect placement of
MPI processes when a k-point parallel MPI job is run. As an example, I use
8 nodes in total, split 2x4 according to the following .machines file:

     granularity:1
     1:node119 node120 node125 node127
     1:node134 node126 node132 node121
     lapw0:node119:1 node120:1 node125:1 node127:1 node134:1 node126:1 node132:1 node121:1

The master node is node119. When I log in to this node, I see two lapw1
processes running instead of one, while node121 remains idle. It seems
that the master node receives one process from each '1:node...' line. In
my earlier report, where the nodes were split 4x4, this caused the memory
overload.
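
For reference, the placement I would expect from the .machines file above
can be tallied with a small script. This is only a sketch: the function
name is made up for illustration, and it assumes that every line starting
with a numeric weight ('1:...') describes one k-point job whose hosts each
get exactly one lapw1 MPI process.

```python
from collections import Counter

# Contents of the .machines file quoted in this post.
MACHINES = """\
granularity:1
1:node119 node120 node125 node127
1:node134 node126 node132 node121
lapw0:node119:1 node120:1 node125:1 node127:1 node134:1 node126:1 node132:1 node121:1
"""

def lapw1_hosts_per_job(text):
    """Return one host list per k-point job line ('<weight>:host host ...').

    Hypothetical helper, not part of WIEN2k.
    """
    jobs = []
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key.strip().isdigit():  # skips 'granularity:' and 'lapw0:' lines
            # Drop an optional ':N' suffix after each host name.
            jobs.append([host.split(":")[0] for host in rest.split()])
    return jobs

jobs = lapw1_hosts_per_job(MACHINES)
expected = Counter(host for job in jobs for host in job)

for index, hosts in enumerate(jobs, start=1):
    print(f"job {index}: {' '.join(hosts)}")
print("expected lapw1 processes per host:", dict(expected))
```

With this file every one of the 8 nodes, including node119 and node121,
should end up with a single lapw1 process.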

The mpirun commands executed on the master node (see the output of
`top -c -u rubel` below) seem to be OK:

     /opt/intel/mpich-1.2.5.3/bin/mpirun -np 4 -machinefile .machine1 /home/ru...
     /opt/intel/mpich-1.2.5.3/bin/mpirun -np 4 -machinefile .machine2 /home/ru...

The .machineX files (see the output below) are correct as well.

I do not understand the reason for this behavior. Is it a fault of our
queue system or of Wien2k?

I would be thankful for any pointers.

Oleg Rubel


P.S. Additional info:

     marc-hn:~/wien_work/GaAsBeta2_2x4> more .machine*
     ::::::::::::::
     .machine1
     ::::::::::::::
     node119
     node120
     node125
     node127
     ::::::::::::::
     .machine2
     ::::::::::::::
     node134
     node126
     node132
     node121
     ::::::::::::::
     .machines
     ::::::::::::::
     granularity:1
     1:node119 node120 node125 node127
     1:node134 node126 node132 node121
     lapw0:node119:1 node120:1 node125:1 node127:1 node134:1 node126:1 node132:1 node121:1


     node119:~> nice top -c -u rubel
     Tasks: 135 total,   6 running, 129 sleeping,   0 stopped,   0 zombie
     Cpu(s): 75.0%us,  0.0%sy, 25.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
     Mem:  16526700k total,  6365176k used, 10161524k free,   178912k buffers
     Swap:  4000144k total,     4004k used,  3996140k free,  1652268k cached

       PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
      9405 rubel     20   0 1823m 1.7g 3336 R  100 10.6   5:32.57 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi lapw1_1.def -p4pg /home/rubel/wien_w
      9558 rubel     20   0 1811m 1.7g 3340 R  100 10.5   5:44.22 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi lapw1_2.def -p4pg /home/rubel/wien_w
      9174 rubel     20   0  2784  736  548 S    0  0.0   0:00.17 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1cpara -c lapw1.def
      9830 rubel     20   0 48016 1884 1180 R    0  0.0   0:00.01 sshd: rubel@pts/0
     10144 rubel     24   4  6632 1204  852 R    0  0.0   0:00.02 top -c -u rubel
      8559 rubel     20   0  2780  716  552 S    0  0.0   0:00.01 /bin/csh /var/spool/sge/node119/job_scripts/1523496
      8680 rubel     20   0  2784  676  500 S    0  0.0   0:00.00 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/min -i 100 -s 10 -j run_lapw -p -I
      8717 rubel     20   0  2792  732  536 S    0  0.0   0:00.00 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/run_lapw -p -I -i 40 -fc 0.5 -ec 0.
      9158 rubel     20   0  7336 1268  832 S    0  0.0   0:00.01 /bin/tcsh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/x lapw1 -c -p
      9276 rubel     20   0  2784  476  264 S    0  0.0   0:00.00 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1cpara -c lapw1.def
      9277 rubel     20   0  6436 1744 1052 S    0  0.0   0:00.02 /bin/sh /opt/intel/mpich-1.2.5.3/bin/mpirun -np 4 -machinefile .machine1 /home/ru
      9406 rubel     20   0 49908 4012 1452 S    0  0.0   0:00.00 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi lapw1_1.def -p4pg /home/rubel/wien_w
      9407 rubel     20   0 11264 1648 1356 S    0  0.0   0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node120 /home/rubel/WIEN2
      9424 rubel     20   0  6788  736  616 S    0  0.0   0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 48265 node120 exec '/usr/local/sge/ut
      9433 rubel     20   0  2784  464  252 S    0  0.0   0:00.00 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1cpara -c lapw1.def
      9434 rubel     20   0  6436 1740 1052 S    0  0.0   0:00.02 /bin/sh /opt/intel/mpich-1.2.5.3/bin/mpirun -np 4 -machinefile .machine2 /home/ru
      9559 rubel     20   0 49908 4004 1452 S    0  0.0   0:00.00 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi lapw1_2.def -p4pg /home/rubel/wien_w
      9560 rubel     20   0 11264 1652 1356 S    0  0.0   0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node134 /home/rubel/WIEN2
      9568 rubel     20   0  6788  736  616 S    0  0.0   0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 52336 node134 exec '/usr/local/sge/ut
      9569 rubel     20   0 11264 1652 1356 S    0  0.0   0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node125 /home/rubel/WIEN2
      9577 rubel     20   0 11264 1652 1356 S    0  0.0   0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node126 /home/rubel/WIEN2
      9579 rubel     20   0  6788  740  616 S    0  0.0   0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 57921 node125 exec '/usr/local/sge/ut
      9586 rubel     20   0  6788  744  616 S    0  0.0   0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 34563 node126 exec '/usr/local/sge/ut
      9588 rubel     20   0 11264 1652 1356 S    0  0.0   0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node132 /home/rubel/WIEN2
      9596 rubel     20   0 11264 1652 1356 S    0  0.0   0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node127 /home/rubel/WIEN2
      9604 rubel     20   0  6788  740  616 S    0  0.0   0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 59617 node132 exec '/usr/local/sge/ut
      9605 rubel     20   0  6788  740  616 S    0  0.0   0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 34361 node127 exec '/usr/local/sge/ut
      9831 rubel     20   0 10660 2496 1064 S    0  0.0   0:00.03 -tcsh
     10150 rubel     20   0  2648  436  356 S    0  0.0   0:00.00 sleep 1


     marc-hn:~/wien_work/GaAsBeta2_2x4> more .time1_*
     ::::::::::::::
     .time1_1
     ::::::::::::::
     node119 node120 node125 node127(6) /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node120 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node120 \-p4rmrank 1
     /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node125 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node125 \-p4rmrank 2
     /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node127 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node127 \-p4rmrank 3
     Using    4 processors
     ::::::::::::::
     .time1_2
     ::::::::::::::
     node134 node126 node132 node121(6) /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node134 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node134 \-p4rmrank 1
     /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node126 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node126 \-p4rmrank 2
     /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node132 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node132 \-p4rmrank 3
     Using    4 processors
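
The launch lines in the .time1_* output above can be turned into a
rank-to-host table with the sketch below. It rests on one assumption that
I have not verified: the host name printed just before the port number
(node119 in both files) is the host on which rank 0 of that job runs and
to which the other ranks connect back. The function and variable names
are mine, not part of WIEN2k or MPICH.

```python
import re
from collections import Counter

# Launch lines copied from the .time1_* output in this post.
# Raw strings keep the backslashes exactly as printed.
TIME1 = {
    ".time1_1": r"""
node119 node120 node125 node127(6) /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node120 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node120 \-p4rmrank 1
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node125 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node125 \-p4rmrank 2
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node127 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node127 \-p4rmrank 3
""",
    ".time1_2": r"""
node134 node126 node132 node121(6) /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node134 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node134 \-p4rmrank 1
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node126 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node126 \-p4rmrank 2
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node132 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node132 \-p4rmrank 3
""",
}

# Captures: target host, host before the port, port, rank.
LAUNCH = re.compile(r"-nostdin (\S+) \S+ (\S+) (\d+) .*?-p4rmrank (\d+)")

def placement(log):
    """Map MPI rank -> host for one job, based on its qrsh launch lines."""
    ranks = {}
    for target, master, _port, rank in LAUNCH.findall(log):
        ranks[0] = master  # ASSUMPTION: rank 0 runs on the connect-back host
        ranks[int(rank)] = target
    return ranks

per_job = {name: placement(log) for name, log in TIME1.items()}
load = Counter(host for ranks in per_job.values() for host in ranks.values())

for name, ranks in per_job.items():
    print(name, [ranks[r] for r in sorted(ranks)])
print("processes per host:", dict(load))
```

Under that assumption the second job places ranks 1 to 3 on node134,
node126 and node132 and rank 0 on node119, so node119 is counted twice and
node121 does not appear at all. This agrees with what I see in top.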

