[Wien] Large memory consumption of MPI k-point parallel version
Oleg Rubel
rubel at Physik.Uni-Marburg.de
Mon Apr 14 13:38:09 CEST 2008
I realized that the origin of my problem is incorrect scheduling of
MPI processes in a k-point-parallel MPI job. As an example, I use a total
of 8 nodes split 2x4 according to the following .machines file:
granularity:1
1:node119 node120 node125 node127
1:node134 node126 node132 node121
lapw0:node119:1 node120:1 node125:1 node127:1 node134:1 node126:1 node132:1 node121:1
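To make the intended 2x4 split explicit, here is a minimal Python sketch that reads the .machines file above. It assumes a simplified reading of the format: every line whose key before the first colon is a number defines one k-point job, and the hosts on that line form one MPI group (this matches the .machine1 and .machine2 files shown in the P.S.). The function name parse_machines is hypothetical and is not part of Wien2k.

```python
def parse_machines(text):
    """Split a .machines file into per-job host lists and other settings."""
    jobs = []      # one host list per 'weight:host host ...' line
    settings = {}  # everything else, e.g. granularity and lapw0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        if key.isdigit():
            # Drop an optional ':count' suffix from each host name.
            jobs.append([host.split(":")[0] for host in value.split()])
        else:
            settings[key] = value.strip()
    return jobs, settings


MACHINES = """\
granularity:1
1:node119 node120 node125 node127
1:node134 node126 node132 node121
lapw0:node119:1 node120:1 node125:1 node127:1 node134:1 node126:1 node132:1 node121:1
"""

jobs, settings = parse_machines(MACHINES)
for index, hosts in enumerate(jobs, start=1):
    print(f".machine{index}: {' '.join(hosts)}")
```

Under this reading, the intended layout is one MPI process on each of the 8 nodes.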
The master node is node119. When I log in to this node, I see two (?) lapw1
processes running instead of one. At the same time, node121 remains idle.
It seems that the master node receives one process from each
'1:node...' line. In my former report, with a 4x4 split of the nodes, this
caused the memory overload.
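One way to read these process counts is the toy model below, written in Python. It assumes that each mpirun keeps rank 0 on the host where it is launched and starts the remaining ranks on the other hosts of its machine file, in order. This assumption is inferred from the .time1_* output in the P.S. and is not taken from the MPICH documentation. The function name place_ranks is hypothetical.

```python
from collections import Counter


def place_ranks(launch_host, machinefile, nprocs):
    """Assumed model: rank 0 on launch_host, other ranks on the listed hosts in order."""
    remote = [host for host in machinefile if host != launch_host]
    return [launch_host] + remote[: nprocs - 1]


machine1 = ["node119", "node120", "node125", "node127"]
machine2 = ["node134", "node126", "node132", "node121"]

# Both mpirun commands are started on the master node, node119.
load = Counter()
for hosts in (machine1, machine2):
    load.update(place_ranks("node119", hosts, 4))

idle = [host for host in machine1 + machine2 if load[host] == 0]
print(load["node119"], idle)  # prints: 2 ['node121']
```

If the model holds, an NxM split would put N processes on the master node, which would be consistent with the memory overload seen in the 4x4 case.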
The mpirun commands executed on the master node (see the output of
`top -c -u rubel` below) seem to be OK:
/opt/intel/mpich-1.2.5.3/bin/mpirun -np 4 -machinefile .machine1 /home/ru...
/opt/intel/mpich-1.2.5.3/bin/mpirun -np 4 -machinefile .machine2 /home/ru...
The .machineX files (see output below) are correct as well.
I do not understand the reason for this behavior. Is it a fault of our
queue system or of Wien2k?
I would be thankful for any pointers.
Oleg Rubel
P.S. Additional info:
marc-hn:~/wien_work/GaAsBeta2_2x4> more .machine*
::::::::::::::
.machine1
::::::::::::::
node119
node120
node125
node127
::::::::::::::
.machine2
::::::::::::::
node134
node126
node132
node121
::::::::::::::
.machines
::::::::::::::
granularity:1
1:node119 node120 node125 node127
1:node134 node126 node132 node121
lapw0:node119:1 node120:1 node125:1 node127:1 node134:1 node126:1 node132:1 node121:1
node119:~> nice top -c -u rubel
Tasks: 135 total, 6 running, 129 sleeping, 0 stopped, 0 zombie
Cpu(s): 75.0%us, 0.0%sy, 25.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16526700k total, 6365176k used, 10161524k free, 178912k buffers
Swap: 4000144k total, 4004k used, 3996140k free, 1652268k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9405 rubel 20 0 1823m 1.7g 3336 R 100 10.6 5:32.57 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi lapw1_1.def -p4pg /home/rubel/wien_w
9558 rubel 20 0 1811m 1.7g 3340 R 100 10.5 5:44.22 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi lapw1_2.def -p4pg /home/rubel/wien_w
9174 rubel 20 0 2784 736 548 S 0 0.0 0:00.17 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1cpara -c lapw1.def
9830 rubel 20 0 48016 1884 1180 R 0 0.0 0:00.01 sshd: rubel@pts/0
10144 rubel 24 4 6632 1204 852 R 0 0.0 0:00.02 top -c -u rubel
8559 rubel 20 0 2780 716 552 S 0 0.0 0:00.01 /bin/csh /var/spool/sge/node119/job_scripts/1523496
8680 rubel 20 0 2784 676 500 S 0 0.0 0:00.00 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/min -i 100 -s 10 -j run_lapw -p -I
8717 rubel 20 0 2792 732 536 S 0 0.0 0:00.00 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/run_lapw -p -I -i 40 -fc 0.5 -ec 0.
9158 rubel 20 0 7336 1268 832 S 0 0.0 0:00.01 /bin/tcsh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/x lapw1 -c -p
9276 rubel 20 0 2784 476 264 S 0 0.0 0:00.00 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1cpara -c lapw1.def
9277 rubel 20 0 6436 1744 1052 S 0 0.0 0:00.02 /bin/sh /opt/intel/mpich-1.2.5.3/bin/mpirun -np 4 -machinefile .machine1 /home/ru
9406 rubel 20 0 49908 4012 1452 S 0 0.0 0:00.00 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi lapw1_1.def -p4pg /home/rubel/wien_w
9407 rubel 20 0 11264 1648 1356 S 0 0.0 0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node120 /home/rubel/WIEN2
9424 rubel 20 0 6788 736 616 S 0 0.0 0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 48265 node120 exec '/usr/local/sge/ut
9433 rubel 20 0 2784 464 252 S 0 0.0 0:00.00 /bin/csh -f /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1cpara -c lapw1.def
9434 rubel 20 0 6436 1740 1052 S 0 0.0 0:00.02 /bin/sh /opt/intel/mpich-1.2.5.3/bin/mpirun -np 4 -machinefile .machine2 /home/ru
9559 rubel 20 0 49908 4004 1452 S 0 0.0 0:00.00 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi lapw1_2.def -p4pg /home/rubel/wien_w
9560 rubel 20 0 11264 1652 1356 S 0 0.0 0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node134 /home/rubel/WIEN2
9568 rubel 20 0 6788 736 616 S 0 0.0 0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 52336 node134 exec '/usr/local/sge/ut
9569 rubel 20 0 11264 1652 1356 S 0 0.0 0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node125 /home/rubel/WIEN2
9577 rubel 20 0 11264 1652 1356 S 0 0.0 0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node126 /home/rubel/WIEN2
9579 rubel 20 0 6788 740 616 S 0 0.0 0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 57921 node125 exec '/usr/local/sge/ut
9586 rubel 20 0 6788 744 616 S 0 0.0 0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 34563 node126 exec '/usr/local/sge/ut
9588 rubel 20 0 11264 1652 1356 S 0 0.0 0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node132 /home/rubel/WIEN2
9596 rubel 20 0 11264 1652 1356 S 0 0.0 0:00.00 /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node127 /home/rubel/WIEN2
9604 rubel 20 0 6788 740 616 S 0 0.0 0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 59617 node132 exec '/usr/local/sge/ut
9605 rubel 20 0 6788 740 616 S 0 0.0 0:00.00 /usr/local/sge/utilbin/lx24-amd64/rsh -n -p 34361 node127 exec '/usr/local/sge/ut
9831 rubel 20 0 10660 2496 1064 S 0 0.0 0:00.03 -tcsh
10150 rubel 20 0 2648 436 356 S 0 0.0 0:00.00 sleep 1
marc-hn:~/wien_work/GaAsBeta2_2x4> more .time1_*
::::::::::::::
.time1_1
::::::::::::::
node119 node120 node125 node127(6) /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node120 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node120 \-p4rmrank 1
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node125 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node125 \-p4rmrank 2
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node127 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 52553 \-p4amslave \-p4yourname node127 \-p4rmrank 3
Using 4 processors
::::::::::::::
.time1_2
::::::::::::::
node134 node126 node132 node121(6) /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node134 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node134 \-p4rmrank 1
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node126 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node126 \-p4rmrank 2
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node132 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node132 \-p4rmrank 3
Using 4 processors
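The launch lines above can be checked mechanically. The Python sketch below extracts the target host and rank from each qrsh line of .time1_2 and compares them with the host list at the start of the file. It assumes that the host name following the lapw1c_mpi path (node119 here) is the master to which the slaves connect, i.e. the host that runs rank 0. The names LAUNCH and rank_hosts are hypothetical.

```python
import re

# One qrsh launch line: target host, binary path, master host, port, rank.
LAUNCH = re.compile(r"-nostdin (\S+) \S+ (\S+) \d+ .*?-p4rmrank (\d+)")

TIME1_2 = r"""node134 node126 node132 node121(6) /usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node134 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node134 \-p4rmrank 1
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node126 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node126 \-p4rmrank 2
/usr/local/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node132 /home/rubel/WIEN2k_v08.mkl_10_mpi/lapw1c_mpi node119 55820 \-p4amslave \-p4yourname node132 \-p4rmrank 3
Using 4 processors
"""


def rank_hosts(log):
    """Map each MPI rank to the host it runs on, as far as the log shows."""
    hosts = {}
    for target, master, rank in LAUNCH.findall(log):
        hosts[0] = master  # assumption: rank 0 runs on the master host
        hosts[int(rank)] = target
    return hosts


hosts = rank_hosts(TIME1_2)
listed = TIME1_2.split("(")[0].split()  # host list before the "(6)" marker
unused = [host for host in listed if host not in hosts.values()]
print(hosts)
print("listed but never used:", unused)
```

For .time1_2 this reports node121 as listed but never used, while node119, which is not in .machine2 at all, hosts rank 0.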