[Wien] k-point parallelization in WIEN2K_09.1
Ghosh SUDDHASATTWA
ssghosh at igcar.gov.in
Thu Jun 17 04:46:43 CEST 2010
Dear Prof. Blaha,
We have also successfully made Wien2k (September 2009 version) run k-point
parallel using the QMON Parallel Environment configuration (Intel Core
processors) and a shell script.
We are currently running a case with 18 atoms per unit cell on 32 CPUs.
A part of the case.dayfile is shown below:
[31]  Done   ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def"; rm -f ...
[30]  Done   ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def"; rm -f ...
[30] 25568
[32]  Done   ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def"; rm -f ...
[29]  Done   ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def"; rm -f ...
[29] 25581
[31] 25588
[30]  Done   ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def"; rm -f ...
[29]  Done   ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def"; rm -f ...
ibnx69(1) 7.654u 0.126s 7.89 98.56% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.521u 0.093s 7.75 98.16% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.565u 0.093s 7.77 98.53% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.711u 0.118s 7.91 98.95% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.561u 0.117s 7.78 98.61% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.443u 0.078s 7.67 97.99% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.766u 0.107s 8.01 98.19% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.399u 0.104s 7.59 98.84% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.268u 0.087s 7.45 98.65% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.500u 0.088s 7.78 97.49% 0+0k 0+0io 0pf+0w
ibnx69(1) 8.221u 0.089s 8.40 98.89% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.846u 0.104s 8.07 98.45% 0+0k 0+0io 0pf+0w
ibnx69(1) 8.023u 0.116s 8.24 98.74% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.187u 0.094s 7.40 98.31% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.634u 0.093s 7.83 98.65% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.520u 0.114s 7.74 98.57% 0+0k 0+0io 0pf+0w
ibnx69(1) 7.539u 0.095s 7.70 99.09% 0+0k 0+0io 0pf+0w
Summary of lapw1para:
ibnx7 k=456 user=6545.89 wallclock=6670.99
ibnx22 k=456 user=6533.73 wallclock=6642.76
ibnx24 k=456 user=38910.8 wallclock=658
ibnx105 k=228 user=2571.41 wallclock=2700.54
ibnx69 k=252 user=1868.27 wallclock=15354.8
2.323u 6.300s 1:23:47.08 0.1% 0+0k 0+0io 0pf+0w
The .machines file was generated dynamically and is shown below
1:ibnx7
1:ibnx7
1:ibnx7
1:ibnx7
1:ibnx7
1:ibnx7
1:ibnx7
1:ibnx7
1:ibnx22
1:ibnx22
1:ibnx22
1:ibnx22
1:ibnx22
1:ibnx22
1:ibnx22
1:ibnx22
1:ibnx24
1:ibnx24
1:ibnx24
1:ibnx24
1:ibnx24
1:ibnx24
1:ibnx24
1:ibnx24
1:ibnx105
1:ibnx105
1:ibnx105
1:ibnx105
1:ibnx69
1:ibnx69
1:ibnx69
1:ibnx69
We are submitting jobs through SGE (Sun Grid Engine).
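For anyone setting this up: below is a minimal sketch of how such a .machines
file can be built inside the SGE job script from the host list SGE provides in
$PE_HOSTFILE (one "hostname  nslots  queue  range" line per granted node). Our
actual script is site-specific, so treat this only as an illustration:

  # build .machines for k-point parallel WIEN2k inside the SGE job script
  rm -f .machines
  # $PE_HOSTFILE: column 1 = hostname, column 2 = number of granted slots
  # write one "1:hostname" entry per slot, i.e. one k-point job per CPU
  awk '{ for (i = 0; i < $2; i++) print "1:" $1 }' $PE_HOSTFILE >> .machines
  # optionally append load-balancing keywords understood by the *para scripts
  echo granularity:1 >> .machines

With 32 slots granted this produces 32 "1:host" lines, as in the file above.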
Thank you, Prof. Blaha, for making such a wonderful program.
Thanks
Suddhasattwa
>
> Kakhaber Jandieri wrote:
>> Dear Prof. Blaha,
>>
>>> I do NOT believe that k-point parallel with an older WIEN2k was possible
>>> (unless you set it up with "rsh" instead of "ssh" and defined a
>>> .rhosts file).
>>
>> But it is really possible. I checked again. I even reinstalled
>> WIEN2K_08.1 to verify that the options are the same as those
>> used for WIEN2K_09.1.
>> I did not set "rsh" and did not define a ".rhosts" file.
>> The behaviour of WIEN2K_08.1 is the same (as I described in my
>> previous letters).
>> According to the dayfile, k-points are distributed among all reserved
>> nodes. Here is a small fragment of the dayfile:
>>
>>
>> Calculating GaAsB in /home/kakhaber/wien_work/GaAsB
>> on node112 with PID 10597
>>
>> start (Wed Jun 16 09:50:23 CEST 2010) with lapw0 (40/99 to go)
>>
>> cycle 1 (Wed Jun 16 09:50:23 CEST 2010) (40/99 to go)
>>
>>> lapw0 -p (09:50:23) starting parallel lapw0 at Wed Jun 16
>>> 09:50:23 CEST 2010
>> --------
>> running lapw0 in single mode
>> 77.496u 0.628s 1:18.47 99.5% 0+0k 0+7008io 0pf+0w
>>> lapw1 -c -p (09:51:42) starting parallel lapw1 at Wed Jun 16
>>> 09:51:42 CEST 2010
>> -> starting parallel LAPW1 jobs at Wed Jun 16 09:51:42 CEST 2010
>> running LAPW1 in parallel mode (using .machines)
>> 4 number_of_parallel_jobs
>> node112(1) 2091.6u 2.3s 37:18.76 93.5% 0+0k 0+205296io 0pf+0w
>> node105(1) 2024.2u 2.3s 34:26.54 98.0% 0+0k 0+198376io 0pf+0w
>> node122(1) 2115.4u 5.1s 36:08.08 97.8% 0+0k 0+197808io 0pf+0w
>> node131(1) 2041.3u 2.6s 35:19.70 96.4% 0+0k 0+202912io 0pf+0w
>> Summary of lapw1para:
>> node112 k=1 user=2091.6 wallclock=2238.76
>> node105 k=1 user=2024.2 wallclock=2066.54
>> node122 k=1 user=2115.4 wallclock=2168.08
>> node131 k=1 user=2041.3 wallclock=2119.7
>> 8274.113u 15.744s 37:20.53 369.9% 0+0k 8+805440io 0pf+0w
>>> lapw2 -c -p (10:29:02) running LAPW2 in parallel mode
>> node112 87.4u 0.5s 1:37.85 89.9% 0+0k 0+8104io 0pf+0w
>> node105 86.8u 0.9s 1:30.90 96.5% 0+0k 198064+8096io 0pf+0w
>> node122 84.7u 0.6s 1:27.71 97.3% 0+0k 0+8088io 0pf+0w
>> node131 87.9u 1.0s 1:31.00 97.7% 0+0k 0+8088io 0pf+0w
>> Summary of lapw2para:
>> node112 user=87.4 wallclock=97.85
>> node105 user=86.8 wallclock=90.9
>> node122 user=84.7 wallclock=87.71
>> node131 user=87.9 wallclock=91
>> 349.001u 3.592s 1:41.96 345.8% 0+0k 204504+42240io 0pf+0w
>>> lcore (10:30:44) 0.176u 0.060s 0:01.05 21.9% 0+0k 0+5336io 0pf+0w
>>>   mixer     (10:30:46) 1.436u 0.168s 0:01.99 79.8% 0+0k 0+11920io 0pf+0w
>> :ENERGY convergence: 0 0.001 0
>> :CHARGE convergence: 0 0.001 0
>> ec cc and fc_conv 0 0 0
>>
>> In spite of that, when I log in to the nodes, I see the following:
>>
>> node112:~> nice top -c -u kakhaber (this is the master node)
>>
>> top - 10:58:42 up 116 days, 23:19, 1 user, load average: 8.01, 7.77, 7.33
>> Tasks: 110 total, 10 running, 100 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 90.5%us, 0.3%sy, 9.1%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>> Mem: 16542480k total, 16144412k used, 398068k free, 105896k buffers
>> Swap: 4000144k total, 18144k used, 3982000k free, 11430460k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 14474 kakhaber  20   0  937m 918m 2020 R  100  5.7  26:06.48 lapw1c lapw1_4.def
>> 14458 kakhaber  20   0  926m 907m 2020 R   98  5.6  26:01.62 lapw1c lapw1_3.def
>> 14443 kakhaber  20   0  934m 915m 2028 R   98  5.7  25:46.37 lapw1c lapw1_2.def
>> 14428 kakhaber  20   0  936m 917m 2028 R   66  5.7  24:26.43 lapw1c lapw1_1.def
>>  5952 kakhaber  20   0 13724 1360  820 S    0  0.0   0:00.00 /bin/tcsh /var/spoo
>>  6077 kakhaber  20   0  3920  736  540 S    0  0.0   0:00.00 /bin/csh -f /home/k
>> 10597 kakhaber  20   0  3928  780  572 S    0  0.0   0:00.00 /bin/csh -f /home/k
>> 14320 kakhaber  20   0 11252 1180  772 S    0  0.0   0:00.00 /bin/tcsh -f /home/
>> 14336 kakhaber  20   0  3920  800  604 S    0  0.0   0:00.62 /bin/csh -f /home/k
>> 14427 kakhaber  20   0  3920  440  244 S    0  0.0   0:00.00 /bin/csh -f /home/k
>> 14442 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f /home/k
>> 14457 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f /home/k
>> 14472 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f /home/k
>> 16499 kakhaber  20   0 77296 1808 1100 R    0  0.0   0:00.00 sshd: kakhaber@pts/
>> 16500 kakhaber  20   0 16212 2032 1080 S    0  0.0   0:00.02 -tcsh
>> 16603 kakhaber  24   4 10620 1120  848 R    0  0.0   0:00.02 top -c -u kakhaber
>>
>>
>> node105:~> nice top -c -u kakhaber
>> top - 11:01:37 up 116 days, 23:23, 1 user, load average: 3.00, 3.00, 3.00
>> Tasks: 99 total, 3 running, 96 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 2.9%us, 20.6%sy, 49.9%ni, 25.0%id, 0.0%wa, 0.1%hi, 1.6%si, 0.0%st
>> Mem: 16542480k total, 6277020k used, 10265460k free, 233364k buffers
>> Swap: 4000144k total, 13512k used, 3986632k free, 5173312k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 18955 kakhaber  20   0 77296 1808 1100 S    0  0.0   0:00.00 sshd: kakhaber@pts/
>> 18956 kakhaber  20   0 16212 2032 1080 S    0  0.0   0:00.02 -tcsh
>> 19071 kakhaber  24   4 10620 1112  848 R    0  0.0   0:00.00 top -c -u kakhaber
>>
>> For node122 and node131 the output is the same as for node105.
>>
>>> Anyway, k-parallel does not use mpi at all and you have to read the
>>> requirements specified in the UG.
>>
>> I know, but I meant the following: if k-point parallelization in
>> WIEN2K_09.1 does not work because of a problem with the
>> interconnection between different nodes, then I thought that
>> MPI parallelization should also be impossible. But the MPI-parallel
>> jobs run without any problem.
>>
>> I suggest one possibility (it may be trivial or wrong).
>> In WIEN2K_08.1 k-point parallelization works, but all processes run
>> on the master node. In WIEN2K_09.1 k-point parallelization does not
>> work at all. Maybe there is some restriction in WIEN2K_09.1
>> preventing different k-processes from running on the same node, and
>> this is the reason for the crash in parallel lapw1?
>>
>> Is such a suggestion reasonable?
>> I will be extremely thankful for your additional advice.
>>
>>
>>> Kakhaber Jandieri wrote:
>>>> Dear Prof. Blaha,
>>>>
>>>> Thank you for your reply.
>>>>
>>>>> Can you ssh node120 ps
>>>>> without supplying a password ?
>>>>
>>>> No, I can't ssh to the nodes without supplying a password, but in my
>>>> parallel_options I have setenv MPI_REMOTE 0. I thought that our
>>>> cluster has a shared-memory architecture, since the
>>>> MPI parallelization works without any problem for 1 k-point. I
>>>> checked the corresponding nodes; they were all loaded. Maybe I
>>>> misunderstood something. Are the requirements for
>>>> MPI parallelization different from those for k-point parallelization?
>>>>
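(For context: MPI_REMOTE only controls how the mpi-parallel binaries are
launched; the k-point parallel scripts are governed by USE_REMOTE. A typical
$WIENROOT/parallel_options, as written by siteconfig, looks roughly like the
sketch below - the exact contents depend on the siteconfig answers, so treat
this only as an illustration:

  # parallel_options (csh syntax, sourced by the *para scripts)
  setenv USE_REMOTE 1     # 1: start k-point jobs via ssh on the .machines hosts
                          # 0: start them locally (single shared-memory machine)
  setenv MPI_REMOTE 0     # how the mpi-parallel binaries are started
  setenv WIEN_GRANULARITY 1
  setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"

With USE_REMOTE 1, passwordless ssh between the nodes is required even if
MPI_REMOTE is 0.)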
>>>>> Try x lapw1 -p on the commandline.
>>>>> What exactly is the "error" ?
>>>>
>>>> Just now, to try your suggestions, I ran a new task with k-point
>>>> parallelization. The .machines file is:
>>>> granularity:1
>>>> 1:node120
>>>> 1:node127
>>>> 1:node121
>>>> 1:node123
>>>>
>>>> with node120 as the master node.
>>>>
>>>> The output of x lapw1 -p is:
>>>> starting parallel lapw1 at Sun Jun 13 22:44:08 CEST 2010
>>>> -> starting parallel LAPW1 jobs at Sun Jun 13 22:44:08 CEST 2010
>>>> running LAPW1 in parallel mode (using .machines)
>>>> 4 number_of_parallel_jobs
>>>> [1] 31314
>>>> [2] 31341
>>>> [3] 31357
>>>> [4] 31373
>>>> Permission denied, please try again.
>>>> Permission denied, please try again.
>>>> Received disconnect from 172.26.6.120: 2: Too many authentication
>>>> failures for kakhaber
>>>> [1] Done ( ( $remote $machine[$p] ...
>>>> Permission denied, please try again.
>>>> Permission denied, please try again.
>>>> Received disconnect from 172.26.6.127: 2: Too many authentication
>>>> failures for kakhaber
>>>> Permission denied, please try again.
>>>> Permission denied, please try again.
>>>> Received disconnect from 172.26.6.121: 2: Too many authentication
>>>> failures for kakhaber
>>>> [3] - Done ( ( $remote $machine[$p] ...
>>>> [2] - Done ( ( $remote $machine[$p] ...
>>>> Permission denied, please try again.
>>>> Permission denied, please try again.
>>>> Received disconnect from 172.26.6.123: 2: Too many authentication
>>>> failures for kakhaber
>>>> [4] Done ( ( $remote $machine[$p] ...
>>>> node120(1) node127(1) node121(1) node123(1) **
>>>> LAPW1 crashed!
>>>> cat: No match.
>>>> 0.116u 0.324s 0:11.88 3.6% 0+0k 0+864io 0pf+0w
>>>> error: command /home/kakhaber/WIEN2K_09/lapw1cpara -c lapw1.def failed
>>>>
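(The "Permission denied" messages above are ssh authentication failures: for
each entry in .machines, lapw1para runs "$remote $machine[$p] ..." - ssh by
default, as seen in the dayfile lines earlier - so passwordless login from the
master node to every listed node is required. A minimal sketch of a key-based
setup, assuming the home directory is shared across the nodes:

  # on the master node, as the user running WIEN2k
  ssh-keygen -t rsa                 # accept the defaults, empty passphrase
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys  # sshd rejects overly permissive key files
  ssh node120 ps                    # must work without a password prompt

If the home directory is not shared, the public key has to be appended to
~/.ssh/authorized_keys on every node instead.)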
>>>>> How many k-points do you have ? ( 4 ?)
>>>>
>>>> Yes, I have 4 k-points.
>>>>
>>>>> Content of .machine1 and .processes
>>>>
>>>> marc-hn:~/wien_work/GaAsB> cat .machine1
>>>> node120
>>>> marc-hn:~/wien_work/GaAsB> cat .machine2
>>>> node127
>>>> marc-hn:~/wien_work/GaAsB> cat .machine3
>>>> node121
>>>> marc-hn:~/wien_work/GaAsB> cat .machine4
>>>> node123
>>>>
>>>> marc-hn:~/wien_work/GaAsB> cat .processes
>>>> init:node120
>>>> init:node127
>>>> init:node121
>>>> init:node123
>>>> 1 : node120 : 1 : 1 : 1
>>>> 2 : node127 : 1 : 1 : 2
>>>> 3 : node121 : 1 : 1 : 3
>>>> 4 : node123 : 1 : 1 : 4
>>>>
>>>>> While x lapw1 -p is running, do a ps -ef |grep lapw
>>>>
>>>> I did not have enough time to do it - the program crashed before that.
>>>>
>>>>> Your .machines file is most likely a rather "useless" one. The mpi-lapw1
>>>>> diagonalization (SCALAPACK) is almost a factor of 2 slower than the serial
>>>>> version, thus your speedup by using 2 processors in mpi-mode will be
>>>>> very small.
>>>>
>>>> Yes, I know, but I am simply trying to set up the calculations
>>>> using WIEN2k. For "real" calculations I will use many more
>>>> processors.
>>>>
>>>> And finally, some additional information. As I wrote in my
>>>> previous letters, in
>>>> WIEN2k_08.1 k-point parallelization works, but all processes run
>>>> on the master node and all other reserved nodes are idle. I
>>>> forgot to mention: this is true for lapw1 only. Lapw2 is
>>>> distributed among all reserved nodes.
>>>>
>>>> Thank you once again. I am looking forward to your further advice.
>>>>
>>>>
>>>> Dr. Kakhaber Jandieri
>>>> Department of Physics
>>>> Philipps University Marburg
>>>> Tel:+49 6421 2824159 (2825704)
>>>>
>>>>
>>>> _______________________________________________
>>>> Wien mailing list
>>>> Wien at zeus.theochem.tuwien.ac.at
>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>
>>> --
>>> -----------------------------------------
>>> Peter Blaha
>>> Inst. Materials Chemistry, TU Vienna
>>> Getreidemarkt 9, A-1060 Vienna, Austria
>>> Tel: +43-1-5880115671
>>> Fax: +43-1-5880115698
>>> email: pblaha at theochem.tuwien.ac.at
>>> -----------------------------------------
>>> _______________________________________________
>>> Wien mailing list
>>> Wien at zeus.theochem.tuwien.ac.at
>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>>
>>
>> Dr. Kakhaber Jandieri
>> Department of Physics
>> Philipps University Marburg
>> Tel:+49 6421 2824159 (2825704)
>>
>>
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
> Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien