[Wien] k-point parallelization in WIEN2K_09.1

Peter Blaha pblaha at theochem.tuwien.ac.at
Wed Jun 16 16:26:13 CEST 2010


No, there was no change!

Did you set "shared memory" ??  This would also explain why everything runs on
one machine ??
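
The "shared memory" question from siteconfig_lapw ends up in
$WIENROOT/parallel_options (the same file in which you mention MPI_REMOTE
below). A rough sketch of the two variants, assuming the standard layout of
that file; the variable names are from a default installation and the values
here are only illustrative:

    # $WIENROOT/parallel_options  (csh syntax)
    # "shared memory" machine: all k-parallel jobs are spawned locally,
    # no ssh to other nodes at all
    setenv USE_REMOTE 0
    setenv MPI_REMOTE 0

    # distributed cluster: lapw1para/lapw2para start the jobs via ssh on
    # the hosts listed in .machines, so passwordless ssh must work
    # setenv USE_REMOTE 1

With USE_REMOTE 0 it is expected that all k-parallel processes run on the
machine where run_lapw was started, no matter what .machines says.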



Kakhaber Jandieri wrote:
> Dear Prof. Blaha,
> 
>> I do NOT believe that k-point parallel with an older WIEN2k was possible
>> (unless you set it up with "rsh" instead of "ssh" and defined a  
>> .rhosts file).
> 
> But it really is possible. I checked again. I even reinstalled 
> WIEN2K_08.1 to verify that the options are the same as those used 
> for WIEN2K_09.1.
> I did not set "rsh" and did not define a ".rhosts" file.
> The behaviour of WIEN2K_08.1 is the same (as I described in my previous 
> messages).
> According to the dayfile, the k-points are distributed among all reserved nodes. 
> Here is a short fragment of the dayfile:
> 
> 
> Calculating GaAsB in /home/kakhaber/wien_work/GaAsB
> on node112 with PID 10597
> 
>     start     (Wed Jun 16 09:50:23 CEST 2010) with lapw0 (40/99 to go)
> 
>     cycle 1     (Wed Jun 16 09:50:23 CEST 2010)     (40/99 to go)
> 
>>   lapw0 -p    (09:50:23) starting parallel lapw0 at Wed Jun 16 
>> 09:50:23 CEST 2010
> --------
> running lapw0 in single mode
> 77.496u 0.628s 1:18.47 99.5%    0+0k 0+7008io 0pf+0w
>>   lapw1  -c -p     (09:51:42) starting parallel lapw1 at Wed Jun 16 
>> 09:51:42 CEST 2010
> ->  starting parallel LAPW1 jobs at Wed Jun 16 09:51:42 CEST 2010
> running LAPW1 in parallel mode (using .machines)
> 4 number_of_parallel_jobs
>      node112(1) 2091.6u 2.3s 37:18.76 93.5% 0+0k 0+205296io 0pf+0w
>      node105(1) 2024.2u 2.3s 34:26.54 98.0% 0+0k 0+198376io 0pf+0w
>      node122(1) 2115.4u 5.1s 36:08.08 97.8% 0+0k 0+197808io 0pf+0w
>      node131(1) 2041.3u 2.6s 35:19.70 96.4% 0+0k 0+202912io 0pf+0w
>    Summary of lapw1para:
>    node112     k=1     user=2091.6     wallclock=2238.76
>    node105     k=1     user=2024.2     wallclock=2066.54
>    node122     k=1     user=2115.4     wallclock=2168.08
>    node131     k=1     user=2041.3     wallclock=2119.7
> 8274.113u 15.744s 37:20.53 369.9%    0+0k 8+805440io 0pf+0w
>>   lapw2 -c  -p     (10:29:02) running LAPW2 in parallel mode
>       node112 87.4u 0.5s 1:37.85 89.9% 0+0k 0+8104io 0pf+0w
>       node105 86.8u 0.9s 1:30.90 96.5% 0+0k 198064+8096io 0pf+0w
>       node122 84.7u 0.6s 1:27.71 97.3% 0+0k 0+8088io 0pf+0w
>       node131 87.9u 1.0s 1:31.00 97.7% 0+0k 0+8088io 0pf+0w
>    Summary of lapw2para:
>    node112     user=87.4     wallclock=97.85
>    node105     user=86.8     wallclock=90.9
>    node122     user=84.7     wallclock=87.71
>    node131     user=87.9     wallclock=91
> 349.001u 3.592s 1:41.96 345.8%    0+0k 204504+42240io 0pf+0w
>>   lcore    (10:30:44) 0.176u 0.060s 0:01.05 21.9%    0+0k 0+5336io 0pf+0w
>>   mixer    (10:30:46) 1.436u 0.168s 0:01.99 79.8%    0+0k 0+11920io 
>> 0pf+0w
> :ENERGY convergence:  0 0.001 0
> :CHARGE convergence:  0 0.001 0
> ec cc and fc_conv 0 0 0
> 
> In spite of that, when I log in to the nodes, I see the following:
> 
> node112:~> nice top -c -u kakhaber (this is the master node)
> 
> top - 10:58:42 up 116 days, 23:19,  1 user,  load average: 8.01, 7.77, 7.33
> Tasks: 110 total,  10 running, 100 sleeping,   0 stopped,   0 zombie
> Cpu(s): 90.5%us,  0.3%sy,  9.1%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  
> 0.0%st
> Mem:  16542480k total, 16144412k used,   398068k free,   105896k buffers
> Swap:  4000144k total,    18144k used,  3982000k free, 11430460k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 14474 kakhaber  20   0  937m 918m 2020 R  100  5.7  26:06.48 lapw1c 
> lapw1_4.def
> 14458 kakhaber  20   0  926m 907m 2020 R   98  5.6  26:01.62 lapw1c 
> lapw1_3.def
> 14443 kakhaber  20   0  934m 915m 2028 R   98  5.7  25:46.37 lapw1c 
> lapw1_2.def
> 14428 kakhaber  20   0  936m 917m 2028 R   66  5.7  24:26.43 lapw1c 
> lapw1_1.def
>  5952 kakhaber  20   0 13724 1360  820 S    0  0.0   0:00.00 /bin/tcsh 
> /var/spoo
>  6077 kakhaber  20   0  3920  736  540 S    0  0.0   0:00.00 /bin/csh -f 
> /home/k
> 10597 kakhaber  20   0  3928  780  572 S    0  0.0   0:00.00 /bin/csh -f 
> /home/k
> 14320 kakhaber  20   0 11252 1180  772 S    0  0.0   0:00.00 /bin/tcsh 
> -f /home/
> 14336 kakhaber  20   0  3920  800  604 S    0  0.0   0:00.62 /bin/csh -f 
> /home/k
> 14427 kakhaber  20   0  3920  440  244 S    0  0.0   0:00.00 /bin/csh -f 
> /home/k
> 14442 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f 
> /home/k
> 14457 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f 
> /home/k
> 14472 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f 
> /home/k
> 16499 kakhaber  20   0 77296 1808 1100 R    0  0.0   0:00.00 sshd: 
> kakhaber at pts/
> 16500 kakhaber  20   0 16212 2032 1080 S    0  0.0   0:00.02 -tcsh
> 16603 kakhaber  24   4 10620 1120  848 R    0  0.0   0:00.02 top -c -u 
> kakhaber
> 
> 
> node105:~> nice top -c -u kakhaber
> top - 11:01:37 up 116 days, 23:23,  1 user,  load average: 3.00, 3.00, 3.00
> Tasks:  99 total,   3 running,  96 sleeping,   0 stopped,   0 zombie
> Cpu(s):  2.9%us, 20.6%sy, 49.9%ni, 25.0%id,  0.0%wa,  0.1%hi,  1.6%si,  
> 0.0%st
> Mem:  16542480k total,  6277020k used, 10265460k free,   233364k buffers
> Swap:  4000144k total,    13512k used,  3986632k free,  5173312k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 18955 kakhaber  20   0 77296 1808 1100 S    0  0.0   0:00.00 sshd: 
> kakhaber at pts/
> 18956 kakhaber  20   0 16212 2032 1080 S    0  0.0   0:00.02 -tcsh
> 19071 kakhaber  24   4 10620 1112  848 R    0  0.0   0:00.00 top -c -u 
> kakhaber
> 
> For node122 and node131 the output is the same as for node105.
> 
>> Anyway, k-parallel does not use mpi at all and you have to read the
>> requirements specified in the UG.
> 
> I know, but I meant the following: if k-point parallelization in 
> WIEN2K_09.1 does not work because of a problem with the interconnection 
> between different nodes, then I would expect MPI parallelization to be 
> impossible as well. But the MPI-parallel jobs run without any 
> problem.
> 
> Let me suggest one possibility (it may be trivial or wrong).
> In WIEN2K_08.1 k-point parallelization works, but all processes run on 
> the master node. In WIEN2K_09.1 k-point parallelization does not work at 
> all. Maybe there is some restriction in WIEN2K_09.1 that prevents 
> different k-point processes from running on the same node, and this is 
> the reason for the crash in parallel lapw1?
> 
> Is such a suggestion reasonable?
> I would be extremely thankful for any additional advice.
> 
> 
>> Kakhaber Jandieri wrote:
>>> Dear Prof. Blaha,
>>>
>>> Thank you for your reply.
>>>
>>>> Can you    ssh node120 ps
>>>> without supplying a password ?
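>>>>
>>>> (If not, k-point parallel cannot work: lapw1para starts the remote jobs
>>>> with plain ssh. A rough sketch of how passwordless ssh is usually set up,
>>>> assuming OpenSSH and a home directory that is shared between the nodes:
>>>>
>>>>    ssh-keygen -t rsa          # accept the defaults, empty passphrase
>>>>    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
>>>>    chmod 600 ~/.ssh/authorized_keys
>>>>    ssh node120 ps             # should now work without a password
>>>>
>>>> If your site uses a different mechanism, ask your administrators.)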
>>>
>>> No, I cannot ssh to the nodes without supplying a password, but in my
>>> parallel_options I have setenv MPI_REMOTE 0. I thought that our
>>> cluster has a shared-memory architecture, since the MPI
>>> parallelization works without any problem for 1 k-point. I checked
>>> the corresponding nodes; all of them were loaded. Maybe I
>>> misunderstood something. Are the requirements for MPI
>>> parallelization different from those for k-point parallelization?
>>>
>>>> Try x lapw1 -p on the commandline.
>>>> What exactly is the "error" ?
>>>
>>> Just now, to try your suggestions, I ran a new task with k-point
>>> parallelization. The .machines file is:
>>> granularity:1
>>> 1:node120
>>> 1:node127
>>> 1:node121
>>> 1:node123
>>>
>>> with node120 as the master node.
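>>>
>>> (My reading of the UG: each "1:nodeXXX" line requests one lapw1 job on
>>> that host, the leading number being a relative speed weight used when
>>> the k-points are distributed, and "granularity:1" switches off the
>>> finer splitting of the k-point list, so with 4 k-points every node
>>> should get exactly one of them.)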
>>>
>>> The output of x lapw1 -p is:
>>> starting parallel lapw1 at Sun Jun 13 22:44:08 CEST 2010
>>> ->  starting parallel LAPW1 jobs at Sun Jun 13 22:44:08 CEST 2010
>>> running LAPW1 in parallel mode (using .machines)
>>> 4 number_of_parallel_jobs
>>> [1] 31314
>>> [2] 31341
>>> [3] 31357
>>> [4] 31373
>>> Permission denied, please try again.
>>> Permission denied, please try again.
>>> Received disconnect from 172.26.6.120: 2: Too many authentication  
>>> failures for kakhaber
>>> [1]    Done                   ( ( $remote $machine[$p]  ...
>>> Permission denied, please try again.
>>> Permission denied, please try again.
>>> Received disconnect from 172.26.6.127: 2: Too many authentication  
>>> failures for kakhaber
>>> Permission denied, please try again.
>>> Permission denied, please try again.
>>> Received disconnect from 172.26.6.121: 2: Too many authentication  
>>> failures for kakhaber
>>> [3]  - Done                   ( ( $remote $machine[$p]  ...
>>> [2]  - Done                   ( ( $remote $machine[$p]  ...
>>> Permission denied, please try again.
>>> Permission denied, please try again.
>>> Received disconnect from 172.26.6.123: 2: Too many authentication  
>>> failures for kakhaber
>>> [4]    Done                   ( ( $remote $machine[$p]  ...
>>>     node120(1)      node127(1)      node121(1)      node123(1) **   
>>> LAPW1 crashed!
>>> cat: No match.
>>> 0.116u 0.324s 0:11.88 3.6%        0+0k 0+864io 0pf+0w
>>> error: command   /home/kakhaber/WIEN2K_09/lapw1cpara -c lapw1.def   
>>> failed
>>>
>>>> How many k-points do you have ? ( 4 ?)
>>>
>>> Yes, I have 4 k-points.
>>>
>>>> Content of .machine1 and .processes
>>>
>>> marc-hn:~/wien_work/GaAsB> cat .machine1
>>> node120
>>> marc-hn:~/wien_work/GaAsB> cat .machine2
>>> node127
>>> marc-hn:~/wien_work/GaAsB> cat .machine3
>>> node121
>>> marc-hn:~/wien_work/GaAsB> cat .machine4
>>> node123
>>>
>>> marc-hn:~/wien_work/GaAsB> cat .processes
>>> init:node120
>>> init:node127
>>> init:node121
>>> init:node123
>>> 1 : node120 :  1 : 1 : 1
>>> 2 : node127 :  1 : 1 : 2
>>> 3 : node121 :  1 : 1 : 3
>>> 4 : node123 :  1 : 1 : 4
>>>
>>>> While x lapw1 -p is running, do a    ps -ef |grep lapw
>>>
>>> I did not have enough time to do it; the program crashed before that.
>>>
>>>> Your .machines file is most likely a rather "useless" one. The 
>>>> mpi-lapw1
>>>> diagonalization (SCALAPACK) is almost a factor of 2 slower than the 
>>>> serial
>>>> version, thus your speedup by using 2 processors in mpi-mode will be
>>>> very small.
>>>
>>> Yes, I know, but I am simply trying to get the calculations running
>>> with WIEN2k. For "real" calculations I will use many more processors.
>>>
>>> And finally, some additional information. As I wrote in my previous
>>> messages, in WIEN2k_08.1 k-point parallelization works, but all
>>> processes run on the master node and all other reserved nodes are
>>> idle. I forgot to mention: this is true for lapw1 only; lapw2 is
>>> distributed among all reserved nodes.
>>>
>>> Thank you once again. I am looking forward to your further advice.
>>>
>>>
>>> Dr. Kakhaber Jandieri
>>> Department of Physics
>>> Philipps University Marburg
>>> Tel:+49 6421 2824159 (2825704)
>>>
>>>
>>
>> -- 
>> -----------------------------------------
>> Peter Blaha
>> Inst. Materials Chemistry, TU Vienna
>> Getreidemarkt 9, A-1060 Vienna, Austria
>> Tel: +43-1-5880115671
>> Fax: +43-1-5880115698
>> email: pblaha at theochem.tuwien.ac.at
>> -----------------------------------------
> 
> 
> 
> Dr. Kakhaber Jandieri
> Department of Physics
> Philipps University Marburg
> Tel:+49 6421 2824159 (2825704)
> 
> 

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------

