[Wien] k-point parallelization in WIEN2K_09.1
Kakhaber Jandieri
kakhaber.jandieri at physik.uni-marburg.de
Fri Jun 18 18:29:42 CEST 2010
Dear Prof. Blaha
Setting "setenv USE_REMOTE 1" solved all existing problems.
k-point parallelization now runs successfully in WIEN2K_09.1, with the tasks
correctly distributed among the reserved nodes.
MPI-parallelization is also OK.
Thank you very much.
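
For reference, the relevant part of my $WIENROOT/parallel_options now reads
roughly as follows (this is only a sketch of my own setup; the mpirun line in
particular may differ on other installations):

setenv USE_REMOTE 1
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"

With USE_REMOTE set to 1 the k-parallel scripts start each lapw1 job via ssh
on the node listed in .machines, instead of forking all jobs on the local node.
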
> Shared memory can be used ONLY if you have only ONE machine with multiple
> cores (actually, this option was used e.g. on an SGI Origin machine with
> 64 shared-memory CPUs).
>
> So if you have just ONE 4-core PC, you can use "shared memory", but when
> you want to couple TWO different PCs, you cannot do so.
>
> Please read the requirements for k-parallelization in the UG.
>
> Kakhaber Jandieri schrieb:
>>
>>> No, there was no change!
>>>
>>> Did you set "shared memory" ?? This would also explain why
>>> everything runs on
>>> one machine ??
>>
>> Yes, I set "shared memory" for both versions of WIEN2K and
>> accordingly I have setenv USE_REMOTE 0 in parallel_options.
>>
>>>
>>>
>>> Kakhaber Jandieri schrieb:
>>>> Dear prof. Blaha
>>>>
>>>>> I do NOT believe that k-point parallel with an older WIEN2k was possible
>>>>> (unless you set it up with "rsh" instead of "ssh" and defined a
>>>>> .rhosts file).
>>>>
>>>> But it really is possible. I checked again. I even reinstalled
>>>> WIEN2K_08.1 to verify that the options are the same as those used for
>>>> WIEN2K_09.1. I did not set "rsh" and did not define a ".rhosts" file.
>>>> The behaviour of WIEN2K_08.1 is the same as I described in my previous
>>>> letters.
>>>> According to the dayfile, k-points are distributed among all reserved
>>>> nodes. Here is a small fragment of the dayfile:
>>>>
>>>>
>>>> Calculating GaAsB in /home/kakhaber/wien_work/GaAsB
>>>> on node112 with PID 10597
>>>>
>>>> start (Wed Jun 16 09:50:23 CEST 2010) with lapw0 (40/99 to go)
>>>>
>>>> cycle 1 (Wed Jun 16 09:50:23 CEST 2010) (40/99 to go)
>>>>
>>>>> lapw0 -p (09:50:23) starting parallel lapw0 at Wed Jun 16
>>>>> 09:50:23 CEST 2010
>>>> --------
>>>> running lapw0 in single mode
>>>> 77.496u 0.628s 1:18.47 99.5% 0+0k 0+7008io 0pf+0w
>>>>> lapw1 -c -p (09:51:42) starting parallel lapw1 at Wed Jun
>>>>> 16 09:51:42 CEST 2010
>>>> -> starting parallel LAPW1 jobs at Wed Jun 16 09:51:42 CEST 2010
>>>> running LAPW1 in parallel mode (using .machines)
>>>> 4 number_of_parallel_jobs
>>>> node112(1) 2091.6u 2.3s 37:18.76 93.5% 0+0k 0+205296io 0pf+0w
>>>> node105(1) 2024.2u 2.3s 34:26.54 98.0% 0+0k 0+198376io 0pf+0w
>>>> node122(1) 2115.4u 5.1s 36:08.08 97.8% 0+0k 0+197808io 0pf+0w
>>>> node131(1) 2041.3u 2.6s 35:19.70 96.4% 0+0k 0+202912io 0pf+0w
>>>> Summary of lapw1para:
>>>> node112 k=1 user=2091.6 wallclock=2238.76
>>>> node105 k=1 user=2024.2 wallclock=2066.54
>>>> node122 k=1 user=2115.4 wallclock=2168.08
>>>> node131 k=1 user=2041.3 wallclock=2119.7
>>>> 8274.113u 15.744s 37:20.53 369.9% 0+0k 8+805440io 0pf+0w
>>>>> lapw2 -c -p (10:29:02) running LAPW2 in parallel mode
>>>> node112 87.4u 0.5s 1:37.85 89.9% 0+0k 0+8104io 0pf+0w
>>>> node105 86.8u 0.9s 1:30.90 96.5% 0+0k 198064+8096io 0pf+0w
>>>> node122 84.7u 0.6s 1:27.71 97.3% 0+0k 0+8088io 0pf+0w
>>>> node131 87.9u 1.0s 1:31.00 97.7% 0+0k 0+8088io 0pf+0w
>>>> Summary of lapw2para:
>>>> node112 user=87.4 wallclock=97.85
>>>> node105 user=86.8 wallclock=90.9
>>>> node122 user=84.7 wallclock=87.71
>>>> node131 user=87.9 wallclock=91
>>>> 349.001u 3.592s 1:41.96 345.8% 0+0k 204504+42240io 0pf+0w
>>>>> lcore (10:30:44) 0.176u 0.060s 0:01.05 21.9% 0+0k 0+5336io 0pf+0w
>>>>> mixer (10:30:46) 1.436u 0.168s 0:01.99 79.8% 0+0k 0+11920io 0pf+0w
>>>> :ENERGY convergence: 0 0.001 0
>>>> :CHARGE convergence: 0 0.001 0
>>>> ec cc and fc_conv 0 0 0
>>>>
>>>> In spite of that, when I log in to the nodes, I see the following:
>>>>
>>>> node112:~> nice top -c -u kakhaber (this is the master node)
>>>>
>>>> top - 10:58:42 up 116 days, 23:19, 1 user, load average: 8.01,
>>>> 7.77, 7.33
>>>> Tasks: 110 total, 10 running, 100 sleeping, 0 stopped, 0 zombie
>>>> Cpu(s): 90.5%us, 0.3%sy, 9.1%ni, 0.0%id, 0.0%wa, 0.0%hi,
>>>> 0.0%si, 0.0%st
>>>> Mem: 16542480k total, 16144412k used, 398068k free, 105896k buffers
>>>> Swap: 4000144k total, 18144k used, 3982000k free, 11430460k cached
>>>>
>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>> 14474 kakhaber 20 0 937m 918m 2020 R 100 5.7 26:06.48
>>>> lapw1c lapw1_4.def
>>>> 14458 kakhaber 20 0 926m 907m 2020 R 98 5.6 26:01.62
>>>> lapw1c lapw1_3.def
>>>> 14443 kakhaber 20 0 934m 915m 2028 R 98 5.7 25:46.37
>>>> lapw1c lapw1_2.def
>>>> 14428 kakhaber 20 0 936m 917m 2028 R 66 5.7 24:26.43
>>>> lapw1c lapw1_1.def
>>>> 5952 kakhaber 20 0 13724 1360 820 S 0 0.0 0:00.00
>>>> /bin/tcsh /var/spoo
>>>> 6077 kakhaber 20 0 3920 736 540 S 0 0.0 0:00.00
>>>> /bin/csh -f /home/k
>>>> 10597 kakhaber 20 0 3928 780 572 S 0 0.0 0:00.00
>>>> /bin/csh -f /home/k
>>>> 14320 kakhaber 20 0 11252 1180 772 S 0 0.0 0:00.00
>>>> /bin/tcsh -f /home/
>>>> 14336 kakhaber 20 0 3920 800 604 S 0 0.0 0:00.62
>>>> /bin/csh -f /home/k
>>>> 14427 kakhaber 20 0 3920 440 244 S 0 0.0 0:00.00
>>>> /bin/csh -f /home/k
>>>> 14442 kakhaber 20 0 3920 432 236 S 0 0.0 0:00.00
>>>> /bin/csh -f /home/k
>>>> 14457 kakhaber 20 0 3920 432 236 S 0 0.0 0:00.00
>>>> /bin/csh -f /home/k
>>>> 14472 kakhaber 20 0 3920 432 236 S 0 0.0 0:00.00
>>>> /bin/csh -f /home/k
>>>> 16499 kakhaber 20 0 77296 1808 1100 R 0 0.0 0:00.00
>>>> sshd: kakhaber at pts/
>>>> 16500 kakhaber 20 0 16212 2032 1080 S 0 0.0 0:00.02 -tcsh
>>>> 16603 kakhaber 24 4 10620 1120 848 R 0 0.0 0:00.02 top
>>>> -c -u kakhaber
>>>>
>>>>
>>>> node105:~> nice top -c -u kakhaber
>>>> top - 11:01:37 up 116 days, 23:23, 1 user, load average: 3.00,
>>>> 3.00, 3.00
>>>> Tasks: 99 total, 3 running, 96 sleeping, 0 stopped, 0 zombie
>>>> Cpu(s): 2.9%us, 20.6%sy, 49.9%ni, 25.0%id, 0.0%wa, 0.1%hi,
>>>> 1.6%si, 0.0%st
>>>> Mem: 16542480k total, 6277020k used, 10265460k free, 233364k buffers
>>>> Swap: 4000144k total, 13512k used, 3986632k free, 5173312k cached
>>>>
>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>> 18955 kakhaber 20 0 77296 1808 1100 S 0 0.0 0:00.00
>>>> sshd: kakhaber at pts/
>>>> 18956 kakhaber 20 0 16212 2032 1080 S 0 0.0 0:00.02 -tcsh
>>>> 19071 kakhaber 24 4 10620 1112 848 R 0 0.0 0:00.00 top
>>>> -c -u kakhaber
>>>>
>>>> For node122 and node131 the output is the same as for node105.
>>>>
>>>>> Anyway, k-parallel does not use mpi at all and you have to read the
>>>>> requirements specified in the UG.
>>>>
>>>> I know, but I meant the following: if k-point parallelization in
>>>> WIEN2K_09.1 does not work because of a problem with the interconnection
>>>> between different nodes, then I would expect MPI parallelization to be
>>>> impossible as well. But the MPI-parallel jobs run without any problem.
>>>>
>>>> Let me suggest one possibility (it may be trivial or wrong).
>>>> In WIEN2K_08.1 k-point parallelization works, but all processes run on
>>>> the master node. In WIEN2K_09.1 k-point parallelization does not work at
>>>> all. Maybe there is some restriction in WIEN2K_09.1 preventing different
>>>> k-point processes from running on the same node, and this is the reason
>>>> for the crash in parallel lapw1?
>>>>
>>>> Is such a suggestion reasonable?
>>>> I would be extremely thankful for your additional advice.
>>>>
>>>>
>>>>> Kakhaber Jandieri schrieb:
>>>>>> Dear Prof. Blaha,
>>>>>>
>>>>>> Thank you for your reply.
>>>>>>
>>>>>>> Can you ssh node120 ps
>>>>>>> without supplying a password ?
>>>>>>
>>>>>> No, I cannot ssh to the nodes without supplying a password, but in my
>>>>>> parallel_options I have "setenv MPI_REMOTE 0". I thought that our
>>>>>> cluster has a shared-memory architecture, since MPI parallelization
>>>>>> works without any problem for one k-point. I checked the corresponding
>>>>>> nodes; they were all loaded. Maybe I misunderstood something. Are the
>>>>>> requirements for MPI parallelization different from those for k-point
>>>>>> parallelization?
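>>>>>>
>>>>>> (As far as I understand, passwordless ssh between nodes that share the
>>>>>> home directory can be set up roughly as follows, assuming OpenSSH; this
>>>>>> is only a sketch of the standard procedure:
>>>>>>
>>>>>> ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
>>>>>> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
>>>>>> chmod 600 ~/.ssh/authorized_keys
>>>>>>
>>>>>> After that, "ssh node120 ps" should work without a password prompt.)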
>>>>>>
>>>>>>> Try x lapw1 -p on the commandline.
>>>>>>> What exactly is the "error" ?
>>>>>>
>>>>>> Just now, to try your suggestions, I ran a new task with k-point
>>>>>> parallelization. The .machines file is:
>>>>>> granularity:1
>>>>>> 1:node120
>>>>>> 1:node127
>>>>>> 1:node121
>>>>>> 1:node123
>>>>>>
>>>>>> with node120 as the master node.
>>>>>>
>>>>>> The output of x lapw1 -p is:
>>>>>> starting parallel lapw1 at Sun Jun 13 22:44:08 CEST 2010
>>>>>> -> starting parallel LAPW1 jobs at Sun Jun 13 22:44:08 CEST 2010
>>>>>> running LAPW1 in parallel mode (using .machines)
>>>>>> 4 number_of_parallel_jobs
>>>>>> [1] 31314
>>>>>> [2] 31341
>>>>>> [3] 31357
>>>>>> [4] 31373
>>>>>> Permission denied, please try again.
>>>>>> Permission denied, please try again.
>>>>>> Received disconnect from 172.26.6.120: 2: Too many
>>>>>> authentication failures for kakhaber
>>>>>> [1] Done ( ( $remote $machine[$p] ...
>>>>>> Permission denied, please try again.
>>>>>> Permission denied, please try again.
>>>>>> Received disconnect from 172.26.6.127: 2: Too many
>>>>>> authentication failures for kakhaber
>>>>>> Permission denied, please try again.
>>>>>> Permission denied, please try again.
>>>>>> Received disconnect from 172.26.6.121: 2: Too many
>>>>>> authentication failures for kakhaber
>>>>>> [3] - Done ( ( $remote $machine[$p] ...
>>>>>> [2] - Done ( ( $remote $machine[$p] ...
>>>>>> Permission denied, please try again.
>>>>>> Permission denied, please try again.
>>>>>> Received disconnect from 172.26.6.123: 2: Too many
>>>>>> authentication failures for kakhaber
>>>>>> [4] Done ( ( $remote $machine[$p] ...
>>>>>> node120(1) node127(1) node121(1) node123(1) **
>>>>>> LAPW1 crashed!
>>>>>> cat: No match.
>>>>>> 0.116u 0.324s 0:11.88 3.6% 0+0k 0+864io 0pf+0w
>>>>>> error: command /home/kakhaber/WIEN2K_09/lapw1cpara -c
>>>>>> lapw1.def failed
>>>>>>
>>>>>>> How many k-points do you have ? ( 4 ?)
>>>>>>
>>>>>> Yes, I have 4 k-points.
>>>>>>
>>>>>>> Content of .machine1 and .processes
>>>>>>
>>>>>> marc-hn:~/wien_work/GaAsB> cat .machine1
>>>>>> node120
>>>>>> marc-hn:~/wien_work/GaAsB> cat .machine2
>>>>>> node127
>>>>>> marc-hn:~/wien_work/GaAsB> cat .machine3
>>>>>> node121
>>>>>> marc-hn:~/wien_work/GaAsB> cat .machine4
>>>>>> node123
>>>>>>
>>>>>> marc-hn:~/wien_work/GaAsB> cat .processes
>>>>>> init:node120
>>>>>> init:node127
>>>>>> init:node121
>>>>>> init:node123
>>>>>> 1 : node120 : 1 : 1 : 1
>>>>>> 2 : node127 : 1 : 1 : 2
>>>>>> 3 : node121 : 1 : 1 : 3
>>>>>> 4 : node123 : 1 : 1 : 4
>>>>>>
>>>>>>> While x lapw1 -p is running, do a ps -ef |grep lapw
>>>>>>
>>>>>> I did not have enough time to do it; the program crashed before that.
>>>>>>
>>>>>>> Your .machines file is most likely a rather "useless" one. The
>>>>>>> mpi-lapw1 diagonalization (SCALAPACK) is almost a factor of 2 slower
>>>>>>> than the serial version, thus your speedup by using 2 processors in
>>>>>>> mpi-mode will be very small.
>>>>>>
>>>>>> Yes, I know, but at the moment I am simply trying to set up the
>>>>>> calculations with WIEN2k. For "real" calculations I will use many
>>>>>> more processors.
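>>>>>>
>>>>>> (If I read the UG correctly, a .machines file that combines k-point and
>>>>>> mpi parallelization lists several cores per line, e.g. something like
>>>>>>
>>>>>> granularity:1
>>>>>> 1:node120:4
>>>>>> 1:node127:4
>>>>>>
>>>>>> which should run two k-point groups, each as a 4-process mpi lapw1 job;
>>>>>> the node names here are just examples.)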
>>>>>>
>>>>>> And finally, some additional information. As I wrote in my previous
>>>>>> letters, in WIEN2k_08.1 k-point parallelization works, but all
>>>>>> processes run on the master node while all other reserved nodes are
>>>>>> idle. I forgot to mention: this is true for lapw1 only. lapw2 is
>>>>>> distributed among all reserved nodes.
>>>>>>
>>>>>> Thank you once again. I am looking forward to your further advice.
>>>>>>
>>>>>>
>>>>>> Dr. Kakhaber Jandieri
>>>>>> Department of Physics
>>>>>> Philipps University Marburg
>>>>>> Tel:+49 6421 2824159 (2825704)
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Wien mailing list
>>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>>
>>>>> --
>>>>> -----------------------------------------
>>>>> Peter Blaha
>>>>> Inst. Materials Chemistry, TU Vienna
>>>>> Getreidemarkt 9, A-1060 Vienna, Austria
>>>>> Tel: +43-1-5880115671
>>>>> Fax: +43-1-5880115698
>>>>> email: pblaha at theochem.tuwien.ac.at
>>>>> -----------------------------------------
>>>>> _______________________________________________
>>>>> Wien mailing list
>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>
>>>>
>>>>
>>>> Dr. Kakhaber Jandieri
>>>> Department of Physics
>>>> Philipps University Marburg
>>>> Tel:+49 6421 2824159 (2825704)
>>>>
>>>>
>>>> _______________________________________________
>>>> Wien mailing list
>>>> Wien at zeus.theochem.tuwien.ac.at
>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>
>>> --
>>>
>>> P.Blaha
>>> --------------------------------------------------------------------------
>>> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
>>> Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
>>> Email: blaha at theochem.tuwien.ac.at WWW:
>>> http://info.tuwien.ac.at/theochem/
>>> --------------------------------------------------------------------------
>>> _______________________________________________
>>> Wien mailing list
>>> Wien at zeus.theochem.tuwien.ac.at
>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>>
>>
>> Dr. Kakhaber Jandieri
>> Department of Physics
>> Philipps University Marburg
>> Tel:+49 6421 2824159 (2825704)
>>
>>
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
> Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
Dr. Kakhaber Jandieri
Department of Physics
Philipps University Marburg
Tel:+49 6421 2824159 (2825704)