[Wien] k-point parallelization in WIEN2K_09.1

Kakhaber Jandieri kakhaber.jandieri at physik.uni-marburg.de
Wed Jun 16 15:46:31 CEST 2010


Dear Prof. Blaha,

> I do NOT believe that k-point parallel with an older WIEN2k was possible
> (unless you set it up with "rsh" instead of "ssh" and defined a   
> .rhosts file).

But it really is possible. I checked again; I even reinstalled
WIEN2K_08.1 to verify that the options are the same as those used for
WIEN2K_09.1.
I did not set up "rsh" and did not define a ".rhosts" file.
The behaviour of WIEN2K_08.1 is the same as I described in my previous
messages.
According to the dayfile, the k-points are distributed among all
reserved nodes. Here is a short fragment of the dayfile:


Calculating GaAsB in /home/kakhaber/wien_work/GaAsB
on node112 with PID 10597

     start 	(Wed Jun 16 09:50:23 CEST 2010) with lapw0 (40/99 to go)

     cycle 1 	(Wed Jun 16 09:50:23 CEST 2010) 	(40/99 to go)

>   lapw0 -p	(09:50:23) starting parallel lapw0 at Wed Jun 16 09:50:23 CEST 2010
--------
running lapw0 in single mode
77.496u 0.628s 1:18.47 99.5%	0+0k 0+7008io 0pf+0w
>   lapw1  -c -p 	(09:51:42) starting parallel lapw1 at Wed Jun 16 09:51:42 CEST 2010
->  starting parallel LAPW1 jobs at Wed Jun 16 09:51:42 CEST 2010
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
      node112(1) 2091.6u 2.3s 37:18.76 93.5% 0+0k 0+205296io 0pf+0w
      node105(1) 2024.2u 2.3s 34:26.54 98.0% 0+0k 0+198376io 0pf+0w
      node122(1) 2115.4u 5.1s 36:08.08 97.8% 0+0k 0+197808io 0pf+0w
      node131(1) 2041.3u 2.6s 35:19.70 96.4% 0+0k 0+202912io 0pf+0w
    Summary of lapw1para:
    node112	 k=1	 user=2091.6	 wallclock=2238.76
    node105	 k=1	 user=2024.2	 wallclock=2066.54
    node122	 k=1	 user=2115.4	 wallclock=2168.08
    node131	 k=1	 user=2041.3	 wallclock=2119.7
8274.113u 15.744s 37:20.53 369.9%	0+0k 8+805440io 0pf+0w
>   lapw2 -c  -p 	(10:29:02) running LAPW2 in parallel mode
       node112 87.4u 0.5s 1:37.85 89.9% 0+0k 0+8104io 0pf+0w
       node105 86.8u 0.9s 1:30.90 96.5% 0+0k 198064+8096io 0pf+0w
       node122 84.7u 0.6s 1:27.71 97.3% 0+0k 0+8088io 0pf+0w
       node131 87.9u 1.0s 1:31.00 97.7% 0+0k 0+8088io 0pf+0w
    Summary of lapw2para:
    node112	 user=87.4	 wallclock=97.85
    node105	 user=86.8	 wallclock=90.9
    node122	 user=84.7	 wallclock=87.71
    node131	 user=87.9	 wallclock=91
349.001u 3.592s 1:41.96 345.8%	0+0k 204504+42240io 0pf+0w
>   lcore	(10:30:44) 0.176u 0.060s 0:01.05 21.9%	0+0k 0+5336io 0pf+0w
>   mixer	(10:30:46) 1.436u 0.168s 0:01.99 79.8%	0+0k 0+11920io 0pf+0w
:ENERGY convergence:  0 0.001 0
:CHARGE convergence:  0 0.001 0
ec cc and fc_conv 0 0 0

In spite of that, when I log in to the nodes, I see the following:

node112:~> nice top -c -u kakhaber (this is the master node)

top - 10:58:42 up 116 days, 23:19,  1 user,  load average: 8.01, 7.77, 7.33
Tasks: 110 total,  10 running, 100 sleeping,   0 stopped,   0 zombie
Cpu(s): 90.5%us,  0.3%sy,  9.1%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16542480k total, 16144412k used,   398068k free,   105896k buffers
Swap:  4000144k total,    18144k used,  3982000k free, 11430460k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14474 kakhaber  20   0  937m 918m 2020 R  100  5.7  26:06.48 lapw1c lapw1_4.def
14458 kakhaber  20   0  926m 907m 2020 R   98  5.6  26:01.62 lapw1c lapw1_3.def
14443 kakhaber  20   0  934m 915m 2028 R   98  5.7  25:46.37 lapw1c lapw1_2.def
14428 kakhaber  20   0  936m 917m 2028 R   66  5.7  24:26.43 lapw1c lapw1_1.def
  5952 kakhaber  20   0 13724 1360  820 S    0  0.0   0:00.00 /bin/tcsh /var/spoo
  6077 kakhaber  20   0  3920  736  540 S    0  0.0   0:00.00 /bin/csh -f /home/k
10597 kakhaber  20   0  3928  780  572 S    0  0.0   0:00.00 /bin/csh -f /home/k
14320 kakhaber  20   0 11252 1180  772 S    0  0.0   0:00.00 /bin/tcsh -f /home/
14336 kakhaber  20   0  3920  800  604 S    0  0.0   0:00.62 /bin/csh -f /home/k
14427 kakhaber  20   0  3920  440  244 S    0  0.0   0:00.00 /bin/csh -f /home/k
14442 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f /home/k
14457 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f /home/k
14472 kakhaber  20   0  3920  432  236 S    0  0.0   0:00.00 /bin/csh -f /home/k
16499 kakhaber  20   0 77296 1808 1100 R    0  0.0   0:00.00 sshd: kakhaber at pts/
16500 kakhaber  20   0 16212 2032 1080 S    0  0.0   0:00.02 -tcsh
16603 kakhaber  24   4 10620 1120  848 R    0  0.0   0:00.02 top -c -u kakhaber


node105:~> nice top -c -u kakhaber
top - 11:01:37 up 116 days, 23:23,  1 user,  load average: 3.00, 3.00, 3.00
Tasks:  99 total,   3 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.9%us, 20.6%sy, 49.9%ni, 25.0%id,  0.0%wa,  0.1%hi,  1.6%si,  0.0%st
Mem:  16542480k total,  6277020k used, 10265460k free,   233364k buffers
Swap:  4000144k total,    13512k used,  3986632k free,  5173312k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18955 kakhaber  20   0 77296 1808 1100 S    0  0.0   0:00.00 sshd: kakhaber at pts/
18956 kakhaber  20   0 16212 2032 1080 S    0  0.0   0:00.02 -tcsh
19071 kakhaber  24   4 10620 1112  848 R    0  0.0   0:00.00 top -c -u kakhaber

For node122 and node131 the output is the same as for node105.

> Anyway, k-parallel does not use mpi at all and you have to read the
> requirements specified in the UG.

I know, but I meant the following: if k-point parallelization in
WIEN2K_09.1 does not work because of a problem with the interconnection
between different nodes, then I would expect the MPI parallelization to
be impossible as well. But the MPI-parallel jobs run without any
problem.
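As far as I understand, the k-point parallel scripts (lapw1para etc.)
start their jobs on the remote nodes with ssh (or whatever $remote is
set to), while the MPI jobs go through the MPI launcher, so the two
indeed have different requirements. A minimal check of the ssh side
(just a sketch, assuming OpenSSH; node105 is simply one of my reserved
nodes) would be:

# must print the remote hostname WITHOUT asking for a password;
# BatchMode=yes makes ssh fail instead of prompting interactively
ssh -o BatchMode=yes node105 hostname

# if this fails with "Permission denied", passwordless keys can be set up:
ssh-keygen -t rsa        # accept the defaults, empty passphrase
ssh-copy-id node105      # or append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys
                         # (sufficient once if the home directory is shared)

Such a failure would also match the "Permission denied ... Too many
authentication failures" messages in the x lapw1 -p output quoted
further below.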

Let me suggest one possibility (it may be trivial or wrong).
In WIEN2K_08.1 k-point parallelization works, but all processes run on
the master node. In WIEN2K_09.1 k-point parallelization does not work
at all. Could there be some restriction in WIEN2K_09.1 that prevents
different k-point processes from running on the same node, and could
this be the reason for the crash in parallel lapw1?

Is such a suggestion reasonable?
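If it helps to test this, a minimal sketch would be a .machines file
that forces all four k-point jobs onto the master node (node112 in my
current job), in the same format as the file quoted further below:

granularity:1
1:node112
1:node112
1:node112
1:node112

If x lapw1 -p then crashes as well, the same-node restriction I suspect
above might be real; if it runs, the problem is presumably in spawning
the jobs on the remote nodes.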
I will be extremely thankful for your additional advice.


> Kakhaber Jandieri wrote:
>> Dear Prof. Blaha,
>>
>> Thank you for your reply.
>>
>>> Can you    ssh node120 ps
>>> without supplying a password ?
>>
>> No, I cannot ssh to the nodes without supplying a password, but in my
>> parallel_options I have "setenv MPI_REMOTE 0". I thought that our
>> cluster has a shared-memory architecture, since the
>> MPI-parallelization works without any problem for 1 k-point. I
>> checked the corresponding nodes; all of them were loaded. Maybe I
>> misunderstood something. Are the requirements for
>> MPI-parallelization different from those for k-point parallelization?
>>
>>> Try x lapw1 -p on the commandline.
>>> What exactly is the "error" ?
>>
>> Just now, to try your suggestions, I ran a new task with k-point
>> parallelization. The .machines file is:
>> granularity:1
>> 1:node120
>> 1:node127
>> 1:node121
>> 1:node123
>>
>> with node120 as a master node.
>>
>> The output of x lapw1 -p is:
>> starting parallel lapw1 at Sun Jun 13 22:44:08 CEST 2010
>> ->  starting parallel LAPW1 jobs at Sun Jun 13 22:44:08 CEST 2010
>> running LAPW1 in parallel mode (using .machines)
>> 4 number_of_parallel_jobs
>> [1] 31314
>> [2] 31341
>> [3] 31357
>> [4] 31373
>> Permission denied, please try again.
>> Permission denied, please try again.
>> Received disconnect from 172.26.6.120: 2: Too many authentication failures for kakhaber
>> [1]    Done                   ( ( $remote $machine[$p]  ...
>> Permission denied, please try again.
>> Permission denied, please try again.
>> Received disconnect from 172.26.6.127: 2: Too many authentication failures for kakhaber
>> Permission denied, please try again.
>> Permission denied, please try again.
>> Received disconnect from 172.26.6.121: 2: Too many authentication failures for kakhaber
>> [3]  - Done                   ( ( $remote $machine[$p]  ...
>> [2]  - Done                   ( ( $remote $machine[$p]  ...
>> Permission denied, please try again.
>> Permission denied, please try again.
>> Received disconnect from 172.26.6.123: 2: Too many authentication failures for kakhaber
>> [4]    Done                   ( ( $remote $machine[$p]  ...
>>     node120(1)      node127(1)      node121(1)      node123(1) ** LAPW1 crashed!
>> cat: No match.
>> 0.116u 0.324s 0:11.88 3.6%        0+0k 0+864io 0pf+0w
>> error: command   /home/kakhaber/WIEN2K_09/lapw1cpara -c lapw1.def   failed
>>
>>> How many k-points do you have ? ( 4 ?)
>>
>> Yes, I have 4 k-points.
>>
>>> Content of .machine1 and .processes
>>
>> marc-hn:~/wien_work/GaAsB> cat .machine1
>> node120
>> marc-hn:~/wien_work/GaAsB> cat .machine2
>> node127
>> marc-hn:~/wien_work/GaAsB> cat .machine3
>> node121
>> marc-hn:~/wien_work/GaAsB> cat .machine4
>> node123
>>
>> marc-hn:~/wien_work/GaAsB> cat .processes
>> init:node120
>> init:node127
>> init:node121
>> init:node123
>> 1 : node120 :  1 : 1 : 1
>> 2 : node127 :  1 : 1 : 2
>> 3 : node121 :  1 : 1 : 3
>> 4 : node123 :  1 : 1 : 4
>>
>>> While x lapw1 -p is running, do a    ps -ef |grep lapw
>>
>> I did not have enough time to do it - the program crashed before I could.
>>
>>> Your .machines file is most likely a rather "useless" one. The mpi-lapw1
>>> diagonalization (SCALAPACK) is almost a factor of 2 slower than the serial
>>> version, thus your speedup by using 2 processors in mpi-mode will be
>>> very small.
>>
>> Yes, I know, but for now I am simply trying to get the calculations
>> running with WIEN2k. For "real" calculations I will use many more
>> processors.
>>
>> And finally, some additional information. As I wrote in my previous
>> messages, in WIEN2k_08.1 k-point parallelization works, but all
>> processes are running on the master node and all other reserved
>> nodes are idle. I forgot to mention: this is true for lapw1 only;
>> lapw2 is distributed among all reserved nodes.
>>
>> Thank you once again. I am looking forward to your further advice.
>>
>>
>> Dr. Kakhaber Jandieri
>> Department of Physics
>> Philipps University Marburg
>> Tel:+49 6421 2824159 (2825704)
>>
>>
>
> -- 
> -----------------------------------------
> Peter Blaha
> Inst. Materials Chemistry, TU Vienna
> Getreidemarkt 9, A-1060 Vienna, Austria
> Tel: +43-1-5880115671
> Fax: +43-1-5880115698
> email: pblaha at theochem.tuwien.ac.at
> -----------------------------------------



Dr. Kakhaber Jandieri
Department of Physics
Philipps University Marburg
Tel:+49 6421 2824159 (2825704)



