[Wien] k-point parallelization in WIEN2K_09.1

Kakhaber Jandieri kakhaber.jandieri at physik.uni-marburg.de
Mon Jun 14 01:34:10 CEST 2010


Dear Prof. Blaha,

Thank you for your reply.

> Can you    ssh node120 ps
> without supplying a password ?

No, I can't ssh to the nodes without supplying a password, but in my  
parallel_options I have setenv MPI_REMOTE 0. I thought that our  
cluster has a shared-memory architecture, since the  
MPI parallelization works without any problem for 1 k-point. I checked  
the corresponding nodes; all of them were loaded. Maybe I misunderstood  
something. Are the requirements for MPI parallelization different from  
those for k-point parallelization?
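If key-based ssh is indeed required for the k-point scripts, I would set it up roughly like this (a sketch, assuming standard OpenSSH tools; the demo generates the key pair in a temporary directory so nothing in ~/.ssh is overwritten, and node120 is just one of our nodes):

```shell
# Generate an RSA key pair with an empty passphrase in a scratch directory:
keydir=$(mktemp -d)
ssh-keygen -t rsa -N "" -f "$keydir/id_rsa" -q

# On the real cluster one would then install the public key on each node, e.g.:
#   ssh-copy-id -i "$keydir/id_rsa.pub" node120
# and verify that a remote command runs without a password prompt:
#   ssh node120 ps
ls "$keydir"
```

(In practice the key would of course live in ~/.ssh/id_rsa and be copied to every compute node listed in .machines.)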

> Try x lapw1 -p on the commandline.
> What exactly is the "error" ?

Just now, to try your suggestions, I ran a new task with k-point  
parallelization. The .machines file is:
granularity:1
1:node120
1:node127
1:node121
1:node123

with node120 as a master node.

The output of x lapw1 -p is:
starting parallel lapw1 at Sun Jun 13 22:44:08 CEST 2010
->  starting parallel LAPW1 jobs at Sun Jun 13 22:44:08 CEST 2010
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 31314
[2] 31341
[3] 31357
[4] 31373
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 172.26.6.120: 2: Too many authentication  
failures for kakhaber
[1]    Done                   ( ( $remote $machine[$p]  ...
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 172.26.6.127: 2: Too many authentication  
failures for kakhaber
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 172.26.6.121: 2: Too many authentication  
failures for kakhaber
[3]  - Done                   ( ( $remote $machine[$p]  ...
[2]  - Done                   ( ( $remote $machine[$p]  ...
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 172.26.6.123: 2: Too many authentication  
failures for kakhaber
[4]    Done                   ( ( $remote $machine[$p]  ...
      node120(1)      node127(1)      node121(1)      node123(1) **   
LAPW1 crashed!
cat: No match.
0.116u 0.324s 0:11.88 3.6%        0+0k 0+864io 0pf+0w
error: command   /home/kakhaber/WIEN2K_09/lapw1cpara -c lapw1.def   failed

> How many k-points do you have ? ( 4 ?)

  Yes, I have 4 k-points.

> Content of .machine1 and .processes

marc-hn:~/wien_work/GaAsB> cat .machine1
node120
marc-hn:~/wien_work/GaAsB> cat .machine2
node127
marc-hn:~/wien_work/GaAsB> cat .machine3
node121
marc-hn:~/wien_work/GaAsB> cat .machine4
node123

marc-hn:~/wien_work/GaAsB> cat .processes
init:node120
init:node127
init:node121
init:node123
1 : node120 :  1 : 1 : 1
2 : node127 :  1 : 1 : 2
3 : node121 :  1 : 1 : 3
4 : node123 :  1 : 1 : 4

> While x lapw1 -p is running, do a    ps -ef |grep lapw

I did not have enough time to do it; the program crashed first.
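For next time, one way to catch even a short-lived lapw1 would be to poll ps in a loop, e.g. `while true; do ps -ef | grep '[l]apw'; sleep 1; done`. The bracket in the pattern keeps grep from matching its own command line. A generic sketch, illustrated here with a dummy background job since no lapw1 is running:

```shell
# A dummy background job standing in for lapw1:
sleep 30 &
pid=$!

# The '[s]leep' pattern matches the "sleep 30" process but not the grep itself:
found=$(ps -ef | grep -c '[s]leep 30')

kill "$pid"
echo "matches: $found"
```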

> Your .machines file is most likely a rather "useless" one. The mpi-lapw1
> diagonalization (SCALAPACK) is almost a factor of 2 slower than the serial
> version, thus your speedup by using 2 processors in mpi-mode will be
> very small.

Yes, I know, but for now I am simply trying to get the calculations  
running with WIEN2k. For "real" calculations I will use many more processors.
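For those later runs, if I understand the user's guide correctly, a .machines file that assigns several MPI cores to each k-point job would look roughly like this (only a sketch; the node names and core counts are placeholders, not our actual allocation):

```
granularity:1
1:node120:4
1:node121:4
1:node123:4
1:node127:4
lapw0:node120:4
```

Here each `1:node:n` line requests one k-point job run with n MPI processes on that node, and the `lapw0:` line runs lapw0 in MPI mode as well.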

And finally, for additional information: as I wrote in my previous letters, in  
WIEN2k_08.1 the k-point parallelization works, but all processes run  
on the master node while all the other reserved nodes stay idle. I forgot  
to mention that this is true for lapw1 only; lapw2 is distributed among  
all reserved nodes.

Thank you once again. I am looking forward to your further advice.


Dr. Kakhaber Jandieri
Department of Physics
Philipps University Marburg
Tel:+49 6421 2824159 (2825704)



