[Wien] k-point parallelization in WIEN2K_09.1
Kakhaber Jandieri
kakhaber.jandieri at physik.uni-marburg.de
Mon Jun 14 01:34:10 CEST 2010
Dear Prof. Blaha,
Thank you for your reply.
> Can you ssh node120 ps
> without supplying a password ?
No, I cannot ssh to the nodes without supplying a password, but in my
parallel_options I have setenv MPI_REMOTE 0. I thought that our
cluster has a shared-memory architecture, since the
MPI parallelization works without any problem for 1 k-point. I checked
the corresponding nodes, and all of them were loaded. Maybe I misunderstood
something. Are the requirements for MPI parallelization different from
those for k-point parallelization?
> Try x lapw1 -p on the commandline.
> What exactly is the "error" ?
Just now, to try your suggestions, I ran a new task with k-point
parallelization. The .machines file is:
granularity:1
1:node120
1:node127
1:node121
1:node123
with node120 as the master node.
The output of x lapw1 -p is:
starting parallel lapw1 at Sun Jun 13 22:44:08 CEST 2010
-> starting parallel LAPW1 jobs at Sun Jun 13 22:44:08 CEST 2010
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 31314
[2] 31341
[3] 31357
[4] 31373
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 172.26.6.120: 2: Too many authentication
failures for kakhaber
[1] Done ( ( $remote $machine[$p] ...
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 172.26.6.127: 2: Too many authentication
failures for kakhaber
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 172.26.6.121: 2: Too many authentication
failures for kakhaber
[3] - Done ( ( $remote $machine[$p] ...
[2] - Done ( ( $remote $machine[$p] ...
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 172.26.6.123: 2: Too many authentication
failures for kakhaber
[4] Done ( ( $remote $machine[$p] ...
node120(1) node127(1) node121(1) node123(1) **
LAPW1 crashed!
cat: No match.
0.116u 0.324s 0:11.88 3.6% 0+0k 0+864io 0pf+0w
error: command /home/kakhaber/WIEN2K_09/lapw1cpara -c lapw1.def failed
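The "Permission denied" messages above appear where the jobs are started through $remote on each node. As a minimal sketch only, this is how password-free login to the nodes listed in .machines could be checked. It assumes OpenSSH and a POSIX sh (not the csh used by the WIEN2k scripts); the function names machines_hosts and check_hosts are hypothetical, and only the simple "weight:host" form of the k-point lines is handled.

```shell
#!/bin/sh
# Sketch only: machines_hosts and check_hosts are hypothetical helper names.

# Print the hosts named on the k-point lines ("weight:host") of a .machines
# file, in order of first appearance. Lines that do not start with a number
# (for example "granularity:1") are skipped.
machines_hosts() {
    awk -F: '/^[0-9]+:/ {
        n = split($2, h, " ")
        for (i = 1; i <= n; i++)
            if (!seen[h[i]]++) print h[i]
    }' "$1"
}

# Attempt a non-interactive login on each host. With BatchMode=yes, ssh
# fails instead of prompting for a password.
check_hosts() {
    for host in $(machines_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: password-free login FAILED"
        fi
    done
}
```

Calling check_hosts .machines in the case directory would then print one line per node.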
> How many k-points do you have ? ( 4 ?)
Yes, I have 4 k-points.
> Content of .machine1 and .processes
marc-hn:~/wien_work/GaAsB> cat .machine1
node120
marc-hn:~/wien_work/GaAsB> cat .machine2
node127
marc-hn:~/wien_work/GaAsB> cat .machine3
node121
marc-hn:~/wien_work/GaAsB> cat .machine4
node123
marc-hn:~/wien_work/GaAsB> cat .processes
init:node120
init:node127
init:node121
init:node123
1 : node120 : 1 : 1 : 1
2 : node127 : 1 : 1 : 2
3 : node121 : 1 : 1 : 3
4 : node123 : 1 : 1 : 4
> While x lapw1 -p is running, do a ps -ef |grep lapw
I did not have enough time to do it; the program crashed before I could.
> Your .machines file is most likely a rather "useless" one. The mpi-lapw1
> diagonalization (SCALAPACK) is almost a factor of 2 slower than the serial
> version, thus your speedup by using 2 processors in mpi-mode will be
> very small.
Yes, I know, but I am simply trying to set up the calculations with
WIEN2k. For "real" calculations I will use many more processors.
Finally, some additional information. As I wrote in my previous letters, in
WIEN2k_08.1 k-point parallelization works, but all processes
run on the master node and all other reserved nodes are idle. I forgot
to mention that this is true for lapw1 only; lapw2 is distributed among
all reserved nodes.
Thank you once again. I am looking forward to your further advice.
Dr. Kakhaber Jandieri
Department of Physics
Philipps University Marburg
Tel:+49 6421 2824159 (2825704)