[Wien] slurm mpi

Gavin Abo gsabo at crimson.ua.edu
Tue May 7 04:20:44 CEST 2019


WIEN2k 18.2 usersguide (pg. 237) has:

USE_REMOTE [0|1] determines whether parallel jobs are run in background 
(on shared memory machines) or using ssh.

Since you are using ssh-copy-id to set up ssh access, you most likely 
need USE_REMOTE=1 [ 
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17572.html 
].

"permission_denied, please try again" might come from failed password 
entry as seen at:

https://askubuntu.com/questions/315377/ssh-error-permission-denied-please-try-again

In your .machines file, you have a hostname for each of your nodes.  You 
don't mention what they are called, but say they are called node1, 
node2, and so on.

Try to ssh into each of the nodes listed in your .machines file from 
your head node:

ssh node1
exit
ssh node2
exit
...

That might help you find which nodes produce the "permission_denied, 
please try again" message.
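
If there are many nodes, the manual check above could be scripted.  The 
following is only a sketch, assuming a POSIX shell and awk on the head 
node and a .machines file whose host lines look like "1:node1:2 node2:2" 
or "lapw0:node1:2".  The function names machines_hosts and check_nodes 
are made up for illustration.  BatchMode=yes makes ssh fail instead of 
prompting for a password, so a node without a working key shows up as 
FAILED:

```shell
# Print the unique host names found in a .machines file
# (path in $1, default ./.machines).  Assumed line layout: "N:host[:cores] ..."
machines_hosts() {
    awk '
        /^[ \t]*#/ { next }                       # skip comment lines
        /^[ \t]*([0-9]+|lapw0)[ \t]*:/ {          # only host lines
            sub(/^[^:]*:/, "")                    # drop the "1:" or "lapw0:" prefix
            n = split($0, entries, /[ \t]+/)
            for (i = 1; i <= n; i++) {
                split(entries[i], parts, ":")     # "node1:2" -> "node1"
                if (parts[1] != "" && !(parts[1] in seen)) {
                    seen[parts[1]] = 1
                    print parts[1]
                }
            }
        }
    ' "${1:-.machines}"
}

# Try a non-interactive ssh login to every host and report the result.
check_nodes() {
    for host in $(machines_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}
```

Run check_nodes in the directory that holds the .machines file, or pass 
the path as the first argument.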

Perhaps less likely, but you may also need keys set up between the 
compute nodes themselves [ 
https://users.open-mpi.narkive.com/mtYcZsVm/ompi-users-problem-with-connecting-to-3-or-more-nodes 
].  For example, check ssh from node1 to node2 (and so on):

ssh node1
ssh node2
...
exit
exit
...
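
The node-to-node check can be sketched the same way.  This assumes a 
POSIX shell, that the head node can already reach every node, and that 
the node names are passed as arguments; node_pairs and 
check_node_to_node are hypothetical names.  Note that with BatchMode=yes 
a FAILED can also mean the target's host key is not yet known on the 
source node, since ssh cannot ask for confirmation:

```shell
# Print every ordered pair "src dst" of distinct hosts given as arguments.
node_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            if [ "$src" != "$dst" ]; then
                echo "$src $dst"
            fi
        done
    done
}

# From the head node, log in to src and from there try to log in to dst.
check_node_to_node() {
    node_pairs "$@" | while read -r src dst; do
        # -n keeps ssh from swallowing the pair list on stdin
        if ssh -n -o BatchMode=yes "$src" ssh -o BatchMode=yes "$dst" true 2>/dev/null; then
            echo "$src -> $dst: ok"
        else
            echo "$src -> $dst: FAILED"
        fi
    done
}
```

For example: check_node_to_node node1 node2 node3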

If you have issues with passwordless login using SSH keys, the following 
webpages might help:

https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/
https://www.ssh.com/ssh/copy-id

I'm not sure how the "1: hostname hostname" format behaves.  I suggest 
using the format:

1:hostname:1

as shown, for example, at these links:

https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17110.html
http://www.wien2k.at/reg_user/faq/ecss_hliu_051012.pdf (for example on 
slide 7, 3:gamma:2)
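
As a rough sketch of how a slurm.job script might write lines in that 
format: the function name machines_lines is made up, the core count per 
host is an assumption you would adjust, and any other lines your 
.machines file needs are left out.  It reads one host name per line and 
prints "1:host:cores" for each:

```shell
# Read host names (one per line) on stdin and print one
# "1:host:cores" line per host; cores comes from $1 (default 1).
machines_lines() {
    cores="${1:-1}"
    while read -r host; do
        if [ -n "$host" ]; then
            echo "1:${host}:${cores}"
        fi
    done
}

# Possible use inside the job script (not run here):
#   scontrol show hostnames "$SLURM_JOB_NODELIST" | machines_lines 1 > .machines
```

scontrol show hostnames expands the compact Slurm node list into one 
host name per line, which is the input this function expects.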


On 5/6/2019 6:23 AM, webfinder at ukr.net wrote:
> Dear wien2k users,
>
> wien2k_18.2
> I'm trying to run a test task on a cluster with slurm batch system 
> using mpi parallelization.
>
> In "parallel_options" USE_REMOTE=0, MPI_REMOTE=0.
> (during the siteconfig_lapw the slurm option was chosen)
>
> the k-point parallelization works well. But if I change the 
> "slurm.job" script to produce .machines file for mpi run
> (e.g. from
> 1: hostname
> 1: hostname
> ....
> to
> 1: hostname hostname ....)
>
> there is always an error message:
> permission_denied, please try again.
> permission_denied, please try again
> permission_denied, please try again (....)
>
> How can I solve this?
> How could it be that k-point parallelization works but mpi not?
>
> P.S. I have also tried after getting "nodelist" from batch system to 
> include ssh-copy-id command to slurm.job script to copy the keys but 
> the result is the same.
>
> Thank you for the answers!
