[Wien] slurm mpi
Gavin Abo
gsabo at crimson.ua.edu
Tue May 7 04:20:44 CEST 2019
WIEN2k 18.2 usersguide (pg. 237) has:
USE_REMOTE [0|1] determines whether parallel jobs are run in background
(on shared memory machines) or using ssh.
Since you are using ssh-copy-id to set up ssh, you most likely need
USE_REMOTE=1 [
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17572.html
].
"permission_denied, please try again" might come from a failed password
entry, as seen at:
https://askubuntu.com/questions/315377/ssh-error-permission-denied-please-try-again
In your .machines, you have a hostname for each of your nodes. You don't
mention what they are called, so say they are called node1, node2, ...,
etc.
Try to ssh into each of the nodes listed in your .machines file from
your head node:
ssh node1
exit
ssh node2
exit
...
That might help you find which nodes the "permission_denied, please try
again" is occurring with.
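If there are many nodes, the same check can be scripted. The following is
only a sketch: check_nodes is a hypothetical helper (not part of WIEN2k or
Slurm), and node1, node2 are the placeholder hostnames from above. It uses
the OpenSSH option BatchMode=yes so that ssh fails instead of asking for a
password.

```shell
#!/bin/sh
# Hypothetical helper, not part of WIEN2k or Slurm.
# SSH can be overridden for a dry run; it defaults to the real ssh client.
SSH="${SSH:-ssh}"

check_nodes() {
    failed=0
    for node in "$@"; do
        # BatchMode=yes: never prompt for a password, just fail.
        # ConnectTimeout=5: give up on an unreachable node after 5 seconds.
        # "true" is the remote command; it only tests that login works.
        if $SSH -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
            </dev/null >/dev/null 2>&1; then
            echo "ok: $node"
        else
            echo "FAILED: $node"
            failed=1
        fi
    done
    # Non-zero return status if at least one node could not be reached.
    return "$failed"
}
```

For example, "check_nodes node1 node2" run on the head node would print one
line per node; any node reported as FAILED is a candidate for the
"permission_denied" message.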
It is probably less likely to be the issue, but you might also need
compute node to compute node keys set up [
https://users.open-mpi.narkive.com/mtYcZsVm/ompi-users-problem-with-connecting-to-3-or-more-nodes
]. For example, you may need to check ssh from node1 to node2 (and so on):
ssh node1
ssh node2
...
exit
exit
...
If you have issues with passwordless login using SSH keys, the following
webpages might help:
https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/
https://www.ssh.com/ssh/copy-id
I'm not sure how the "1: hostname hostname" format behaves. I suggest
using the format:
1:hostname:1
as in, for example:
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17110.html
http://www.wien2k.at/reg_user/faq/ecss_hliu_051012.pdf (for example on
slide 7, 3:gamma:2)
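As a sketch of how a job script could write a line in that format,
make_mpi_line below is a hypothetical helper (not part of WIEN2k). It
assumes that several "hostname:cores" entries separated by spaces may
follow the weight on one line, and it writes only that one line, not a
complete .machines file.

```shell
#!/bin/sh
# Hypothetical helper, not part of WIEN2k.
# Usage: make_mpi_line CORES_PER_NODE HOST [HOST ...]
# Prints one line of the form "1:host1:N host2:N ...".
make_mpi_line() {
    ncores="$1"
    shift
    line="1"      # weight of this line
    sep=":"       # first host is joined with ":", later hosts with a space
    for host in "$@"; do
        line="${line}${sep}${host}:${ncores}"
        sep=" "
    done
    printf '%s\n' "$line"
}
```

For example, "make_mpi_line 2 node1 node2" prints "1:node1:2 node2:2". In
a Slurm job, the host list could come from
scontrol show hostnames "$SLURM_JOB_NODELIST".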
On 5/6/2019 6:23 AM, webfinder at ukr.net wrote:
> Dear wien2k users,
>
> wien2k_18.2
> I'm trying to run a test task on a cluster with slurm batch system
> using mpi parallelization.
>
> In "parallel_options" USE_REMOTE=0, MPI_REMOTE=0.
> (during the siteconfig_lapw the slurm option was chosen)
>
> the k-point parallelization works well. But if I change the
> "slurm.job" script to produce .machines file for mpi run
> (e.g. from
> 1: hostname
> 1: hostname
> ....
> to
> 1: hostname hostname ....)
>
> there is always a error message:
> permission_denied, please try again.
> permission_denied, please try again
> permission_denied, please try again (....)
>
> How can I solve this?
> How could it be that k-point parallelization works but mpi not?
>
> P.S. I have also tried after getting "nodelist" from batch system to
> include ssh-copy-id command to slurm.job script to copy the keys but
> the result is the same.
>
> Thank you for the answers!