[Wien] slurm mpi

Gavin Abo gsabo at crimson.ua.edu
Tue May 7 14:38:49 CEST 2019


The "Permission denied 
(publickey,gssapi-keyex,gssapi-with-mic,password)" comes up with 
different causes in a Google search.  One time, that error seemed to go 
away with a user by having them ssh into the nodes and fix the ssh file 
permissions following the webpage:

https://serverfault.com/questions/253313/ssh-returns-bad-owner-or-permissions-on-ssh-config
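
For reference, the fix described there usually boils down to tightening 
the permissions on the user's ~/.ssh directory and its files, roughly 
like this (a minimal sketch; which of these files exist depends on the 
key setup):

   chmod 700 ~/.ssh
   chmod 600 ~/.ssh/config ~/.ssh/authorized_keys ~/.ssh/id_rsa
   chmod 644 ~/.ssh/id_rsa.pub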

However, since you are not able to ssh directly into the nodes, you 
would likely have to ask your admin how to do this indirectly, or have 
them do it for you.


On 5/7/2019 3:33 AM, webfinder at ukr.net wrote:
> Dear Prof. Blaha
>
> thank you for the explanation!
> Sorry, I should have put "hostname" in quotes. The script I used is based 
> on the one in the WIEN-FAQ and produces .machines from the nodes provided 
> by slurm:
> for k-points:
> #
> 1:n270
> 1:n270
> 1:n270
> 1:n270
> 1:n270
> ....
> granularity:1
> extrafine:1
>
> for mpi:
> #
> 1:n270 n270 n270 n270 n270 ....
> granularity:1
> extrafine:1
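> 
> For reference, the generation step in such a slurm.job looks roughly 
> like this (an illustrative sketch, not the exact FAQ script; scontrol 
> expands n[270-272] into individual hostnames):
> 
>    nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST)
>    echo '#' > .machines
>    # k-point flavor: one "1:host" line per task
>    for n in $nodes; do
>      for i in $(seq $SLURM_NTASKS_PER_NODE); do
>        echo "1:$n" >> .machines
>      done
>    done
>    echo 'granularity:1' >> .machines
>    echo 'extrafine:1' >> .machines
> 
> (the mpi flavor instead writes a single "1:" line with all hostnames 
> repeated)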
>
> After I changed USE_REMOTE to 1, the "Permission denied, please try 
> again" error also appears for k-point parallelization.
> As stated in the userguide, I ran "ssh-keygen" and copied the key to 
> "authorized_keys", but the result is the same.
> As a "low-level" user on a cluster I dont have any permission to login 
> to the nodes.
> For k-point parallelization with USE_REMOTE=1, the *.out file has these 
> lines:
> Got 96 cores nodelist n[270-272] tasks_per_node 32 jobs_per_node 32 because OMP_NUM_THREADS = 1
> 96 nodes for this job: n270 n270 n270 n270 n270 n270 ....
> 10:04:01 up 18 days, 58 min, 0 users, load average: 0.04, 0.04, 0.07
> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
> ...
> -------- .machine0 : processors
> running dstart in single mode
> C T F
> DSTART ENDS
> 22.030u 0.102s 0:22.20 99.6% 0+0k 0+0io 0pf+0w
> LAPW0 END
> full diagonalization forced
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> [1] + Done ( ( $remote $machine[$p] "cd $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> ...
>
> For mpi parallelization with USE_REMOTE=1, MPI_REMOTE=0, and WIEN_MPIRUN 
> set to "srun ...", the output is:
> LAPW0 END
> Abort(0) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> Abort(0) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> ...
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> bccTi54Htet.scf1up_1: No such file or directory.
> grep: No match.
> grep: No match.
> grep: No match.
>
> If WIEN_MPIRUN is "mpirun -n _NP_ -machinefile _HOSTS_ _EXEC_", 
> the output is:
>  LAPW0 END
> Abort(0) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> Abort(9) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
> w2k_dispatch_signal(): received: Terminated
> ...
> Abort(-1694629136) on node 11 (rank 11 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1694629136) - process 11
> [cli_11]: readline failed
> Abort(2118074352) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 2118074352) - process 2
> [cli_2]: readline failed
> [cli_2]: readline failed
> WIEN2K ABORTING
> [cli_1]: readline failed
> WIEN2K ABORTING
>
>
>
> --- Original message ---
> From: "Peter Blaha" <pblaha at theochem.tuwien.ac.at>
> Date: 7 May 2019, 09:14:44
>
>     Setting USE_REMOTE=0 means that you do not use "ssh" in
>     k-parallel mode.
>     This has the following consequences:
>     What you write for "hostname" in .machines is not important; only the
>     number of lines counts. It will spawn as many k-parallel jobs as you
>     have lines (1:hostname), but they will all run ONLY on the "masternode",
>     i.e. you can use only ONE node within your slurm job.
>
>     When you use mpi-parallel (with MPI_REMOTE=0 AND the MPIRUN command set
>     to the "srun ..." command), it will use an srun command to spawn the mpi
>     job, not the usual mpirun command. In this case, however, "hostname" must
>     be the real name of the nodes where you want to run. The slurm script has
>     to find out the node names and insert them properly.
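>
>     Schematically, the corresponding parallel_options settings would look
>     like this (the exact srun template is written by siteconfig and varies
>     between systems; the srun line below is only an illustration):
>
>     setenv USE_REMOTE 0
>     setenv MPI_REMOTE 0
>     setenv WIEN_MPIRUN "srun -n _NP_ _EXEC_"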
>
>     On 06.05.2019 at 14:23, webfinder at ukr.net wrote:
>     > Dear wien2k users,
>     > 
>     > wien2k_18.2
>     > I'm trying to run a test task on a cluster with slurm batch system using 
>     > mpi parallelization.
>     > 
>     > In "parallel_options" USE_REMOTE=0, MPI_REMOTE=0.
>     > (during the siteconfig_lapw the slurm option was chosen)
>     > 
>     > The k-point parallelization works well. But if I change the "slurm.job" 
>     > script to produce a .machines file for an mpi run
>     > (e.g. from
>     > 1: hostname
>     > 1: hostname
>     > ....
>     > to
>     > 1: hostname hostname ....)
>     > 
>     > there is always an error message:
>     > Permission denied, please try again.
>     > Permission denied, please try again.
>     > Permission denied, please try again (....)
>     > 
>     > How can I solve this?
>     > How can it be that k-point parallelization works but mpi does not?
>     > 
>     > P.S. After getting the "nodelist" from the batch system, I have also 
>     > tried adding an ssh-copy-id command to the slurm.job script to copy the 
>     > keys, but the result is the same.
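>     > 
>     > That attempt was along these lines (an illustrative sketch):
>     > 
>     >    for n in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
>     >      ssh-copy-id -i ~/.ssh/id_rsa.pub $n
>     >    done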
>     > 
>     > Thank you for the answers!
>     > 
>     > 
>     > 
>     > 
>
>     -- 
>     --------------------------------------------------------------------------
>     Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060  Vienna
>     Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
>     Email: blaha at theochem.tuwien.ac.at        WIEN2k: http://www.wien2k.at
>     WWW:   http://www.imc.tuwien.ac.at/tc_blaha
>     --------------------------------------------------------------------------
>