[Wien] slurm mpi

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue May 7 12:08:54 CEST 2019


So it seems that your cluster forbids the use of ssh (even on the 
assigned nodes). If this is the case, you MUST use   USE_REMOTE=0, and in 
k-parallel mode you can use only one node (32 cores).
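The switches mentioned here live in $WIENROOT/parallel_options. A minimal 
sketch of the relevant lines (csh syntax; the comments are only my reading 
of the settings discussed in this thread, not official documentation):

```csh
# Fragment of $WIENROOT/parallel_options (csh syntax) -- illustrative only.
setenv USE_REMOTE 0   # do not use ssh to start k-parallel jobs; all of them
                      # then run on the master node of the slurm allocation
setenv MPI_REMOTE 0   # start the mpi processes from the local node as well
```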

For mpi I do not know. There should be some "user guide" (web site, 
wiki, ...) for your cluster, where all the details of how to use the 
cluster are listed. In particular it should say:

Which mpi + mkl + fftw you should use during compilation (maybe you 
have a "module" system?). (You did not say anything about how you 
compiled lapw1_mpi.)
How to execute an mpi job. On some clusters the standard "mpirun" 
command is no longer supported; on our cluster we have to use   srun 
   instead.
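As a concrete illustration, a slurm batch script for such a run might look 
like the following sketch. The module names, partition defaults, and time 
limit are placeholders of my own invention and must be replaced by whatever 
your cluster's user guide specifies:

```shell
#!/bin/bash
#SBATCH --job-name=wien2k-test
#SBATCH --nodes=1              # with USE_REMOTE=0, k-parallel is limited to one node
#SBATCH --ntasks-per-node=32
#SBATCH --time=04:00:00

# Placeholder module names -- check "module avail" on your cluster.
module load intel mkl fftw impi

cd "$SLURM_SUBMIT_DIR"

# The job script would generate .machines here (see the examples quoted
# below), then start the scf cycle in parallel mode.
run_lapw -p
```

On clusters where plain mpirun is disabled, WIEN_MPIRUN in 
$WIENROOT/parallel_options would use srun instead, e.g. something minimal 
like   setenv WIEN_MPIRUN "srun -n _NP_ _EXEC_"   (the exact options depend 
on the cluster).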

I don't know about your cluster; this depends on the SLURM version and 
the specific setup of the cluster.

PS: A possible cause of the lapw1_mpi problems is always a mismatch 
between mpi, blacs, and ScaLAPACK. Did you ever try to run    dstart 
or lapw0 in mpi mode? These are "simpler" mpi programs, as they do 
not use ScaLAPACK.
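For such a test one needs an mpi-style .machines file for the allocated 
nodes. Here is a minimal sketch of a shell helper that builds one from a 
list of hostnames; the function name is my own, and the fixed 
granularity/extrafine tail is modeled on the examples quoted below:

```shell
#!/bin/sh
# Build an mpi-style .machines file: a single "1:" line repeating every
# node tasks_per_node times, as in "1:n270 n270 n270 ...".
make_machines_mpi() {
    tasks_per_node=$1
    hosts=""
    while read -r node; do
        i=0
        while [ "$i" -lt "$tasks_per_node" ]; do
            hosts="$hosts$node "
            i=$((i + 1))
        done
    done
    hosts=${hosts% }                      # drop the trailing space
    printf '#\n1:%s\ngranularity:1\nextrafine:1\n' "$hosts"
}

# Inside a slurm job one would typically feed it the expanded nodelist:
#   scontrol show hostnames "$SLURM_JOB_NODELIST" \
#       | make_machines_mpi "$SLURM_NTASKS_PER_NODE" > .machines
```

With such a .machines file in the case directory, one can then try 
x dstart -p   and   x lapw0 -p   before attacking lapw1_mpi.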

On 5/7/19 11:33 AM, webfinder at ukr.net wrote:
> Dear Prof. Blaha
> 
> thank you for the explanation!
> Sorry, I should have put "hostname" in quotes. The script I used is based 
> on the one in the WIEN FAQ and produces .machines based on the nodes 
> provided by slurm:
> for k-points:
> #
> 1:n270
> 1:n270
> 1:n270
> 1:n270
> 1:n270
> ....
> granularity:1
> extrafine:1
> 
> for mpi:
> #
> 1:n270 n270 n270 n270 n270 ....
> granularity:1
> extrafine:1
> 
> After I changed USE_REMOTE to 1, the "Permission denied, please try 
> again" appears also for k-point parallelization.
> As stated in the user guide, I did things like "ssh-keygen" and copied 
> the key to "authorized_keys", but the result is the same.
> As a "low-level" user on the cluster I don't have any permission to log in 
> to the nodes.
> 
> For k-point parallelization with USE_REMOTE=1, the *.out file has the lines:
> 
> Got 96 cores nodelist n[270-272] tasks_per_node 32 jobs_per_node 32 
> because OMP_NUM_THREADS = 1 96 nodes for this job: n270 n270 n270 n270 
> n270 n270 ....
> 10:04:01 up 18 days, 58 min, 0 users, load average: 0.04, 0.04, 0.07 
> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
> ...
> -------- .machine0 : processors
> running dstart in single mode C T F DSTART ENDS 22.030u 0.102s 0:22.20 
> 99.6% 0+0k 0+0io 0pf+0w LAPW0 END full diagonalization forced Permission 
> denied, please try again. Permission denied, please try again. 
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> [1] + Done ( ( $remote $machine[$p] "cd $PWD;$t $taskset0 $exe 
> ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] 
> ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
> .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
> grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" ) Permission 
> denied, please try again. Permission denied, please try again. 
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> ...
> 
> 
> For mpi parallelization with USE_REMOTE=1, MPI_REMOTE=0, and WIEN_MPIRUN 
> set to "srun ...",
> the output is:
> LAPW0 END
> Abort(0) on node 0 (rank 0 in comm 0): application called 
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> Abort(0) on node 0 (rank 0 in comm 0): application called 
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> ...
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f 
> .lock_$lockfile[$p] ) >> .time1_$loop
> bccTi54Htet.scf1up_1: No such file or directory.
> grep: No match.
> grep: No match.
> grep: No match.
> 
> With WIEN_MPIRUN set to "mpirun -n _NP_ -machinefile _HOSTS_ _EXEC_",
> the output is:
>   LAPW0 END
> Abort(0) on node 0 (rank 0 in comm 0): application called 
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> Abort(9) on node 0 (rank 0 in comm 0): application called 
> MPI_Abort(MPI_COMM_WORLD, 9) - process 0
> w2k_dispatch_signal(): received: Terminated
> ...
> Abort(-1694629136) on node 11 (rank 11 in comm 0): application called 
> MPI_Abort(MPI_COMM_WORLD, -1694629136) - process 11
> [cli_11]: readline failed
> Abort(2118074352) on node 2 (rank 2 in comm 0): application called 
> MPI_Abort(MPI_COMM_WORLD, 2118074352) - process 2
> [cli_2]: readline failed
> WIEN2K ABORTING
> [cli_1]: readline failed
> WIEN2K ABORTING
> 
> 
> 
> --- Original message ---
> From: "Peter Blaha" <pblaha at theochem.tuwien.ac.at>
> Date: 7 May 2019, 09:14:44
> 
>     Setting USE_REMOTE=0 means that you do not use "ssh" in
>     k-parallel mode.
>     This has the following consequences:
>     What you write for "hostname" in .machines is not important; only the
>     number of lines counts. It will spawn as many k-parallel jobs as you
>     have lines (1:hostname), but they will all run ONLY on the "masternode",
>     i.e. you can use only ONE node within your slurm job.
> 
>     When you use mpi-parallel (with MPI_REMOTE=0 AND the MPIRUN command being
>     the "srun ..." command), it will use a srun command to spawn the mpi job,
>     not the usual mpirun command. In this case, however, "hostname" must be
>     the real name of the nodes where you want to run. The slurm script has to
>     find out the node names and insert them properly.
> 
>     On 06.05.2019 at 14:23, webfinder at ukr.net  <mailto:webfinder at ukr.net> wrote:
>     > Dear wien2k users,
>     > 
>     > wien2k_18.2
>     > I'm trying to run a test task on a cluster with a slurm batch system 
>     > using mpi parallelization.
>     > 
>     > In "parallel_options" USE_REMOTE=0, MPI_REMOTE=0.
>     > (during the siteconfig_lapw the slurm option was chosen)
>     > 
>     > the k-point parallelization works well. But if I change the "slurm.job" 
>     > script to produce .machines file for mpi run
>     > (e.g. from
>     > 1: hostname
>     > 1: hostname
>     > ....
>     > to
>     > 1: hostname hostname ....)
>     > 
>     > there is always an error message:
>     > permission_denied, please try again.
>     > permission_denied, please try again
>     > permission_denied, please try again (....)
>     > 
>     > How can I solve this?
>     > How could it be that k-point parallelization works but mpi does not?
>     > 
>     > P.S. After getting the "nodelist" from the batch system, I have also 
>     > tried adding an ssh-copy-id command to the slurm.job script to copy the 
>     > keys, but the result is the same.
>     > 
>     > Thank you for the answers!
>     > 
>     > 
>     > 
>     > _______________________________________________
>     > Wien mailing list
>     > Wien at zeus.theochem.tuwien.ac.at  <mailto:Wien at zeus.theochem.tuwien.ac.at>
>     > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>     > SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>     > 
> 
>     -- 
>     --------------------------------------------------------------------------
>     Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060  Vienna
>     Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
>     Email: blaha at theochem.tuwien.ac.at  <mailto:blaha at theochem.tuwien.ac.at>    WIEN2k: http://www.wien2k.at
>     WWW:   http://www.imc.tuwien.ac.at/tc_blaha
>     --------------------------------------------------------------------------
> 
> 
> 
> 

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------

