[Wien] slurm mpi
Peter Blaha
pblaha at theochem.tuwien.ac.at
Tue May 7 12:08:54 CEST 2019
So it seems that your cluster forbids the use of ssh (even on the assigned
nodes). If this is the case, you MUST use USE_REMOTE=0, and in
k-parallel mode you can use only one node (32 cores).
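For example (just a sketch; "localhost" is only a placeholder, since with
USE_REMOTE=0 the hostname is ignored anyway), a .machines file for 32
k-parallel jobs on one 32-core node would look like:

1:localhost
1:localhost
...            (32 such lines in total)
granularity:1
extrafine:1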
For mpi I do not know. There should be some "userguide" (web site,
wiki, ...) for your cluster, where all the details of how to use the cluster
are listed. In particular, it should say:
Which mpi + mkl + fftw you should use during compilation (maybe you
have a "module" system?). (You did not say anything about how you compiled
lapw1_mpi.)
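Just as an illustration (the module names below are only guesses; your
cluster will have its own), such a setup could look like:

module load intel intel-mkl intel-mpi fftw
ldd $WIENROOT/lapw1_mpi | grep -i -e mpi -e mkl

The ldd line shows which mpi and mkl libraries your lapw1_mpi is actually
linked against.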
How to execute an mpi job. On some clusters the standard "mpirun"
command is no longer supported; on our cluster we have to use srun
instead.
I don't know about your cluster; this depends on the SLURM version and
the specific setup of the cluster.
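As a sketch only (the exact srun options depend on your SLURM installation),
the relevant lines in $WIENROOT/parallel_options could then look like:

setenv USE_REMOTE 0
setenv MPI_REMOTE 0
setenv WIEN_MPIRUN "srun -n _NP_ _EXEC_"

where _NP_ and _EXEC_ are the placeholders that WIEN2k replaces with the
number of mpi processes and the executable.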
PS: A possible cause of the lapw1_mpi problems is always a mismatch
between mpi, BLACS and ScaLAPACK. Did you ever try to run dstart
or lapw0 in mpi mode? These are "simpler" mpi programs, as they do
not use ScaLAPACK.
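Just as a hedged sketch of such a test (the node name n270 is taken from your
output, and I assume here that the lapw0: line of .machines is also used for
the parallel dstart): add a line like

lapw0: n270:32

to your .machines and run

x dstart -p
x lapw0 -p

If these already fail, the problem is in the basic mpi/fftw installation
rather than in ScaLAPACK/BLACS.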
On 5/7/19 11:33 AM, webfinder at ukr.net wrote:
> Dear Prof. Blaha
>
> thank you for the explanation!
> Sorry, I should have put "hostname" in quotes. The script I used is based on
> the one in the WIEN-FAQ and produces .machines from the nodes provided by
> slurm:
> for k-points:
> #
> 1:n270
> 1:n270
> 1:n270
> 1:n270
> 1:n270
> ....
> granularity:1
> extrafine:1
>
> for mpi:
> #
> 1:n270 n270 n270 n270 n270 ....
> granularity:1
> extrafine:1
>
> After I changed USE_REMOTE to 1, the "Permission denied, please try
> again" message also appears for k-point parallelization.
> As stated in the userguide, I did things like "ssh-keygen" and copied the
> key to "authorized_keys", but the result is the same.
> As a "low-level" user on the cluster I don't have any permission to log in
> to the nodes.
>
> For k-point parallelization with USE_REMOTE=1 the *.out file has the lines:
>
> Got 96 cores
> nodelist n[270-272]
> tasks_per_node 32
> jobs_per_node 32 because OMP_NUM_THREADS = 1
> 96 nodes for this job: n270 n270 n270 n270 n270 n270 ....
> 10:04:01 up 18 days, 58 min, 0 users, load average: 0.04, 0.04, 0.07
> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
> ...
> -------- .machine0 : processors
> running dstart in single mode
> C T F
> DSTART ENDS
> 22.030u 0.102s 0:22.20 99.6% 0+0k 0+0io 0pf+0w
> LAPW0 END
> full diagonalization forced
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> [1] + Done ( ( $remote $machine[$p] "cd $PWD;$t $taskset0 $exe
> ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p]
> ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw
> .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop;
> grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" ) Permission
> denied, please try again. Permission denied, please try again.
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> ...
>
>
> For mpi parallelization with USE_REMOTE=1, MPI_REMOTE=0 and WIEN_MPIRUN set
> to "srun ...", the output is:
> LAPW0 END
> Abort(0) on node 0 (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> Abort(0) on node 0 (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> ...
> [1] + Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
> bccTi54Htet.scf1up_1: No such file or directory.
> grep: No match.
> grep: No match.
> grep: No match.
>
> If WIEN_MPIRUN is set to "mpirun -n _NP_ -machinefile _HOSTS_ _EXEC_",
> the output is:
> LAPW0 END
> Abort(0) on node 0 (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> Abort(9) on node 0 (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 9) - process 0
> w2k_dispatch_signal(): received: Terminated
> ...
> Abort(-1694629136) on node 11 (rank 11 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, -1694629136) - process 11
> [cli_11]: readline failed
> Abort(2118074352) on node 2 (rank 2 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 2118074352) - process 2
> [cli_2]: readline failed
> WIEN2K ABORTING
> [cli_1]: readline failed
> WIEN2K ABORTING
>
>
>
> --- Original message ---
> From: "Peter Blaha" <pblaha at theochem.tuwien.ac.at>
> Date: 7 May 2019, 09:14:44
>
> Setting USE_REMOTE=0 means that you do not use "ssh" in k-parallel mode.
> This has the following consequences:
> What you write for "hostname" in .machines is not important; only the
> number of lines counts. It will spawn as many k-parallel jobs as you
> have lines (1:hostname), but they will all run ONLY on the "masternode",
> i.e. you can use only ONE node within your slurm job.
>
> When you use mpi parallelization (with MPI_REMOTE=0 AND the MPIRUN command
> set to the "srun ..." command), it will use srun to spawn the mpi job, not
> the usual mpirun command. In this case, however, "hostname" must be the
> real name of the nodes where you want to run. The slurm script has to
> find out the node names and insert them properly.
>
> On 06.05.2019 at 14:23, webfinder at ukr.net wrote:
> > Dear wien2k users,
> >
> > wien2k_18.2
> > I'm trying to run a test task on a cluster with slurm batch system using
> > mpi parallelization.
> >
> > In "parallel_options" USE_REMOTE=0, MPI_REMOTE=0.
> > (during the siteconfig_lapw the slurm option was chosen)
> >
> > The k-point parallelization works well. But if I change the "slurm.job"
> > script to produce a .machines file for an mpi run
> > (e.g. from
> > 1: hostname
> > 1: hostname
> > ....
> > to
> > 1: hostname hostname ....)
> >
> > there is always an error message:
> > permission_denied, please try again.
> > permission_denied, please try again
> > permission_denied, please try again (....)
> >
> > How can I solve this?
> > How can it be that k-point parallelization works but mpi does not?
> >
> > P.S. After getting the "nodelist" from the batch system, I have also tried
> > to include an ssh-copy-id command in the slurm.job script to copy the keys,
> > but the result is the same.
> >
> > Thank you for the answers!
> >
> >
> >
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------