[Wien] slurm mpi
webfinder at ukr.net
Tue May 7 13:47:18 CEST 2019
Dear Prof. Blaha,
Thank you!
The description of the job scripts for the cluster is here:
https://redmine.mcia.univ-bordeaux.fr/projects/cluster-curta/wiki/Slurm
(unfortunately it is in French, and I'm not strong in cluster structures)
Yes, the cluster uses a "module" system. I've used commands like "module load ..." in .bashrc and slurm.job (in addition, I include the direct path to the compiler and mpi with the "source" command in .bashrc).
To compile WIEN2k I used Intel 2019.3.199.
I compiled FFTW 3.3.8 myself.
WIEN2k compiled with no errors.
lapw1_mpi was compiled with default options; only the direct path to the libraries was specified.
P.S. I can't reproduce the previous errors. Now, when running mpi, I get a "permission denied" error with MPI_REMOTE=0.
--- Original message ---
From: "Peter Blaha" <pblaha at theochem.tuwien.ac.at>
Date: 7 May 2019, 13:08:58
So it seems that your cluster forbids the use of ssh (even on assigned
nodes). If this is the case, you MUST use USE_REMOTE=0, and with
k-parallel mode you can use only one node (32 cores).
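
Whether ssh to the assigned nodes is actually permitted can be checked from inside the job before deciding on USE_REMOTE. The following is only a minimal sketch: the function name suggest_use_remote is hypothetical, and it assumes a standard OpenSSH client (the BatchMode and ConnectTimeout options make ssh fail quickly instead of prompting for a password).

```shell
# Sketch only: print 1 if passwordless ssh works to every node given, else 0.
# The function name is an illustration and not part of WIEN2k.
suggest_use_remote() {
    for node in "$@"; do
        # BatchMode=yes: never prompt for a password, fail instead.
        if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
                </dev/null >/dev/null 2>&1; then
            echo 0
            return 0
        fi
    done
    echo 1
}

# Possible use inside slurm.job (node names taken from the batch system):
# suggest_use_remote $(scontrol show hostnames "$SLURM_JOB_NODELIST")
```

If this prints 0, the "Permission denied" messages are to be expected with USE_REMOTE=1, whatever keys were generated.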
For mpi I do not know. There should be some "userguide" (web site,
wiki, ...) for your cluster, where all the details of how to use the cluster
are listed. In particular they should say:
Which mpi + mkl + fftw you should use during compilation (maybe you
have a "module" system?). (You did not say anything about how you compiled
lapw1_mpi.)
How to execute an mpi job. On some clusters the standard "mpirun"
command is no longer supported, and on our cluster we have to use srun
instead.
I don't know about your cluster; this depends on the SLURM version and
the specific setup of the cluster.
PS: A possible cause of lapw1_mpi problems is always a mismatch
between mpi, blacs and Scalapack. Did you ever try to run dstart
or lapw0 in mpi mode? These are "simpler" mpi programs, as they do
not use SCALAPACK.
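
For reference, the two .machines layouts discussed in this thread (one "1:host" line per core for k-point parallelization, a single "1:host host ..." line for mpi) can be generated from the node names handed out by slurm. The following is only a sketch under stated assumptions: make_machines is a hypothetical helper, not the WIEN-FAQ script, and in mpi mode it writes one line per node, which is just one possible layout.

```shell
# Sketch only: print .machines content for WIEN2k from a list of node names.
# Usage: make_machines MODE TASKS_PER_NODE NODE...
#   MODE = kpoint : one "1:host" line per task (k-point parallel)
#   MODE = mpi    : one line per node, host repeated once per mpi rank
make_machines() {
    mode=$1
    tasks=$2
    shift 2
    echo "#"
    for node in "$@"; do
        i=0
        line="1:"
        while [ "$i" -lt "$tasks" ]; do
            if [ "$mode" = "kpoint" ]; then
                echo "1:$node"
            else
                line="$line$node "
            fi
            i=$((i + 1))
        done
        # In mpi mode, emit the collected line without the trailing space.
        if [ "$mode" != "kpoint" ]; then
            echo "${line% }"
        fi
    done
    echo "granularity:1"
    echo "extrafine:1"
}

# Possible call inside slurm.job (assumes --ntasks-per-node was requested):
# make_machines mpi "$SLURM_NTASKS_PER_NODE" \
#     $(scontrol show hostnames "$SLURM_JOB_NODELIST") > .machines
```

The node names are passed as arguments so that the function can be tried outside a batch job, e.g. "make_machines kpoint 2 n270".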
On 5/7/19 11:33 AM, webfinder at ukr.net wrote:
> Dear Prof. Blaha
>
> thank you for the explanation!
> Sorry, I should have put "hostname" in quotes. The script I used is based on the one
> in the WIEN-FAQ and produces .machines from the nodes provided by
> slurm:
> for k-points:
> #
> 1:n270
> 1:n270
> 1:n270
> 1:n270
> 1:n270
> ....
> granularity:1
> extrafine:1
>
> for mpi:
> #
> 1:n270 n270 n270 n270 n270 ....
> granularity:1
> extrafine:1
>
> After I changed USE_REMOTE to 1, the "Permission denied, please try
> again" message also appears for k-point parallelization.
> As stated in the userguide, I ran "ssh-keygen" and copied the key
> to "authorized_keys", but the result is the same.
> As a "low-level" user on the cluster I don't have permission to log in
> to the nodes.
>
> For k-point parallelization with USE_REMOTE=1 the *.out file has the lines:
>
> Got 96 cores nodelist n[270-272] tasks_per_node 32 jobs_per_node 32
> because OMP_NUM_THREADS = 1 96 nodes for this job: n270 n270 n270 n270
> n270 n270 ....
> 10:04:01 up 18 days, 58 min, 0 users, load average: 0.04, 0.04, 0.07
> USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
> ...
> -------- .machine0 : processors
> running dstart in single mode C T F DSTART ENDS 22.030u 0.102s 0:22.20
> 99.6% 0+0k 0+0io 0pf+0w LAPW0 END full diagonalization forced Permission
> denied, please try again. Permission denied, please try again.
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> [1] + Done ( ( $remote $machine[$p] "cd $PWD;$t $taskset0 $exe
> ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p]
> ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw
> .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop;
> grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" ) Permission
> denied, please try again. Permission denied, please try again.
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
> ...
>
>
> For mpi-parallelization with USE_REMOTE=1, MPI_REMOTE=0, WIEN_MPIRUN
> "srun ..."
> the output is:
> LAPW0 END
> Abort(0) on node 0 (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> Abort(0) on node 0 (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> ...
> [1] + Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
> bccTi54Htet.scf1up_1: No such file or directory.
> grep: No match.
> grep: No match.
> grep: No match.
>
> If WIEN_MPIRUN is "mpirun -n _NP_ -machinefile _HOSTS_ _EXEC_",
> the output is:
> LAPW0 END
> Abort(0) on node 0 (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 0) - process 0
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> Abort(9) on node 0 (rank 0 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 9) - process 0
> w2k_dispatch_signal(): received: Terminated
> ...
> Abort(-1694629136) on node 11 (rank 11 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, -1694629136) - process 11
> [cli_11]: readline failed
> Abort(2118074352) on node 2 (rank 2 in comm 0): application called
> MPI_Abort(MPI_COMM_WORLD, 2118074352) - process 2
> [cli_2]: readline failed
> WIEN2K ABORTING
> [cli_1]: readline failed
> WIEN2K ABORTING
>
>
>
> --- Original message ---
> From: "Peter Blaha" <pblaha at theochem.tuwien.ac.at>
> Date: 7 May 2019, 09:14:44
>
> Setting USE_REMOTE=0 means that you do not use "ssh" in
> k-parallel mode.
> This has the following consequences:
> What you write for "hostname" in .machines is not important; only the
> number of lines counts. It will spawn as many k-parallel jobs as you
> have lines (1:hostname), but they will all run ONLY on the "masternode",
> i.e. you can use only ONE node within your slurm job.
>
> When you use mpi-parallel (with MPI_REMOTE=0 AND the MPIRUN command set to the
> "srun ..." command), it will use the srun command to spawn the mpi job, not
> the usual mpirun command. In this case, however, "hostname" must be the
> real name of the nodes where you want to run. The slurm script has to
> find out the node names and insert them properly.
>
> On 06.05.2019 at 14:23, webfinder at ukr.net wrote:
> > Dear wien2k users,
> >
> > wien2k_18.2
> > I'm trying to run a test task on a cluster with the slurm batch system using
> > mpi parallelization.
> >
> > In "parallel_options" USE_REMOTE=0, MPI_REMOTE=0.
> > (during siteconfig_lapw the slurm option was chosen)
> >
> > The k-point parallelization works well. But if I change the "slurm.job"
> > script to produce a .machines file for an mpi run
> > (e.g. from
> > 1: hostname
> > 1: hostname
> > ....
> > to
> > 1: hostname hostname ....)
> >
> > there is always an error message:
> > permission_denied, please try again.
> > permission_denied, please try again
> > permission_denied, please try again (....)
> >
> > How can I solve this?
> > How could it be that k-point parallelization works but mpi does not?
> >
> > P.S. I have also tried, after getting the "nodelist" from the batch system, to
> > include an ssh-copy-id command in the slurm.job script to copy the keys, but the
> > result is the same.
> >
> > Thank you for the answers!
> >
> >
> >
> > _______________________________________________
> > Wien mailing list
> > Wien at zeus.theochem.tuwien.ac.at <mailto:Wien at zeus.theochem.tuwien.ac.at>
> > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> >
>
> --
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
> WWW: http://www.imc.tuwien.ac.at/tc_blaha
> --------------------------------------------------------------------------
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------