[Wien] slurm mpi

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue May 7 14:13:04 CEST 2019


Not enough info. I briefly checked your wiki (I do not know French), but 
you seem to have Intel MPI (which I would recommend).

Which MPI are you loading?

Did you also load all modules in the batch job?

Which ScaLAPACK?

Which BLACS library?

Post your OPTIONS files (WIEN2k_OPTIONS, parallel_options) from $WIENROOT,
and also the "important part" of your .bashrc (the module load ... lines).
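
For orientation, the "important part" of such a .bashrc usually consists of
a few module and environment lines roughly like the sketch below; the module
names are only placeholders, not the real ones on your cluster:

    # sketch of the relevant .bashrc lines (module names are placeholders)
    module load intel          # compiler
    module load intelmpi       # MPI
    module load mkl            # BLAS/LAPACK/ScaLAPACK/BLACS
    module load fftw           # or point WIEN2k to your own FFTW build
    export WIENROOT=$HOME/WIEN2k     # assumed install location
    export PATH=$WIENROOT:$PATH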

PS: From the wiki I saw that mpiexec (or mpirun) seems to be supported.

I also saw that one can have an interactive node. Usually this makes 
such tests much simpler than running them as batch jobs.
On the interactive node, make sure all modules are loaded, change into 
the proper directory, and issue:

mpirun -np 4 $WIENROOT/lapw0_mpi lapw0.def
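
A minimal test sequence on the interactive node could look like the following
sketch (module names and the case directory are placeholders, and "x lapw0 -d"
is assumed to only create lapw0.def without running it):

    # sketch: test lapw0_mpi by hand on an interactive node
    module load intel intelmpi mkl       # placeholder module names
    cd /scratch/$USER/bccTi54Htet        # hypothetical case directory
    x lapw0 -d                           # create lapw0.def (or reuse one from an earlier run)
    mpirun -np 4 $WIENROOT/lapw0_mpi lapw0.def

If this already aborts, the problem is in the basic MPI/library setup rather
than in the slurm script, since lapw0_mpi does not use ScaLAPACK.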




On 5/7/19 1:47 PM, webfinder at ukr.net wrote:
> Dear Prof. Blaha
> Thank you!
> 
> The description of the cluster's scripts is here
> https://redmine.mcia.univ-bordeaux.fr/projects/cluster-curta/wiki/Slurm
> (unfortunately it is in French, and I am not very familiar with cluster structures)
> 
> Yes, the cluster uses a "module" system. I have used commands like "module 
> load ..." in .bashrc and slurm.job (in addition, I include the direct path to 
> the compiler and MPI with a "source" command in .bashrc).
> To compile WIEN2k I used Intel 2019.3.199.
> I compiled FFTW 3.3.8 myself.
> The WIEN2k compilation finished without errors.
> lapw1_mpi was compiled with the default options; only the direct path to 
> the libraries was specified.
> 
> P.S. I can't reproduce the previous errors. Now, running MPI, I get a 
> "permission denied" error with MPI_REMOTE=0.
> 
> 
> --- Original message ---
> From: "Peter Blaha" <pblaha at theochem.tuwien.ac.at>
> Date: 7 May 2019, 13:08:58
> 
>     So it seems that your cluster forbids the use of ssh (even on assigned
>     nodes). If this is the case, you MUST use USE_REMOTE=0, and in
>     k-parallel mode you can use only one node (32 cores).
> 
>     For mpi I do not know. There should be some "userguide" (web site,
>     wiki, ...) for your cluster, where all details of how to use the cluster
>     are listed. In particular it should say:
> 
>     Which mpi + mkl + fftw you should use during compilation (maybe you
>     have a "module" system?). (You did not say anything about how you
>     compiled lapw1_mpi.)
>     How to execute an mpi job. On some clusters the standard "mpirun"
>     command is no longer supported, and on our cluster we have to use
>     srun instead (a sketch follows after the next paragraph).
> 
>     I don't know about your cluster; this depends on the SLURM version and
>     the specific setup of the cluster.
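> 
>     As an illustration only (not your site's actual setup), the mpirun
>     definition in $WIENROOT/parallel_options would then be switched from
>     mpirun to srun roughly as in this sketch; parallel_options uses csh
>     syntax, and the exact srun flags must be taken from your cluster
>     documentation:
> 
>         # sketch of parallel_options on a srun-only cluster (csh syntax)
>         setenv USE_REMOTE 0
>         setenv MPI_REMOTE 0
>         setenv WIEN_MPIRUN "srun -n _NP_ _EXEC_"
> 
>     Here _NP_ and _EXEC_ are the usual WIEN2k placeholders for the number
>     of mpi processes and the executable.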
> 
>     PS: A possibility for the lapw1_mpi problems is always a mismatch
>     between mpi, blacs and scalapack. Did you ever try to run dstart
>     or lapw0 in mpi mode? These are simpler mpi programs, as they do
>     not use SCALAPACK.
> 
>     On 5/7/19 11:33 AM, webfinder at ukr.net wrote:
>     > Dear Prof. Blaha
>     > 
>     > thank you for the explanation!
>     > Sorry, I should have put "hostname" in quotes. The script I used is based 
>     > on the one in the WIEN-FAQ and produces .machines from the nodes provided 
>     > by slurm:
>     > for k-points:
>     > #
>     > 1:n270
>     > 1:n270
>     > 1:n270
>     > 1:n270
>     > 1:n270
>     > ....
>     > granularity:1
>     > extrafine:1
>     > 
>     > for mpi:
>     > #
>     > 1:n270 n270 n270 n270 n270 ....
>     > granularity:1
>     > extrafine:1
>     > 
>     > After I changed USE_REMOTE to 1, the "Permission denied, please try 
>     > again" message also appears for k-point parallelization.
>     > As stated in the userguide, I did things like "ssh-keygen" and copied 
>     > the key to "authorized_keys", but the result is the same.
>     > As a "low-level" user on the cluster I don't have any permission to log in 
>     > to the nodes.
>     > 
>     > For k-point parallelization with USE_REMOTE=1 the *.out file has the lines:
>     > 
>     > Got 96 cores nodelist n[270-272] tasks_per_node 32 jobs_per_node 32
>     > because OMP_NUM_THREADS = 1 96 nodes for this job: n270 n270 n270 n270 
>     > n270 n270 ....
>     > 10:04:01 up 18 days, 58 min, 0 users, load average: 0.04, 0.04, 0.07 
>     > USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
>     > ...
>     > -------- .machine0 : processors
>     > running dstart in single mode C T F DSTART ENDS 22.030u 0.102s 0:22.20 
>     > 99.6% 0+0k 0+0io 0pf+0w LAPW0 END full diagonalization forced Permission 
>     > denied, please try again. Permission denied, please try again. 
>     > Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
>     > [1] + Done ( ( $remote $machine[$p] "cd $PWD;$t $taskset0 $exe 
>     > ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] 
>     > ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
>     > .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
>     > grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" ) Permission 
>     > denied, please try again. Permission denied, please try again. 
>     > Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
>     > ...
>     > 
>     > 
>     > For mpi parallelization with USE_REMOTE=1, MPI_REMOTE=0, and 
>     > WIEN_MPIRUN = "srun ...",
>     > the output is:
>     > LAPW0 END
>     > Abort(0) on node 0 (rank 0 in comm 0): application called 
>     > MPI_Abort(MPI_COMM_WORLD, 0) - process 0
>     > Abort(0) on node 0 (rank 0 in comm 0): application called 
>     > MPI_Abort(MPI_COMM_WORLD, 0) - process 0
>     > ...
>     > [1]  + Done                          ( cd $PWD; $t $ttt; rm -f 
>     > .lock_$lockfile[$p] ) >> .time1_$loop
>     > bccTi54Htet.scf1up_1: No such file or directory.
>     > grep: No match.
>     > grep: No match.
>     > grep: No match.
>     > 
>     > if WIEN_MPIRUN "mpirun -n _NP_ -machinefile _HOSTS_ _EXEC_"
>     > the output is:
>     >   LAPW0 END
>     > Abort(0) on node 0 (rank 0 in comm 0): application called 
>     > MPI_Abort(MPI_COMM_WORLD, 0) - process 0
>     > w2k_dispatch_signal(): received: Terminated
>     > w2k_dispatch_signal(): received: Terminated
>     > Abort(9) on node 0 (rank 0 in comm 0): application called 
>     > MPI_Abort(MPI_COMM_WORLD, 9) - process 0
>     > w2k_dispatch_signal(): received: Terminated
>     > ...
>     > Abort(-1694629136) on node 11 (rank 11 in comm 0): application called 
>     > MPI_Abort(MPI_COMM_WORLD, -1694629136) - process 11
>     > [cli_11]: readline failed
>     > Abort(2118074352) on node 2 (rank 2 in comm 0): application called 
>     > MPI_Abort(MPI_COMM_WORLD, 2118074352) - process 2
>     > [cli_2]: readline failed
>     > WIEN2K ABORTING
>     > [cli_1]: readline failed
>     > WIEN2K ABORTING
>     > 
>     > 
>     > 
>     > --- Original message ---
>     > From: "Peter Blaha" <pblaha at theochem.tuwien.ac.at>
>     > Date: 7 May 2019, 09:14:44
>     > 
>     >     Setting USE_REMOTE=0 means that you do not use "ssh" in
>     >     k-parallel mode.
>     >     This has the following consequences:
>     >     What you write for "hostname" in .machines is not important; only
>     >     the number of lines counts. It will spawn as many k-parallel jobs
>     >     as you have lines (1:hostname), but they will all run ONLY on the
>     >     "masternode", i.e. you can use only ONE node within your slurm job.
>     > 
>     >     When you use mpi-parallel (with MPI_REMOTE=0 AND the MPIRUN command
>     >     set to the "srun ..." command), it will use a srun command to spawn
>     >     the mpi job, not the usual mpirun command. In this case, however,
>     >     "hostname" must be the real name of the nodes where you want to run.
>     >     The slurm script has to find out the node names and insert them
>     >     properly (a sketch follows below).
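>     > 
>     >     As an illustration only (not the actual slurm.job of your cluster),
>     >     the part of the script that builds .machines for an mpi run could
>     >     look roughly like this bash sketch; the SLURM variables are the
>     >     standard ones, everything else is a placeholder:
>     > 
>     >         # sketch: write .machines with the real node names from SLURM
>     >         # (assumes --ntasks-per-node was set in the #SBATCH header)
>     >         nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST)
>     >         echo '#' > .machines
>     >         line="1:"
>     >         for n in $nodes; do
>     >           # repeat each node name once per core requested on it
>     >           for i in $(seq $SLURM_NTASKS_PER_NODE); do line="$line$n "; done
>     >         done
>     >         echo "$line"         >> .machines
>     >         echo "granularity:1" >> .machines
>     >         echo "extrafine:1"   >> .machines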
>     > 
>     >     On 06.05.2019 at 14:23, webfinder at ukr.net wrote:
>     >     > Dear wien2k users,
>     >     > 
>     >     > wien2k_18.2
>     >     > I'm trying to run a test task on a cluster with a slurm batch system, 
>     >     > using mpi parallelization.
>     >     > 
>     >     > In "parallel_options" USE_REMOTE=0, MPI_REMOTE=0.
>     >     > (during the siteconfig_lapw the slurm option was chosen)
>     >     > 
>     >     > The k-point parallelization works well. But if I change the "slurm.job" 
>     >     > script to produce a .machines file for an mpi run
>     >     > (e.g. from
>     >     > 1: hostname
>     >     > 1: hostname
>     >     > ....
>     >     > to
>     >     > 1: hostname hostname ....)
>     >     > 
>     >     > there is always an error message:
>     >     > permission_denied, please try again.
>     >     > permission_denied, please try again
>     >     > permission_denied, please try again (....)
>     >     > 
>     >     > How can I solve this?
>     >     > How could it be that k-point parallelization works but mpi does not?
>     >     > 
>     >     > P.S. After getting the "nodelist" from the batch system, I also tried 
>     >     > including an ssh-copy-id command in the slurm.job script to copy the 
>     >     > keys, but the result is the same.
>     >     > 
>     >     > Thank you for the answers!
>     >     > 
>     >     > 
>     >     > 
> 
> 
> 
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> 

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------

