[Wien] slurm mpi

webfinder at ukr.net
Tue May 7 11:33:58 CEST 2019


Dear Prof. Blaha,
thank you for the explanation!
Sorry, I should have put "hostname" in quotes. The script I use is based on
the one in the WIEN-FAQ and produces .machines from the nodes provided by
SLURM (a simplified sketch of that fragment follows the two examples below):
for k-points:
#
1:n270 
1:n270 
1:n270 
1:n270 
1:n270
....
granularity:1
extrafine:1

for mpi:
#
1:n270 n270 n270 n270 n270 ....
granularity:1
extrafine:1
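
The .machines-generating part of slurm.job is roughly along the following
lines (a simplified sketch, not the exact FAQ script; "scontrol show
hostnames" expands the compact SLURM nodelist into real host names, and
SLURM_NTASKS_PER_NODE is set because --ntasks-per-node is requested):

#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=32

# simplified sketch of the .machines generation; the real job script differs
# expand the compact nodelist (e.g. n[270-272]) into one host name per line
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")

echo '#' > .machines

# k-point layout: one "1:host" line per core
for host in $nodes; do
  for i in $(seq 1 "$SLURM_NTASKS_PER_NODE"); do
    echo "1:$host" >> .machines
  done
done

# mpi layout instead: one single "1:" line listing each host once per core
#   echo -n "1:" >> .machines
#   for host in $nodes; do
#     for i in $(seq 1 "$SLURM_NTASKS_PER_NODE"); do
#       echo -n "$host " >> .machines
#     done
#   done
#   echo "" >> .machines

echo 'granularity:1' >> .machines
echo 'extrafine:1' >> .machines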

After I changed USE_REMOTE to 1, the "Permission denied, please try again"
message appears also for k-point parallelization.
As stated in the userguide, I ran "ssh-keygen" and copied the key to
"authorized_keys", but the result is the same.
As a "low-level" user on the cluster I don't have permission to log in to
the nodes.

For k-point parallelization with USE_REMOTE=1, the *.out file contains the lines:

Got 96 cores
nodelist n[270-272]
tasks_per_node 32
jobs_per_node 32 because OMP_NUM_THREADS = 1
96 nodes for this job: n270 n270 n270 n270 n270 n270 ....
10:04:01 up 18 days, 58 min, 0 users, load average: 0.04, 0.04, 0.07
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
...
-------- .machine0 : processors
running dstart in single mode
C T F
DSTART ENDS
22.030u 0.102s 0:22.20 99.6% 0+0k 0+0io 0pf+0w
LAPW0 END
full diagonalization forced
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
[1] + Done ( ( $remote $machine[$p] "cd $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
...


For mpi parallelization with USE_REMOTE=1, MPI_REMOTE=0, and WIEN_MPIRUN set
to "srun ..." (the corresponding parallel_options lines are sketched after
the logs below), the output is:
LAPW0 END
Abort(0) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
Abort(0) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
...
[1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
bccTi54Htet.scf1up_1: No such file or directory.
grep: No match.
grep: No match.
grep: No match.

If WIEN_MPIRUN is set to "mpirun -n _NP_ -machinefile _HOSTS_ _EXEC_",
the output is:
 LAPW0 END
Abort(0) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
Abort(9) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
w2k_dispatch_signal(): received: Terminated
...
Abort(-1694629136) on node 11 (rank 11 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1694629136) - process 11
[cli_11]: readline failed
Abort(2118074352) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 2118074352) - process 2
[cli_2]: readline failed
WIEN2K ABORTING
[cli_1]: readline failed
WIEN2K ABORTING
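
For reference, the relevant parallel_options settings in the srun case above
are essentially the following (a simplified sketch: the actual WIEN_MPIRUN
line is the one siteconfig_lapw generated for the slurm option; _NP_ and
_EXEC_ are WIEN2k's run-time placeholders, and the srun flags shown here are
only schematic):

setenv USE_REMOTE 1
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
# schematic srun call; the real flags come from the siteconfig-generated line
setenv WIEN_MPIRUN "srun -n _NP_ _EXEC_"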



--- Original message ---
From: "Peter Blaha" <pblaha at theochem.tuwien.ac.at>
Date: 7 May 2019, 09:14:44

Setting USE_REMOTE=0 means that "ssh" is not used in k-parallel mode.
This has the following consequences:
What you write for "hostname" in .machines is not important; only the
number of lines counts. It will spawn as many k-parallel jobs as you have
lines (1:hostname), but they will all run ONLY on the "masternode", i.e.
you can use only ONE node within your slurm job.

When you use mpi-parallel mode (with MPI_REMOTE=0 AND the MPIRUN command set
to the "srun ..." command), a srun command is used to spawn the mpi job
instead of the usual mpirun command. In this case, however, "hostname" must
be the real name of the nodes where you want to run. The slurm script has to
find out the node names and insert them properly.

On 06.05.2019 at 14:23, webfinder at ukr.net wrote:
> Dear wien2k users,
> 
> wien2k_18.2
> I'm trying to run a test job on a cluster with the slurm batch system
> using mpi parallelization.
> 
> In "parallel_options" USE_REMOTE=0, MPI_REMOTE=0.
> (during siteconfig_lapw the slurm option was chosen)
> 
> The k-point parallelization works well. But if I change the "slurm.job"
> script to produce a .machines file for an mpi run
> (e.g. from
> 1: hostname
> 1: hostname
> ....
> to
> 1: hostname hostname ....)
> 
> there is always an error message:
> Permission denied, please try again.
> Permission denied, please try again.
> Permission denied, please try again (....)
> 
> How can I solve this?
> How can it be that k-point parallelization works but mpi does not?
> 
> P.S. After getting the "nodelist" from the batch system, I have also tried
> adding an ssh-copy-id command to the slurm.job script to copy the keys,
> but the result is the same.
> 
> Thank you for the answers!
> 
> 
> 

-- 
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/tc_blaha
--------------------------------------------------------------------------
