[Wien] Error in mpi+k point parallelization across multiple nodes
Gavin Abo
gsabo at crimson.ua.edu
Thu May 7 01:11:03 CEST 2015
Ok, now it is clear that there are no additional error messages.
Unfortunately, I cannot tell specifically what went wrong from those
error messages.
You might try replacing mpirun with mpirun_rsh. As you can see in the
mvapich-discuss thread at
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-May/004402.html
they made the same switch, and it seems that it led them to a problem
with passwordless ssh.
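If you want to rule out the passwordless ssh part first, a minimal check
(just a sketch, using the node names that appear in your output) is to
log into one of the compute nodes assigned to the job and make sure the
other node answers without a password prompt:

    # run from a shell on z1-17; it should print the remote hostname
    # (z1-18) without asking for a password
    ssh z1-18 hostname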
In your parallel_options file, you might also want to change "setenv
USE_REMOTE 0" to "setenv USE_REMOTE 1", and then try MPI_REMOTE with
both 0 and 1, to check whether any of these other configurations work
while still using mpirun (a sketch of the relevant lines is below).
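For reference, the relevant lines in $WIENROOT/parallel_options would
look roughly like the sketch below; the WIEN_MPIRUN line is only an
illustration, since the real command depends on the choices made during
siteconfig_lapw, so keep whatever your installation already has there:

    setenv USE_REMOTE 1
    # try both 0 and 1 here while keeping mpirun
    setenv MPI_REMOTE 0
    setenv WIEN_GRANULARITY 1
    # illustrative only; use the command written by siteconfig_lapw
    setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"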
On 5/6/2015 11:23 AM, lung Fermin wrote:
> Thanks for the reply. Please see below.
>
>
> >As I asked before, did you give us all the error information in the
> case.dayfile and from standard output? It is not entirely clear in
> your previous posts, but it looks to me that you might have only
> provided information from the case.dayfile and the error files (cat
> *.error), but maybe not from the standard output. Are you still using
> the PBS script in your old post at
> http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html ?
> In the script, I can see that the standard output is set to be written
> to a file called wien2k_output.
>
>
> Sorry for the confusion. Yes, I still use the PBS script in the above
> link. The error messages in my previous posts came from the standard
> output (the wien2k_output file). When using 2 nodes with 32 cores for
> one k point, the standard output gives:
> ----------------------------
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
> z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
> z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
> number of processors: 32
> LAPW0 END
> [16] Failed to dealloc pd (Device or resource busy)
> [0] Failed to dealloc pd (Device or resource busy)
> [17] Failed to dealloc pd (Device or resource busy)
> [2] Failed to dealloc pd (Device or resource busy)
> [18] Failed to dealloc pd (Device or resource busy)
> [1] Failed to dealloc pd (Device or resource busy)
> LAPW1 END
> LAPW2 - FERMI; weighs written
> [z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291)
> terminated with signal 9 -> abort job
> [z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> 9. MPI process died?
> [z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
> [z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> z1-17 aborted: Error while reading a PMI socket (4)
> [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file
> descriptor 21. MPI process died?
> [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file
> descriptor 21. MPI process died?
> [z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
> process died?
> cp: cannot stat `.in.tmp': No such file or directory
>
> > stop error
> -----------------------------
> And the .dayfile reads:
>
> on z1-17 with PID 29439
> using WIEN2k_14.2 (Release 15/10/2014)
>
> start (Thu Apr 30 17:36:59 2015) with lapw0 (40/99 to go)
>
> cycle 1 (Thu Apr 30 17:36:59 2015) (40/99 to go)
>
> > lapw0 -p (17:36:59) starting parallel lapw0 at Thu Apr 30 17:36:59 2015
> -------- .machine0 : 32 processors
> 904.074u 8.710s 1:01.54 1483.2% 0+0k 239608+78400io 105pf+0w
> > lapw1 -p -c (17:38:01) starting parallel lapw1 at Thu Apr 30 17:38:01 2015
> -> starting parallel LAPW1 jobs at Thu Apr 30 17:38:01 2015
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
> z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
> z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18(8) 469689.261u
> 1680.003s 8:12:29.52 1595.1% 0+0k 204560+31265944io 366pf+0w
> Summary of lapw1para:
> z1-17 k=0 user=0 wallclock=0
> 469788.683u 1726.356s 8:12:31.33 1595.5% 0+0k 206128+31266512io
> 379pf+0w
> > lapw2 -p -c (01:50:32) running LAPW2 in parallel mode
> z1-17 0.034u 0.040s 1:35.16 0.0% 0+0k 10696+0io 80pf+0w
> Summary of lapw2para:
> z1-17 user=0.034 wallclock=95.16
> ** LAPW2 crashed!
> 4.645u 0.458s 1:42.01 4.9% 0+0k 74792+45008io 133pf+0w
> error: command /home/stretch/flung/DFT/WIEN2k/lapw2cpara -c lapw2.def
> failed
>
> > stop error
>
> -----------------------------
>
> >When it runs fine on a single node, does it always use the same node
> (say z1-17) or does it run fine on other nodes (like z1-18)?
>
> Not really. The nodes were assigned randomly.