[Wien] Error in mpi+k point parallelization across multiple nodes

Gavin Abo gsabo at crimson.ua.edu
Thu May 7 01:11:03 CEST 2015


Ok, now it is clear that there are no additional error messages.  
Unfortunately, I cannot tell specifically what went wrong from those 
error messages alone.

You might try replacing mpirun with mpirun_rsh.  As you can see at

http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-May/004402.html

they replaced mpirun with mpirun_rsh, and doing so revealed a 
problem with passwordless ssh.
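If passwordless ssh between the compute nodes is broken, the parallel 
scripts tend to hang or die in similar ways.  A quick check, sketched 
below using the node names z1-17/z1-18 from your output, is to log 
into one node of the job and ssh to the other in batch mode:

```shell
# Run this on one compute node of the job (e.g. z1-17).
# BatchMode=yes makes ssh fail immediately instead of prompting,
# so a password problem shows up as an error rather than a hang:
ssh -o BatchMode=yes z1-18 hostname

# If that fails, set up passwordless ssh between the nodes, e.g.:
ssh-keygen -t rsa        # accept the defaults, empty passphrase
ssh-copy-id z1-18        # copies your public key to z1-18
```

If the first command prints the remote hostname with no prompt, 
passwordless ssh is fine and the problem lies elsewhere.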

In your parallel_options file, you might also want to change "setenv 
USE_REMOTE 0" to "setenv USE_REMOTE 1", and then try both 0 and 1 for 
MPI_REMOTE, to check whether any of these other configurations works 
while still using mpirun.
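For reference, the relevant part of $WIENROOT/parallel_options might 
then look like the csh sketch below.  The WIEN_MPIRUN line is only 
illustrative; keep the command that siteconfig generated for your MPI 
installation:

```shell
# parallel_options (csh syntax) -- a sketch, not a drop-in file
setenv USE_REMOTE 1       # 1: start k-point parallel jobs on remote nodes via ssh
setenv MPI_REMOTE 0       # try 0 and 1: how the mpi executables themselves are launched
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
```

_NP_, _HOSTS_, and _EXEC_ are placeholders that the WIEN2k scripts 
substitute at run time with the process count, machines file, and 
program name.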

On 5/6/2015 11:23 AM, lung Fermin wrote:
> Thanks for the reply. Please see below.
>
>
> >As I asked before, did you give us all the error information in the 
> case.dayfile and from standard output?  It is not entirely clear in 
> your previous posts, but it looks to me that you might have only 
> provided information from the case.dayfile and the error files (cat 
> *.error), but maybe not from the standard output.  Are you still using 
> the PBS script in your old post at 
> http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html ? 
> In the script, I can see that the standard output is set to be written 
> to a file called wien2k_output.
>
>
> Sorry for the confusion. Yes, I still use the PBS script in the above 
> link. My earlier posts quoted the standard output (the wien2k_output 
> file). When using 2 nodes with 32 cores for one k point, the standard 
> output gives
> ----------------------------
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 
> z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 
> z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
> number of processors: 32
>  LAPW0 END
> [16] Failed to dealloc pd (Device or resource busy)
> [0] Failed to dealloc pd (Device or resource busy)
> [17] Failed to dealloc pd (Device or resource busy)
> [2] Failed to dealloc pd (Device or resource busy)
> [18] Failed to dealloc pd (Device or resource busy)
> [1] Failed to dealloc pd (Device or resource busy)
>  LAPW1 END
> LAPW2 - FERMI; weighs written
> [z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) 
> terminated with signal 9 -> abort job
> [z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 
> 9. MPI process died?
> [z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. 
> MPI process died?
> [z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 
> z1-17 aborted: Error while reading a PMI socket (4)
> [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file 
> descriptor 21. MPI process died?
> [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file 
> descriptor 21. MPI process died?
> [z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI 
> process died?
> cp: cannot stat `.in.tmp': No such file or directory
>
> >   stop error
> -----------------------------
> And the .dayfile reads:
>
> on z1-17 with PID 29439
> using WIEN2k_14.2 (Release 15/10/2014)
>
>     start (Thu Apr 30 17:36:59 2015) with lapw0 (40/99 to go)
>
>     cycle 1 (Thu Apr 30 17:36:59 2015)  (40/99 to go)
>
> >   lapw0 -p (17:36:59) starting parallel lapw0 at Thu Apr 30 17:36:59  2015
> -------- .machine0 : 32 processors
> 904.074u 8.710s 1:01.54 1483.2% 0+0k 239608+78400io 105pf+0w
> >   lapw1  -p   -c      (17:38:01) starting parallel lapw1 at Thu Apr 30 17:38:01 2015
> ->  starting parallel LAPW1 jobs at Thu Apr 30 17:38:01 2015
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
>      z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 
> z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
>  z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18(8) 469689.261u 
> 1680.003s 8:12:29.52 1595.1%      0+0k 204560+31265944io 366pf+0w
>    Summary of lapw1para:
>    z1-17 k=0     user=0  wallclock=0
> 469788.683u 1726.356s 8:12:31.33 1595.5%        0+0k 206128+31266512io 
> 379pf+0w
> >   lapw2 -p-c       (01:50:32) running LAPW2 in parallel mode
>       z1-17 0.034u 0.040s 1:35.16 0.0% 0+0k 10696+0io 80pf+0w
>    Summary of lapw2para:
>    z1-17 user=0.034      wallclock=95.16
> **  LAPW2 crashed!
> 4.645u 0.458s 1:42.01 4.9%      0+0k 74792+45008io 133pf+0w
> error: command /home/stretch/flung/DFT/WIEN2k/lapw2cpara -c lapw2.def 
> failed
>
> >   stop error
>
> -----------------------------
>
> >When it runs fine on a single node, does it always use the same node 
> (say z1-17) or does it run fine on other nodes (like z1-18)?
>
> Not really. The nodes were assigned randomly.

