[Wien] Error in mpi+k point parallelization across multiple nodes

lung Fermin ferminlung at gmail.com
Wed May 6 19:23:55 CEST 2015


Thanks for the reply. Please see below.


> As I asked before, did you give us all the error information in the
> case.dayfile and from standard output?  It is not entirely clear in your
> previous posts, but it looks to me that you might have only provided
> information from the case.dayfile and the error files (cat *.error), but
> maybe not from the standard output.  Are you still using the PBS script in
> your old post at
> http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html ?
> In the script, I can see that the standard output is set to be written to a
> file called wien2k_output.


Sorry for the confusion. Yes, I am still using the PBS script from the link
above, and the error messages in my previous posts were taken from the
standard output (the wien2k_output file). When running one k point across
2 nodes with 32 cores, the standard output gives the following (the
.machines setup for such a run is sketched just after this output):
----------------------------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
 LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
 LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated with signal 9 -> abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
cp: cannot stat `.in.tmp': No such file or directory

>   stop error
-----------------------------
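
For reference, a .machines file for this kind of run (one k point, MPI-parallel
over 32 cores on two 16-core nodes) is typically generated from $PBS_NODEFILE
along these lines. This is only a rough sketch, not the exact script from the
archive link; the variable name proclist is a placeholder:

----------------------------
#!/bin/csh -f
# Rough sketch only, not the actual job script from the archive link.
# Build a .machines file that puts all cores assigned by PBS
# (here 2 x 16 = 32) into one MPI-parallel lapw1/lapw2 job,
# i.e. a single k point spread over both nodes.
set proclist = `cat $PBS_NODEFILE`        # e.g. z1-17 (16x) and z1-18 (16x)
echo "granularity:1"   >  .machines
echo "1:$proclist"     >> .machines       # one job line -> one k point, MPI over all hosts
echo "lapw0:$proclist" >> .machines       # lapw0 also MPI-parallel over all hosts
echo "extrafine:1"     >> .machines
# ... the SCF cycle is then started with run_lapw -p
----------------------------

With a single "1:" line like this, lapw1para runs only one parallel job, which
is why the dayfile below reports "1 number_of_parallel_jobs".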
And the case.dayfile reads:

on z1-17 with PID 29439
using WIEN2k_14.2 (Release 15/10/2014)

    start       (Thu Apr 30 17:36:59 2015) with lapw0 (40/99 to go)

    cycle 1     (Thu Apr 30 17:36:59 2015)  (40/99 to go)

>   lapw0 -p    (17:36:59) starting parallel lapw0 at Thu Apr 30 17:36:59 2015
-------- .machine0 : 32 processors
904.074u 8.710s 1:01.54 1483.2% 0+0k 239608+78400io 105pf+0w
>   lapw1  -p   -c      (17:38:01) starting parallel lapw1 at Thu Apr 30 17:38:01 2015
->  starting parallel LAPW1 jobs at Thu Apr 30 17:38:01 2015
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
     z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
     z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18(8)
     469689.261u 1680.003s 8:12:29.52 1595.1% 0+0k 204560+31265944io 366pf+0w
   Summary of lapw1para:
   z1-17         k=0     user=0  wallclock=0
469788.683u 1726.356s 8:12:31.33 1595.5% 0+0k 206128+31266512io 379pf+0w
>   lapw2 -p   -c       (01:50:32) running LAPW2 in parallel mode
      z1-17 0.034u 0.040s 1:35.16 0.0% 0+0k 10696+0io 80pf+0w
   Summary of lapw2para:
   z1-17         user=0.034      wallclock=95.16
**  LAPW2 crashed!
4.645u 0.458s 1:42.01 4.9%      0+0k 74792+45008io 133pf+0w
error: command   /home/stretch/flung/DFT/WIEN2k/lapw2cpara -c lapw2.def failed

>   stop error

-----------------------------

> When it runs fine on a single node, does it always use the same node (say
> z1-17) or does it run fine on other nodes (like z1-18)?

No, it does not always use the same node; the nodes were assigned randomly by
the queue, and the single-node runs worked fine on whichever node they landed on.