[Wien] Error in mpi+k point parallelization across multiple nodes

lung Fermin ferminlung at gmail.com
Tue Apr 28 11:22:08 CEST 2015


Dear Wien2k community,

I am trying to perform a calculation on a system of ~100 inequivalent atoms
using MPI + k-point parallelization on a cluster. Everything goes fine when
the program is run on a single node. However, when I run the calculation
across multiple nodes, the following error occurs. How can I solve this
problem? I am a newbie to MPI programming, so any help would be appreciated.
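
For reference, the run is started in the standard WIEN2k parallel way from the
case directory containing the .machines file shown below; the path and the
convergence flags in this sketch are only illustrative, not copied from my
actual job script:

   cd /path/to/case                # case directory with case.struct and .machines
   run_lapw -p -i 40 -ec 0.0001    # -p switches on the k-point/MPI parallel mode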

The error message (MVAPICH2 2.0a):
---------------------------------------------------------------------------------------------------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
number of processors: 32
 LAPW0 END
[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13
aborted: Error while reading a PMI socket (4)
[z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
terminated with signal 9 -> abort job
[z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8.
MPI process died?
[z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12.
MPI process died?
[z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
terminated with signal 9 -> abort job
[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
aborted: MPI process error (1)
[cli_15]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15

>   stop error
------------------------------------------------------------------------------------------------------

The .machines file:
#
1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
granularity:1
extrafine:1
--------------------------------------------------------------------------------------------------------
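
As far as I understand the .machines syntax, each line starting with "1:"
defines one k-parallel job that is itself run as a 16-process MPI job on the
listed node, so the same request could presumably also be written in the more
compact host:count notation (my reading of the syntax, not tested here):

1:z1-2:16
1:z1-13:16
granularity:1
extrafine:1
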
The parallel_options:

setenv TASKSET "no"
setenv USE_REMOTE 0
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1

--------------------------------------------------------------------------------------------------------
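
I have omitted the WIEN_MPIRUN line above; for reference, a typical definition
for an mpirun_rsh-based MVAPICH2 setup would look roughly like the line below,
with the usual _NP_/_HOSTS_/_EXEC_ placeholders (a generic example, not
necessarily my exact line):

setenv WIEN_MPIRUN "mpirun_rsh -np _NP_ -hostfile _HOSTS_ _EXEC_"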

Thanks.

Regards,
Fermin