[Wien] Error in mpi+k point parallelization across multiple nodes

lung Fermin ferminlung at gmail.com
Mon May 4 05:18:46 CEST 2015


I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for
distributing the mpi job. However, the problem still persists, but the
error message looks different this time:

$> cat *.error
Error in LAPW2
**  testerror: Error in Parallel LAPW2

and the output on screen:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
 LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
 LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291)
terminated with signal 9 -> abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9.
MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17
aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
process died?
cp: cannot stat `.in.tmp': No such file or directory

>   stop error


------------------------------------------------------------------------------------------------------------

Try setting

setenv MPI_REMOTE 0

in parallel_options.
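
For reference, here is a minimal sketch of the resulting parallel_options,
assembled from the settings quoted later in this thread; only the
MPI_REMOTE value changes, and the mvapich2 path is whatever siteconfig
generated on your system:

setenv TASKSET "no"
setenv USE_REMOTE 0
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"

With MPI_REMOTE 0 the mpi job should be launched directly from the local
node rather than through an extra remote shell on the first host of the mpi
group (the MPI_REMOTE 1 behaviour), which removes one layer of remote
invocation and is often the easier setup with mvapich2's mpirun_rsh.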



On 29.04.2015 at 09:44, lung Fermin wrote:

> Thanks for your comment, Prof. Marks.

>

> Each node on the cluster has 32GB memory and each core (16 in total)

> on the node is limited to 2GB of memory usage. For the current system,

> I used RKMAX=6,  and the smallest RMT=2.25.

>

> I have tested the calculation with a single k point and mpi on 16 cores

> within a node. The matrix size from

>

> $ cat *.nmat_only

>

> is       29138

>

> Does this mean that the number of matrix elements is 29138 or (29138)^2?

> In general, how shall I estimate the memory required for a calculation?

>
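
(A rough way to answer this, assuming lapw1c_mpi holds complex Hamiltonian
and overlap matrices of dimension nmat: 29138 is the matrix dimension, not
the element count. Each matrix then has 29138^2 ~ 8.5x10^8 complex*16
elements, i.e. roughly 13-14 GB, so H and S together need on the order of
27 GB plus workspace. Spread over 16 ScaLAPACK ranks that is about 1.7 GB
per rank, right at the 2 GB/core limit and consistent with the ~5% of 32 GB
per core reported below; 32 ranks across two nodes should roughly halve the
per-rank share.)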

> I have also checked the memory usage with "top" on the node. Each core

> has used up ~5% of the memory and this adds up to ~5*16% on the node.

> Perhaps the problem is really caused by running out of memory. I am

> now queuing on the cluster to test for the case of mpi over 32 cores

> (2 nodes).

>

> Thanks.

>

> Regards,

> Fermin

>

> ----------------------------------------------------------------------

>

> As an addendum, the calculation may be too big for a single node. How

> much memory does the node have, what is the RKMAX, the smallest RMT &

> unit cell size? Maybe use in your machines file

>

> 1:z1-2:16 z1-13:16

> lapw0: z1-2:16 z1-13:16

> granularity:1

> extrafine:1
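
Roughly speaking, each "1:" line in .machines defines one k-parallel job,
and all hosts listed on that line form a single mpi run for that job, so
the file suggested above gives one 32-process mpi job spanning both nodes.
A hedged sketch of the alternative (equivalent to the .machines file quoted
at the end of this thread, written with the :16 shorthand), which keeps
each mpi job inside one node and splits the k-points between the two jobs:

1:z1-2:16
1:z1-13:16
granularity:1
extrafine:1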

>

> Check the size using

> x lapw1 -c -p -nmat_only

> cat *.nmat_only

>

> ___________________________

> Professor Laurence Marks

> Department of Materials Science and Engineering, Northwestern University
>
> www.numis.northwestern.edu
>
> MURI4D.numis.northwestern.edu

> Co-Editor, Acta Cryst A

> "Research is to see what everybody else has seen, and to think what

> nobody else has thought"

> Albert Szent-Gyorgi

>

> On Apr 28, 2015 10:45 PM, "Laurence Marks" <L-marks at northwestern.edu> wrote:

>

> Unfortunately it is hard to know what is going on. A google search on

> "Error while reading PMI socket." indicates that the message you have

> means it did not work, and is not specific. Some suggestions:

>

> a) Try mpiexec (slightly different arguments). You just edit

> parallel_options (a sketch of the adjusted WIEN_MPIRUN line follows this list).

> https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager

> b) Try an older version of mvapich2 if it is on the system.

> c) Do you have to launch mpdboot for your system

> https://wiki.calculquebec.ca/w/MVAPICH2/en?

> d) Talk to a sys_admin, particularly the one who set up mvapich

> e) Do "cat *.error", maybe something else went wrong or it is not

> mpi's fault but a user error.
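
For option (a), a hedged sketch of what the WIEN_MPIRUN line in
parallel_options might look like with an MPICH/Hydra-style mpiexec (the
-f/-n flags follow the Hydra convention; the exact binary name and path on
your cluster are assumptions):

setenv WIEN_MPIRUN "mpiexec -f _HOSTS_ -n _NP_ _EXEC_"

The WIEN2k scripts substitute _NP_, _HOSTS_ and _EXEC_ at run time, so only
this one line needs to change.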

>

> ___________________________

> Professor Laurence Marks

> Department of Materials Science and Engineering, Northwestern University
>
> www.numis.northwestern.edu
>
> MURI4D.numis.northwestern.edu

> Co-Editor, Acta Cryst A

> "Research is to see what everybody else has seen, and to think what

> nobody else has thought"

> Albert Szent-Gyorgi

>

> On Apr 28, 2015 10:17 PM, "lung Fermin" <ferminlung at gmail.com> wrote:

>

> Thanks for Prof. Marks' comment.

>

> 1. In the previous email, I forgot to include the line

>

> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"

>

> It was in the parallel_options file. Sorry about that.

>

> 2. I have checked that the running program was lapw1c_mpi. Besides,

> when the mpi calculation was done on a single node for some other

> system, the results are consistent with the literature. So I believe

> that the mpi code has been set up and compiled properly.

>

> Would there be something wrong with my options in siteconfig? Do I
>
> have to set some command to bind the job? Any other possible cause of
> the error?

>

> Any suggestions or comments would be appreciated. Thanks.

>

> Regards,

>

> Fermin

>

> ----------------------------------------------------------------------

>

> You appear to be missing the line

>

> setenv WIEN_MPIRUN=...

>

> This is setup when you run siteconfig, and provides the information on

> how mpi is run on your system.

>

> N.B., did you set up and compile the mpi code?
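
(A quick way to check, assuming a standard installation: "ls $WIENROOT/*_mpi"
should list lapw0_mpi, lapw1_mpi/lapw1c_mpi and lapw2_mpi/lapw2c_mpi if the
mpi binaries were compiled.)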

>

> ___________________________

> Professor Laurence Marks

> Department of Materials Science and Engineering, Northwestern University
>
> www.numis.northwestern.edu
>
> MURI4D.numis.northwestern.edu

> Co-Editor, Acta Cryst A

> "Research is to see what everybody else has seen, and to think what

> nobody else has thought"

> Albert Szent-Gyorgi

>

> On Apr 28, 2015 4:22 AM, "lung Fermin" <ferminlung at gmail.com> wrote:

>

> Dear Wien2k community,

>

> I am trying to perform a calculation on a system of ~100 inequivalent
>
> atoms using mpi+k point parallelization on a cluster. Everything goes
>
> fine when the program is run on a single node. However, if I perform
>
> the calculation across different nodes, the following error occurs. How
>
> can I solve this problem? I am a newbie to mpi programming; any help
>
> would be appreciated. Thanks.

>

> The error message (MVAPICH2 2.0a):

>

> ----------------------------------------------------------------------

>

> Warning: no access to tty (Bad file descriptor).

>

> Thus no job control in this shell.

>

> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
>
> z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>
> z1-13 z1-13 z1-13 z1-13 z1-13 z1-13

>

> number of processors: 32

>

>   LAPW0 END

>

> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node

> z1-13 aborted: Error while reading a PMI socket (4)

>

> [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)

> terminated with signal 9 -> abort job

>

> [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor

> 8. MPI process died?

>

> [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket.

> MPI process died?

>

> [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor

> 12. MPI process died?

>

> [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket.

> MPI process died?

>

> [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)

> terminated with signal 9 -> abort job

>

> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
>
> z1-2 aborted: MPI process error (1)

>

> [cli_15]: aborting job:

>

> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15

>

>>   stop error

>

> ----------------------------------------------------------------------

>

> The .machines file:

>

> #

>

> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
>
> z1-2 z1-2 z1-2
>
> 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>
> z1-13 z1-13 z1-13 z1-13 z1-13

>

> granularity:1

>

> extrafine:1

>

> ----------------------------------------------------------------------

>

> The parallel_options:

>

> setenv TASKSET "no"

>

> setenv USE_REMOTE 0

>

> setenv MPI_REMOTE 1

>

> setenv WIEN_GRANULARITY 1

>