[Wien] Error in mpi+k point parallelization across multiple nodes
lung Fermin
ferminlung at gmail.com
Mon May 4 05:18:46 CEST 2015
I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for
distributing the mpi job. However, the problem still persists, although the
error message looks different this time:
$> cat *.error
Error in LAPW2
** testerror: Error in Parallel LAPW2
and the output on screen:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291)
terminated with signal 9 -> abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9.
MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17
aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
process died?
cp: cannot stat `.in.tmp': No such file or directory
> stop error
------------------------------------------------------------------------------------------------------------
Try setting
setenv MPI_REMOTE 0
in parallel options.
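
For reference, a minimal sketch of that change, assuming the stock parallel_options layout quoted at the bottom of this thread (only the MPI_REMOTE line is altered; not verified on this particular cluster):

setenv TASKSET "no"
setenv USE_REMOTE 0
# 0 = start mpirun on the local node with the full host list; 1 = ssh to the remote node of each MPI job first
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
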
On 29.04.2015 at 09:44, lung Fermin wrote:
> Thanks for your comment, Prof. Marks.
>
> Each node on the cluster has 32GB of memory, and each of the 16 cores
> on a node is limited to 2GB. For the current system, I used RKMAX=6,
> and the smallest RMT is 2.25.
>
> I have tested the calculation with a single k point and mpi on 16 cores
> within a node. The matrix size from
>
> $ cat *.nmat_only
>
> is 29138
>
> Does this mean that the number of matrix elements is 29138 or (29138)^2?
> In general, how shall I estimate the memory required for a calculation?
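
As a rough back-of-envelope (assuming case.nmat_only reports the matrix dimension N, so LAPW1 holds an N x N complex*16 Hamiltonian plus an equally large overlap matrix; the numbers below are only an order-of-magnitude sketch):

# one N x N complex*16 matrix, N = 29138: N*N*16 bytes, in GiB
echo "29138^2 * 16 / 1024^3" | bc -l
# Hamiltonian plus overlap together, in GiB
echo "2 * 29138^2 * 16 / 1024^3" | bc -l

That is roughly 12.7 GiB per matrix, or about 25 GiB for both, which ScaLAPACK distributes over the MPI ranks; spread over 16 ranks that is already ~1.6 GiB per core before any other arrays, uncomfortably close to a 2 GB/core limit.
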
>
> I have also checked the memory usage with "top" on the node. Each core
> used ~5% of the memory, which adds up to roughly 5 x 16 = 80% on the
> node. Perhaps the problem is really caused by running out of memory. I
> am now queuing on the cluster to test the case of mpi over 32 cores
> (2 nodes).
>
> Thanks.
>
> Regards,
> Fermin
>
> ----------------------------------------------------------------------------------------------------
>
> As an addendum, the calculation may be too big for a single node. How
> much memory does the node have, what is the RKMAX, the smallest RMT &
> unit cell size? Maybe use in your .machines file:
>
> 1:z1-2:16 z1-13:16
> lapw0: z1-2:16 z1-13:16
> granularity:1
> extrafine:1
>
> Check the size using
> x lapw1 -c -p -nmat_only
> cat *.nmat_only
>
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering, Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Gyorgi
>
> On Apr 28, 2015 10:45 PM, "Laurence Marks" <L-marks at northwestern.edu> wrote:
>
> Unfortunately it is hard to know what is going on. A Google search on
> "Error while reading PMI socket." indicates that the message you have
> is not specific; it only means that something did not work. Some suggestions:
>
> a) Try mpiexec (slightly different arguments); you just edit
> parallel_options (see the sketch after this list).
> https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
> b) Try an older version of mvapich2 if it is on the system.
> c) Do you have to launch mpdboot on your system? See
> https://wiki.calculquebec.ca/w/MVAPICH2/en
> d) Talk to a sys_admin, particularly the one who set up mvapich.
> e) Do "cat *.error"; maybe something else went wrong, or it is not
> mpi's fault but a user error.
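
A minimal sketch of suggestion (a), assuming MVAPICH2's Hydra launcher (mpiexec) sits in the same directory as the mpirun used elsewhere in this thread and accepts the usual -np and -f options (check `mpiexec --help` on your cluster); the WIEN_MPIRUN line in parallel_options would then read roughly:

# hypothetical mpiexec variant -- path and options not verified on this cluster
setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f _HOSTS_ _EXEC_"

The _NP_, _HOSTS_ and _EXEC_ placeholders are substituted by the WIEN2k parallel scripts, exactly as with the mpirun line quoted further down in this thread.
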
>
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering, Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Gyorgi
>
> On Apr 28, 2015 10:17 PM, "lung Fermin" <ferminlung at gmail.com> wrote:
>
> Thanks for Prof. Marks' comment.
>
> 1. In the previous email, I missed copying the line
>
> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
>
> It was in parallel_options. Sorry about that.
>
> 2. I have checked that the running program was lapw1c_mpi. Besides,
> when the mpi calculation was done on a single node for some other
> system, the results were consistent with the literature. So I believe
> that the mpi code has been set up and compiled properly.
>
> Could there be something wrong with my options in siteconfig? Do I have
> to set some command to bind the job? Is there any other possible cause
> of the error?
>
> Any suggestions or comments would be appreciated. Thanks.
>
> Regards,
>
> Fermin
>
> ----------------------------------------------------------------------------------------------------
>
> You appear to be missing the line
>
> setenv WIEN_MPIRUN=...
>
> This is set up when you run siteconfig, and it provides the information
> on how mpi is run on your system.
>
> N.B., did you set up and compile the mpi code?
>
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering, Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Gyorgi
>
> On Apr 28, 2015 4:22 AM, "lung Fermin" <ferminlung at gmail.com> wrote:
>
> Dear Wien2k community,
>
> I am trying to perform a calculation on a system of ~100 inequivalent
> atoms using mpi + k-point parallelization on a cluster. Everything goes
> fine when the program is run on a single node. However, if I perform
> the calculation across different nodes, the following error occurs. How
> can I solve this problem? I am a newbie to mpi programming; any help
> would be appreciated. Thanks.
>
> The error message (MVAPICH2 2.0a):
>
> ----------------------------------------------------------------------------------------------------
>
> Warning: no access to tty (Bad file descriptor).
>
> Thus no job control in this shell.
>
> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
>
> z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>
> number of processors: 32
>
> LAPW0 END
>
> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> z1-13 aborted: Error while reading a PMI socket (4)
>
> [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
> terminated with signal 9 -> abort job
>
> [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> 8. MPI process died?
>
> [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
>
> [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> 12. MPI process died?
>
> [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
>
> [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
> terminated with signal 9 -> abort job
>
> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> z1-2
> aborted: MPI process error (1)
>
> [cli_15]: aborting job:
>
> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15
>
>> stop error
>
> ----------------------------------------------------------------------------------------------------
>
> The .machines file:
>
> #
>
> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
>
> 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>
> granularity:1
>
> extrafine:1
>
> ----------------------------------------------------------------------------------------------------
>
> The parallel_options:
>
> setenv TASKSET "no"
>
> setenv USE_REMOTE 0
>
> setenv MPI_REMOTE 1
>
> setenv WIEN_GRANULARITY 1
>