[Wien] Error in mpi+k point parallelization across multiple nodes

Laurence Marks L-marks at northwestern.edu
Mon May 4 05:46:51 CEST 2015


I suspect that there is something wrong with your IB (InfiniBand) and/or how
it has been installed. I doubt anyone on the list can help you, as it sounds
like an OS problem. If you provide the struct file, someone might be able to
check that it is not a setup problem.

1) Try mpiexec (a sketch follows below).
2) Post to the mvapich2 list.
3) Get help from your sys admin.
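
For example, a minimal sketch of 1): the mpiexec path and hydra-style flags
here are assumptions patterned on the mvapich2 install quoted further down,
so adapt them to your system. In parallel_options the WIEN_MPIRUN line would
become something like

setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -n _NP_ -f _HOSTS_ _EXEC_"

where hydra's mpiexec takes -n for the process count and -f for the hostfile
(see the MPICH wiki link in the older messages below).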

___________________________
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi
On May 3, 2015 10:19 PM, "lung Fermin" <ferminlung at gmail.com> wrote:

>  I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for
> distributing the mpi job. However, the problem still persists... but the
> error message looks different this time:
>
>  $> cat *.error
> Error in LAPW2
> **  testerror: Error in Parallel LAPW2
>
>  and the output on screen:
>  Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
> z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
> z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
> number of processors: 32
>  LAPW0 END
> [16] Failed to dealloc pd (Device or resource busy)
> [0] Failed to dealloc pd (Device or resource busy)
> [17] Failed to dealloc pd (Device or resource busy)
> [2] Failed to dealloc pd (Device or resource busy)
> [18] Failed to dealloc pd (Device or resource busy)
> [1] Failed to dealloc pd (Device or resource busy)
>  LAPW1 END
> LAPW2 - FERMI; weighs written
> [z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291)
> terminated with signal 9 -> abort job
> [z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9.
> MPI process died?
> [z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
> [z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17
> aborted: Error while reading a PMI socket (4)
> [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor
> 21. MPI process died?
> [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor
> 21. MPI process died?
> [z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
> process died?
> cp: cannot stat `.in.tmp': No such file or directory
>
>  >   stop error
>
>
>
> ------------------------------------------------------------------------------------------------------------
>
> Try setting
>
> setenv MPI_REMOTE 0
>
> in parallel_options.
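>
> A minimal sketch of the resulting file (the surrounding lines are copied
> from the poster's parallel_options quoted at the bottom of this thread):
>
> setenv TASKSET "no"
> setenv USE_REMOTE 0
> setenv MPI_REMOTE 0   # changed from: setenv MPI_REMOTE 1
> setenv WIEN_GRANULARITY 1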
>
> On 29.04.2015 at 09:44, lung Fermin wrote:
>
> > Thanks for your comment, Prof. Marks.
> >
> > Each node on the cluster has 32 GB of memory, and each core (16 in
> > total) on the node is limited to 2 GB of memory usage. For the current
> > system, I used RKMAX=6 and the smallest RMT=2.25.
> >
> > I have tested the calculation with a single k point and mpi on 16
> > cores within a node. The matrix size from
> >
> > $ cat *.nmat_only
> >
> > is 29138.
> >
> > Does this mean that the number of matrix elements is 29138 or
> > (29138)^2? In general, how shall I estimate the memory required for a
> > calculation?
> >
> > I have also checked the memory usage with "top" on the node. Each core
> > used up ~5% of the memory, which adds up to ~5%*16 = 80% on the node.
> > Perhaps the problem is really caused by running out of memory... I am
> > now queuing on the cluster to test the case of mpi over 32 cores
> > (2 nodes).
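> >
> > (A rough estimate, assuming nmat_only reports the matrix dimension N
> > rather than the element count, and the usual complex*16 storage of the
> > Hamiltonian and overlap matrices:
> >
> >   N = 29138
> >   one N x N complex*16 matrix: 29138^2 * 16 B ~ 13.6 GB
> >   H and S together: ~ 27 GB, distributed over the MPI ranks,
> >   i.e. ~ 1.7 GB per core on 16 cores - right at the 2 GB/core limit.)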
> >
> > Thanks.
> >
> > Regards,
> > Fermin
> >
> > ----------------------------------------------------------------------
> >
> > As an addendum, the calculation may be too big for a single node. How
> > much memory does the node have, and what are the RKMAX, the smallest
> > RMT, and the unit cell size? Maybe use in your .machines file
> >
> > 1:z1-2:16 z1-13:16
> > lapw0: z1-2:16 z1-13:16
> > granularity:1
> > extrafine:1
> >
> > Check the size using
> >
> > x lapw1 -c -p -nmat_only
> > cat *.nmat_only
> >
> > ___________________________
> > Professor Laurence Marks
> > Department of Materials Science and Engineering
> > Northwestern University
> > www.numis.northwestern.edu
> > MURI4D.numis.northwestern.edu
> > Co-Editor, Acta Cryst A
> > "Research is to see what everybody else has seen, and to think what
> > nobody else has thought"
> > Albert Szent-Gyorgi
> >
> > On Apr 28, 2015 10:45 PM, "Laurence Marks" <L-marks at northwestern.edu> wrote:
> >
> > Unfortunately it is hard to know what is going on. A google search on
> > "Error while reading PMI socket." indicates that the message you have
> > means it did not work, and is not specific. Some suggestions:
> >
> > a) Try mpiexec (slightly different arguments). You just edit
> > parallel_options.
> > https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
> > b) Try an older version of mvapich2 if it is on the system.
> > c) Do you have to launch mpdboot for your system? (a sketch follows
> > below) https://wiki.calculquebec.ca/w/MVAPICH2/en
> > d) Talk to a sys_admin, particularly the one who set up mvapich.
> > e) Do "cat *.error"; maybe something else went wrong, or it is not
> > mpi's fault but a user error.
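> >
> > (For c), a sketch assuming the mvapich2 build uses the old MPD process
> > manager; mpirun_rsh and hydra builds do not need this:
> >
> > mpdboot -n 2 -f mpd.hosts
> >
> > where mpd.hosts is a hypothetical file listing one node name per line.)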
> >
> > ___________________________
> > Professor Laurence Marks
> > Department of Materials Science and Engineering
> > Northwestern University
> > www.numis.northwestern.edu
> > MURI4D.numis.northwestern.edu
> > Co-Editor, Acta Cryst A
> > "Research is to see what everybody else has seen, and to think what
> > nobody else has thought"
> > Albert Szent-Gyorgi
> >
> > On Apr 28, 2015 10:17 PM, "lung Fermin" <ferminlung at gmail.com> wrote:
> >
> > Thanks for Prof. Marks' comment.
> >
> > 1. In the previous email, I missed copying the line
> >
> > setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
> >
> > It was in the parallel_options. Sorry about that.
> >
> > 2. I have checked that the running program was lapw1c_mpi. Besides,
> > when the mpi calculation was done on a single node for some other
> > system, the results were consistent with the literature. So I believe
> > that the mpi code has been set up and compiled properly.
> >
> > Would there be something wrong with my options in siteconfig? Do I
> > have to set some command to bind the job? Any other possible cause of
> > the error?
> >
> > Any suggestions or comments would be appreciated. Thanks.
> >
> > Regards,
> > Fermin
> >
> > ----------------------------------------------------------------------
> >
> > You appear to be missing the line
> >
> > setenv WIEN_MPIRUN ...
> >
> > This is set up when you run siteconfig, and provides the information
> > on how mpi is run on your system.
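> >
> > For an mvapich2 mpirun_rsh-style launcher it typically looks like the
> > line quoted earlier in this thread (the install path is site-specific):
> >
> > setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
> >
> > where the WIEN2k parallel scripts substitute _NP_, _HOSTS_ and _EXEC_
> > at run time.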
> >
> > N.B., did you set up and compile the mpi code?
> >
> > ___________________________
> > Professor Laurence Marks
> > Department of Materials Science and Engineering
> > Northwestern University
> > www.numis.northwestern.edu
> > MURI4D.numis.northwestern.edu
> > Co-Editor, Acta Cryst A
> > "Research is to see what everybody else has seen, and to think what
> > nobody else has thought"
> > Albert Szent-Gyorgi
> >
> > On Apr 28, 2015 4:22 AM, "lung Fermin" <ferminlung at gmail.com> wrote:
> >
> > Dear Wien2k community,
> >
> > I am trying to perform a calculation on a system of ~100 inequivalent
> > atoms using mpi + k-point parallelization on a cluster. Everything
> > goes fine when the program is run on a single node. However, if I
> > perform the calculation across different nodes, the following error
> > occurs. How can I solve this problem? I am a newbie to mpi
> > programming; any help would be appreciated. Thanks.
> >
> > The error message (MVAPICH2 2.0a):
> >
> > ----------------------------------------------------------------------
> >
> > Warning: no access to tty (Bad file descriptor).
> > Thus no job control in this shell.
> > z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> > z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> > z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> > number of processors: 32
> >   LAPW0 END
> > [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> > z1-13 aborted: Error while reading a PMI socket (4)
> > [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
> > terminated with signal 9 -> abort job
> > [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> > 8. MPI process died?
> > [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> > MPI process died?
> > [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> > 12. MPI process died?
> > [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> > MPI process died?
> > [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
> > terminated with signal 9 -> abort job
> > [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> > z1-2 aborted: MPI process error (1)
> > [cli_15]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15
> >
> > >   stop error
> >
> > ----------------------------------------------------------------------
> >
> > The .machines file:
> >
> > #
> > 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> > 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> > granularity:1
> > extrafine:1
> >
> > ----------------------------------------------------------------------
> >
> > The parallel_options:
> >
> > setenv TASKSET "no"
> > setenv USE_REMOTE 0
> > setenv MPI_REMOTE 1
> > setenv WIEN_GRANULARITY 1

