[Wien] Error in mpi+k point parallelization across multiple nodes

Peter Blaha pblaha at theochem.tuwien.ac.at
Thu Apr 30 07:51:43 CEST 2015


Try setting
setenv MPI_REMOTE 0
in parallel_options.
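For reference, a minimal sketch of the relevant part of $WIENROOT/parallel_options after this change (the remaining lines stay whatever siteconfig generated for your installation):

```shell
# parallel_options uses csh syntax and is sourced by WIEN2k's *para scripts.
# MPI_REMOTE 0 starts mpirun on the node where lapw1para itself runs and
# lets the hostfile place the ranks, instead of first ssh-ing to the remote
# node to launch mpirun there (which can fail across nodes).
setenv MPI_REMOTE 0
```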

On 29.04.2015 at 09:44, lung Fermin wrote:
> Thanks for your comment, Prof. Marks.
>
> Each node on the cluster has 32 GB of memory, and each of the 16 cores
> on a node is limited to 2 GB. For the current system I used RKMAX=6 and
> the smallest RMT=2.25.
>
> I have tested the calculation with single k point and mpi on 16 cores
> within a node. The matrix size from
>
> $ cat *.nmat_only
>
> is       29138
>
> Does this mean that the number of matrix elements is 29138 or (29138)^2?
> In general, how shall I estimate the memory required for a calculation?
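A rough back-of-envelope estimate can be made if one assumes that nmat_only reports the matrix *dimension* (not the element count) and that lapw1c_mpi stores a complex Hamiltonian and overlap matrix in double precision (16 bytes per element); the constants below are illustrative, not WIEN2k internals:

```python
# Back-of-envelope memory estimate for a generalized eigenvalue problem
# of dimension NMAT (assumption: nmat_only reports the dimension, so each
# complex matrix holds NMAT**2 elements of 16 bytes).
nmat = 29138
bytes_per_elem = 16  # one complex double

one_matrix_gib = nmat**2 * bytes_per_elem / 1024**3
total_gib = 2 * one_matrix_gib  # Hamiltonian H plus overlap S

print(f"one matrix: {one_matrix_gib:.1f} GiB")  # ~12.7 GiB
print(f"H + S:      {total_gib:.1f} GiB")       # ~25.3 GiB

# Spread over 16 mpi ranks with ScaLAPACK's block-cyclic distribution,
# that is already ~1.6 GiB per core before any workspace is counted --
# uncomfortably close to a 2 GB/core limit.
print(f"per core:   {total_gib / 16:.2f} GiB")
```

This suggests a single 16-core node is marginal for this matrix size, consistent with the out-of-memory suspicion above.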
>
> I have also checked the memory usage with "top" on the node. Each core
> used ~5% of the node's memory, which adds up to roughly 80% (5% x 16)
> across the node. Perhaps the problem is really caused by running out of
> memory. I am now queuing on the cluster to test the case of mpi over 32
> cores (2 nodes).
>
> Thanks.
>
> Regards,
> Fermin
>
> ----------------------------------------------------------------------------------------------------------------
>
> As an addendum, the calculation may be too big for a single node. How
> much memory does the node have, what is the RKMAX, the smallest RMT &
> unit cell size? Maybe use in your machines file
>
> 1:z1-2:16 z1-13:16
> lapw0: z1-2:16 z1-13:16
> granularity:1
> extrafine:1
>
> Check the size using
> x lapw1 -c -p -nmat_only
> cat *.nmat_only
>
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Gyorgi
>
> On Apr 28, 2015 10:45 PM, "Laurence Marks" <L-marks at northwestern.edu> wrote:
>
> Unfortunately it is hard to know what is going on. A Google search for
> "Error while reading PMI socket." indicates that this message only means
> the job failed; it is not specific. Some suggestions:
>
> a) Try mpiexec (slightly different arguments). You just edit
> parallel_options.
> https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
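For instance, a hypothetical WIEN_MPIRUN line using Hydra's mpiexec (a sketch only: the installation path is copied from the mpirun line quoted later in this thread, and the exact hostfile flag depends on the mpiexec version installed):

```shell
# csh syntax for WIEN2k's parallel_options.  Hydra's mpiexec takes a
# hostfile via -f; _NP_, _HOSTS_ and _EXEC_ are placeholders that the
# WIEN2k parallel scripts substitute at run time.
setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f _HOSTS_ _EXEC_"
```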
> b) Try an older version of mvapich2 if it is on the system.
> c) Do you have to launch mpdboot on your system? See
> https://wiki.calculquebec.ca/w/MVAPICH2/en
> d) Talk to a sys_admin, particularly the one who set up mvapich2.
> e) Do "cat *.error", maybe something else went wrong or it is not mpi's
> fault but a user error.
>
>
> On Apr 28, 2015 10:17 PM, "lung Fermin" <ferminlung at gmail.com> wrote:
>
> Thanks for Prof. Marks' comment.
>
> 1. In the previous email, I missed copying the line
>
> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_
> -hostfile _HOSTS_ _EXEC_"
>
> It was in parallel_options. Sorry about that.
>
> 2. I have checked that the running program was lapw1c_mpi. Besides, when
> the mpi calculation was done on a single node for some other system, the
> results were consistent with the literature. So I believe that the mpi
> code has been set up and compiled properly.
>
> Could there be something wrong with my options in siteconfig? Do I have
> to set some command to bind the job? Is there any other possible cause of
> the error?
>
> Any suggestions or comments would be appreciated. Thanks.
>
> Regards,
>
> Fermin
>
> ----------------------------------------------------------------------------------------------------
>
> You appear to be missing the line
>
> setenv WIEN_MPIRUN "..."
>
> This is set up when you run siteconfig, and provides the information on
> how mpi is run on your system.
>
> N.B., did you set up and compile the mpi code?
>
>
> On Apr 28, 2015 4:22 AM, "lung Fermin" <ferminlung at gmail.com> wrote:
>
> Dear Wien2k community,
>
> I am trying to perform a calculation on a system of ~100 inequivalent
> atoms using mpi + k-point parallelization on a cluster. Everything goes
> fine when the program is run on a single node. However, if I perform the
> calculation across different nodes, the following error occurs. How can I
> solve this problem? I am a newbie to mpi programming; any help would be
> appreciated. Thanks.
>
> The error message (MVAPICH2 2.0a):
>
> ---------------------------------------------------------------------------------------------------
>
> Warning: no access to tty (Bad file descriptor).
>
> Thus no job control in this shell.
>
> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>
> number of processors: 32
>
>   LAPW0 END
>
> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> z1-13 aborted: Error while reading a PMI socket (4)
>
> [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
> terminated with signal 9 -> abort job
>
> [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> 8. MPI process died?
>
> [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
>
> [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> 12. MPI process died?
>
> [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
>
> [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
> terminated with signal 9 -> abort job
>
> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
> aborted: MPI process error (1)
>
> [cli_15]: aborting job:
>
> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15
>
>>   stop error
>
> ------------------------------------------------------------------------------------------------------
>
> The .machines file:
>
> #
>
> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> z1-2 z1-2
>
> 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> z1-13 z1-13 z1-13 z1-13 z1-13
>
> granularity:1
>
> extrafine:1
>
> --------------------------------------------------------------------------------------------------------
>
> The parallel_options:
>
> setenv TASKSET "no"
>
> setenv USE_REMOTE 0
>
> setenv MPI_REMOTE 1
>
> setenv WIEN_GRANULARITY 1
>
> --------------------------------------------------------------------------------------------------------
>
> Thanks.
>
> Regards,
>
> Fermin
>
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>

-- 
Peter Blaha
Inst.Materials Chemistry
TU Vienna
Getreidemarkt 9
A-1060 Vienna
Austria
+43-1-5880115671

