[Wien] Error in mpi+k point parallelization across multiple nodes

Laurence Marks L-marks at northwestern.edu
Wed Apr 29 05:45:04 CEST 2015


Unfortunately it is hard to know what is going on. A Google search for
"Error while reading PMI socket." indicates that this message simply means
the run failed; it is not specific. Some suggestions:

a) Try mpiexec (it takes slightly different arguments). You just edit
parallel_options.
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
b) Try an older version of mvapich2 if it is on the system.
c) Do you have to launch mpdboot on your system?
https://wiki.calculquebec.ca/w/MVAPICH2/en
d) Talk to a sysadmin, particularly the one who set up mvapich.
e) Run "cat *.error"; maybe something else went wrong, or it is not mpi's
fault but a user error.
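For (a), the change is a one-line edit to parallel_options. A minimal sketch, assuming mpiexec sits in the same mvapich2 bin directory as the mpirun already in use (Hydra's mpiexec takes -n and -f where mpirun_rsh-style launchers take -np and -hostfile):

```shell
# Hypothetical replacement for the WIEN_MPIRUN line in parallel_options:
# switch to Hydra's mpiexec, which uses -n (process count) and
# -f (hostfile) instead of -np and -hostfile. The path is an assumption
# based on the mpirun path quoted below; adjust to your installation.
setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -n _NP_ -f _HOSTS_ _EXEC_"
```

WIEN2k substitutes _NP_, _HOSTS_, and _EXEC_ itself at run time, so only the launcher name and its flags change.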

___________________________
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Györgyi
On Apr 28, 2015 10:17 PM, "lung Fermin" <ferminlung at gmail.com> wrote:

>  Thanks for Prof. Marks' comment.
>
> 1. In the previous email, I missed copying the line
>
> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile
> _HOSTS_ _EXEC_"
> It was in parallel_options. Sorry about that.
>
> 2. I have checked that the running program was lapw1c_mpi. Besides, when
> the mpi calculation was run on a single node for another system, the
> results were consistent with the literature. So I believe that the mpi code
> has been set up and compiled properly.
>
> Would there be something wrong with my options in siteconfig? Do I have
> to set some command to bind the job? Any other possible cause of the error?
>
> Any suggestions or comments would be appreciated. Thanks.
>
>
>  Regards,
>
> Fermin
>
>
> ----------------------------------------------------------------------------------------------------
>
> You appear to be missing the line
>
> setenv WIEN_MPIRUN ...
>
> This is set up when you run siteconfig, and provides the information on how
> mpi is run on your system.
>
> N.B., did you set up and compile the mpi code?
>
>
> On Apr 28, 2015 4:22 AM, "lung Fermin" <ferminlung at gmail.com> wrote:
>
> Dear Wien2k community,
>
>
>
> I am trying to perform a calculation on a system of ~100 inequivalent atoms
> using mpi + k-point parallelization on a cluster. Everything goes fine when
> the program is run on a single node. However, if I perform the calculation
> across different nodes, the following error occurs. How can I solve this
> problem? I am a newbie to mpi programming; any help would be appreciated.
> Thanks.
>
>
>
> The error message (MVAPICH2 2.0a):
>
>
> ---------------------------------------------------------------------------------------------------
>
> Warning: no access to tty (Bad file descriptor).
>
> Thus no job control in this shell.
>
> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> z1-13 z1-13 z1-13 z1-13 z1-13
>
> number of processors: 32
>
>  LAPW0 END
>
> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13
> aborted: Error while reading a PMI socket (4)
>
> [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
> terminated with signal 9 -> abort job
>
> [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8.
> MPI process died?
>
> [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
>
> [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12.
> MPI process died?
>
> [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
> process died?
>
> [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
> terminated with signal 9 -> abort job
>
> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
> aborted: MPI process error (1)
>
> [cli_15]: aborting job:
>
> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15
>
>
>
> >   stop error
>
>
> ------------------------------------------------------------------------------------------------------
>
>
>
> The .machines file:
>
> #
>
> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> z1-2 z1-2
>
> 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> z1-13 z1-13 z1-13 z1-13
>
> granularity:1
>
> extrafine:1
>
>
> --------------------------------------------------------------------------------------------------------
>
> The parallel_options:
>
>
>
> setenv TASKSET "no"
>
> setenv USE_REMOTE 0
>
> setenv MPI_REMOTE 1
>
> setenv WIEN_GRANULARITY 1
>
>
>
>
> --------------------------------------------------------------------------------------------------------
>
>
>
> Thanks.
>
>
>
> Regards,
>
> Fermin
>
