[Wien] Error in mpi+k point parallelization across multiple nodes

Laurence Marks L-marks at northwestern.edu
Wed Apr 29 08:16:46 CEST 2015


As an addendum, the calculation may be too big for a single node. How much
memory does the node have, and what are RKMAX, the smallest RMT and the unit
cell size? Maybe use in your .machines file (a single 32-core MPI job across
both nodes, so each node holds only part of the matrices):

1:z1-2:16 z1-13:16
lapw0: z1-2:16 z1-13:16
granularity:1
extrafine:1

Check the size using
x lapw1 -c -p -nmat_only
cat *.nmat
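
As a rough guide (my own rule of thumb, not something stated in this thread):
in the complex (-c) case lapw1 sets up H and S matrices of dimension NMAT with
16-byte complex elements, i.e. roughly 2*16*NMAT^2 bytes in total, which the
mpi version distributes over all cores of one job. Assuming the first number
in the file produced by -nmat_only is NMAT, something like

awk '{nmat=$1; printf "NMAT=%d -> ~%.1f GB for H+S\n", nmat, 2*16*nmat*nmat/1e9; exit}' *.nmat*

gives a ballpark figure to compare with the memory available per node.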

___________________________
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi
On Apr 28, 2015 10:45 PM, "Laurence Marks" <L-marks at northwestern.edu> wrote:

> Unfortunately it is hard to know what is going on. A google search on
> "Error while reading PMI socket." indicates that this message only means the
> job did not work, and is not specific. Some suggestions:
>
> a) Try mpiexec (slightly different arguments); you just edit
> parallel_options. See the example after these suggestions.
> https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
> b) Try an older version of mvapich2 if it is on the system.
> c) Do you have to launch mpdboot on your system? See
> https://wiki.calculquebec.ca/w/MVAPICH2/en
> d) Talk to a sysadmin, particularly the one who set up mvapich.
> e) Do "cat *.error"; maybe something else went wrong, or it is not mpi's
> fault but a user error.
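>
> For (a), a minimal sketch of what the Hydra-style line could look like
> (assuming mpiexec sits in the same MVAPICH2 install; check the exact flags
> with "mpiexec -help"):
>
> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f _HOSTS_ _EXEC_"
>
> WIEN2k substitutes _NP_, _HOSTS_ and _EXEC_ at run time, so only this one
> line in parallel_options changes.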
>
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought"
> Albert Szent-Gyorgi
> On Apr 28, 2015 10:17 PM, "lung Fermin" <ferminlung at gmail.com> wrote:
>
>>  Thanks for Prof. Marks' comment.
>>
>> 1. In the previous email, I missed copying the line
>>
>> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile
>> _HOSTS_ _EXEC_"
>> It was in parallel_options. Sorry about that.
>>
>> 2. I have checked that the running program was lapw1c_mpi. Besides, when
>> the mpi calculation was done on a single node for another system, the
>> results were consistent with the literature. So I believe the mpi code has
>> been set up and compiled properly.
>>
>> Could there be something wrong with my options in siteconfig? Do I have to
>> set some command to bind the job? Is there any other possible cause of the
>> error?
>>
>> Any suggestions or comments would be appreciated. Thanks.
>>
>>
>>  Regards,
>>
>> Fermin
>>
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> You appear to be missing the line
>>
>> setenv WIEN_MPIRUN ...
>>
>> This is set up when you run siteconfig, and provides the information on
>> how mpi is run on your system.
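>>
>> For illustration only (the exact command and flags depend on the MPI
>> installation), such a line looks like
>>
>> setenv WIEN_MPIRUN "mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
>>
>> where _NP_, _HOSTS_ and _EXEC_ are placeholders that WIEN2k replaces with
>> the process count, the machine file and the executable of each parallel job.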
>>
>> N.B., did you set up and compile the mpi code?
>>
>> ___________________________
>> Professor Laurence Marks
>> Department of Materials Science and Engineering
>> Northwestern University
>> www.numis.northwestern.edu
>> MURI4D.numis.northwestern.edu
>> Co-Editor, Acta Cryst A
>> "Research is to see what everybody else has seen, and to think what
>> nobody else has thought"
>> Albert Szent-Gyorgi
>>
>> On Apr 28, 2015 4:22 AM, "lung Fermin" <ferminlung at gmail.com> wrote:
>>
>> Dear Wien2k community,
>>
>>
>>
>> I am trying to perform a calculation on a system of ~100 inequivalent
>> atoms using mpi + k-point parallelization on a cluster. Everything goes
>> fine when the program is run on a single node. However, when I perform the
>> calculation across different nodes, the following error occurs. How can I
>> solve this problem? I am a newbie to mpi programming; any help would be
>> appreciated. Thanks.
>>
>>
>>
>> The error message (MVAPICH2 2.0a):
>>
>>
>> ---------------------------------------------------------------------------------------------------
>>
>> Warning: no access to tty (Bad file descriptor).
>>
>> Thus no job control in this shell.
>>
>> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
>> z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>>
>> number of processors: 32
>>
>>  LAPW0 END
>>
>> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13
>> aborted: Error while reading a PMI socket (4)
>>
>> [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
>> terminated with signal 9 -> abort job
>>
>> [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8.
>> MPI process died?
>>
>> [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>>
>> [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12.
>> MPI process died?
>>
>> [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
>> process died?
>>
>> [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
>> terminated with signal 9 -> abort job
>>
>> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
>> aborted: MPI process error (1)
>>
>> [cli_15]: aborting job:
>>
>> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15
>>
>>
>>
>> >   stop error
>>
>>
>> ------------------------------------------------------------------------------------------------------
>>
>>
>>
>> The .machines file:
>>
>> #
>>
>> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
>>
>> 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>>
>> granularity:1
>>
>> extrafine:1
>>
>>
>> --------------------------------------------------------------------------------------------------------
>>
>> The parallel_options:
>>
>>
>>
>> setenv TASKSET "no"
>>
>> setenv USE_REMOTE 0
>>
>> setenv MPI_REMOTE 1
>>
>> setenv WIEN_GRANULARITY 1
>>
>>
>>
>>
>> --------------------------------------------------------------------------------------------------------
>>
>>
>>
>> Thanks.
>>
>>
>>
>> Regards,
>>
>> Fermin
>>
>