[Wien] Error in mpi+k point parallelization across multiple nodes

Peter Blaha pblaha at theochem.tuwien.ac.at
Mon May 4 07:04:33 CEST 2015


It seems as if lapw0_mpi runs properly. Please check that you have NEW (check the date with ls -als), valid case.vsp/vns files, which can be used in e.g. a sequential lapw1 step.

This suggests that mpi and fftw are ok.

The problems seem to start in lapw1_mpi, and this program requires scalapack in addition to mpi.

I guess you compile with ifort and link with the mkl? There is one crucial blacs library, which must be adapted to your mpi, since the blacs libraries are specific to a particular mpi (intelmpi, openmpi, ...).
Which blacs library do you link? -lmkl_blacs_lp64, or another one?
Check the mkl documentation.
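
To make the point concrete, here is a hypothetical excerpt of the linking options set via siteconfig (a sketch only; the exact library names and the variable layout depend on your mkl and WIEN2k versions). The key point is that the blacs variant must match the MPI actually used at runtime:

```shell
# Hypothetical RP_LIBS excerpt (names vary with mkl/WIEN2k version).
# The BLACS library must match the runtime MPI:
#   MPICH-ABI MPIs (Intel MPI, MVAPICH2):  -lmkl_blacs_intelmpi_lp64
#   Open MPI:                              -lmkl_blacs_openmpi_lp64
RP_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lfftw3_mpi $R_LIBS"
```

Linking the wrong blacs variant typically produces exactly this failure pattern: lapw0_mpi (no scalapack) works, while lapw1_mpi hangs or dies.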


On 04.05.2015 at 05:18, lung Fermin wrote:
> I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for distributing the mpi job. However, the problem still persists... but the error message looks different
> this time:
>
> $> cat *.error
> Error in LAPW2
> **  testerror: Error in Parallel LAPW2
>
> and the output on screen:
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
> number of processors: 32
>   LAPW0 END
> [16] Failed to dealloc pd (Device or resource busy)
> [0] Failed to dealloc pd (Device or resource busy)
> [17] Failed to dealloc pd (Device or resource busy)
> [2] Failed to dealloc pd (Device or resource busy)
> [18] Failed to dealloc pd (Device or resource busy)
> [1] Failed to dealloc pd (Device or resource busy)
>   LAPW1 END
> LAPW2 - FERMI; weighs written
> [z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated with signal 9 -> abort job
> [z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
> [z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 aborted: Error while reading a PMI socket (4)
> [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
> [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
> [z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
> cp: cannot stat `.in.tmp': No such file or directory
>
>  >   stop error
>
>
> ------------------------------------------------------------------------------------------------------------
>
> Try setting
>
> setenv MPI_REMOTE 0
>
> in parallel options.
>
> On 29.04.2015 at 09:44, lung Fermin wrote:
>
>> Thanks for your comment, Prof. Marks.
>>
>> Each node on the cluster has 32GB memory and each core (16 in total) on the node is limited to 2GB of memory usage. For the current system, I used RKMAX=6, and the smallest RMT=2.25.
>>
>> I have tested the calculation with single k point and mpi on 16 cores within a node. The matrix size from
>>
>> $ cat *.nmat_only
>>
>> is 29138
>>
>> Does this mean that the number of matrix elements is 29138 or (29138)^2?
>> In general, how shall I estimate the memory required for a calculation?
>
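
[The value reported by -nmat_only is the matrix dimension, so the Hamiltonian and overlap matrices are 29138 x 29138. A rough back-of-the-envelope sketch of what this implies, assuming dense complex*16 (16-byte) matrix elements; actual lapw1 usage is larger because of workspace:]

```shell
# Sketch: memory for one dense complex*16 matrix of dimension nmat.
# lapw1 stores at least the Hamiltonian and the overlap matrix.
nmat=29138
bytes=$((nmat * nmat * 16))
echo "one matrix : $((bytes / 1024 / 1024)) MiB"
echo "H + S      : $((bytes * 2 / 1024 / 1024)) MiB"
```

That is roughly 13.6 GB per matrix, far beyond a 2GB-per-core limit, which is why such a matrix must be distributed over many mpi ranks via scalapack, and why running on too few nodes can be killed by the out-of-memory handler (signal 9).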
>> I have also checked the memory usage with "top" on the node. Each core has used up ~5% of the memory and this adds up to ~5*16% on the node. Perhaps the problem is really caused by an overflow of memory. I am now queuing on the cluster to test the case of mpi over 32 cores (2 nodes).
>>
>> Thanks.
>>
>> Regards,
>> Fermin
>
>>
>
>> ----------------------------------------------------------------------------------------------------
>>
>> As an addendum, the calculation may be too big for a single node. How much memory does the node have, what is the RKMAX, the smallest RMT & unit cell size? Maybe use in your machines file
>>
>> 1:z1-2:16 z1-13:16
>> lapw0: z1-2:16 z1-13:16
>> granularity:1
>> extrafine:1
>>
>> Check the size using
>>
>> x lapw1 -c -p -nmat_only
>> cat *.nmat_only
>
>>
>
>> ___________________________
>> Professor Laurence Marks
>> Department of Materials Science and Engineering
>> Northwestern University
>> www.numis.northwestern.edu
>> MURI4D.numis.northwestern.edu
>> Co-Editor, Acta Cryst A
>> "Research is to see what everybody else has seen, and to think what nobody else has thought"
>> Albert Szent-Gyorgi
>
>>
>
>> On Apr 28, 2015 10:45 PM, "Laurence Marks" <L-marks at northwestern.edu> wrote:
>>
>> Unfortunately it is hard to know what is going on. A google search on "Error while reading PMI socket." indicates that the message you have means it did not work, and is not specific. Some suggestions:
>>
>> a) Try mpiexec (slightly different arguments). You just edit parallel_options.
>> https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
>> b) Try an older version of mvapich2 if it is on the system.
>> c) Do you have to launch mpdboot for your system?
>> https://wiki.calculquebec.ca/w/MVAPICH2/en
>> d) Talk to a sys_admin, particularly the one who set up mvapich.
>> e) Do "cat *.error", maybe something else went wrong or it is not mpi's fault but a user error.
>
>>
>
>
>> On Apr 28, 2015 10:17 PM, "lung Fermin" <ferminlung at gmail.com> wrote:
>>
>> Thanks for Prof. Marks' comment.
>>
>> 1. In the previous email, I missed copying the line
>>
>> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
>>
>> It was in the parallel_options. Sorry about that.
>>
>> 2. I have checked that the running program was lapw1c_mpi. Besides, when the mpi calculation was done on a single node for some other system, the results are consistent with the literature. So I believe that the mpi code has been set up and compiled properly.
>>
>> Would there be something wrong with my options in siteconfig? Do I have to set some command to bind the job? Any other possible cause of the error?
>>
>> Any suggestions or comments would be appreciated. Thanks.
>>
>> Regards,
>> Fermin
>
>>
>
>> ----------------------------------------------------------------------------------------------------
>>
>> You appear to be missing the line
>>
>> setenv WIEN_MPIRUN=...
>>
>> This is set up when you run siteconfig, and provides the information on how mpi is run on your system.
>>
>> N.B., did you set up and compile the mpi code?
>
>>
>
>
>> On Apr 28, 2015 4:22 AM, "lung Fermin" <ferminlung at gmail.com> wrote:
>>
>> Dear Wien2k community,
>>
>> I am trying to perform a calculation on a system of ~100 in-equivalent atoms using mpi+k point parallelization on a cluster. Everything goes fine when the program is run on a single node. However, if I perform the calculation across different nodes, the following error occurs. How to solve this problem? I am a newbie to mpi programming, any help would be appreciated. Thanks.
>>
>> The error message (MVAPICH2 2.0a):
>>
>> ---------------------------------------------------------------------------------------------------
>
>>
>
>> Warning: no access to tty (Bad file descriptor).
>> Thus no job control in this shell.
>>
>> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>>
>> number of processors: 32
>> LAPW0 END
>> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13 aborted: Error while reading a PMI socket (4)
>> [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546) terminated with signal 9 -> abort job
>> [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
>> [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
>> [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12. MPI process died?
>> [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
>> [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454) terminated with signal 9 -> abort job
>> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2 aborted: MPI process error (1)
>> [cli_15]: aborting job:
>> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15
>>
>>>   stop error
>>
>> ------------------------------------------------------------------------------------------------------
>
>>
>
>> The .machines file:
>>
>> #
>> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
>> 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
>> granularity:1
>> extrafine:1
>>
>> --------------------------------------------------------------------------------------------------------
>>
>> The parallel_options:
>>
>> setenv TASKSET "no"
>> setenv USE_REMOTE 0
>> setenv MPI_REMOTE 1
>> setenv WIEN_GRANULARITY 1
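
[The MPI_REMOTE 0 change suggested earlier in the thread amounts to editing this file to something like the following sketch; the WIEN_MPIRUN line is the one the poster later reported from his parallel_options, and with MPI_REMOTE 0 the mpi jobs are launched locally by mpirun on the hostfile rather than via remote shells:]

```csh
setenv TASKSET "no"
setenv USE_REMOTE 0
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
```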
>
>
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>

-- 
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------

