<p dir="ltr">I suspect that there is something wrong with your IB and/or how it has been installed. I doubt anyone on the list can help you as it sounds like an OS problem. If you provide the struct file someone might be able to check that it is not a setup problem.</p>
<p dir="ltr">1) Try mpiexec<br>
2) Post to the mvapich2 list.<br>
3) Get help from your sys admin. </p>
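<p dir="ltr">For (1), a minimal sketch of the change, assuming your MVAPICH2 was built with the Hydra process manager and that mpiexec sits under the same prefix as the mpirun line quoted further down (the exact path and flags on your cluster may differ): edit $WIENROOT/parallel_options and replace the WIEN_MPIRUN line with something like<br>
setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -n _NP_ -f _HOSTS_ _EXEC_"<br>
then rerun the parallel job.</p>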
<p dir="ltr">___________________________<br>
Professor Laurence Marks<br>
Department of Materials Science and Engineering<br>
Northwestern University<br>
<a href="http://www.numis.northwestern.edu">www.numis.northwestern.edu</a><br>
<a href="http://MURI4D.numis.northwestern.edu">MURI4D.numis.northwestern.edu</a><br>
Co-Editor, Acta Cryst A<br>
"Research is to see what everybody else has seen, and to think what nobody else has thought"<br>
Albert Szent-Gyorgi</p>
<div class="gmail_quote">On May 3, 2015 10:19 PM, "lung Fermin" <<a href="mailto:ferminlung@gmail.com">ferminlung@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div dir="ltr">I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for distributing the mpi job. However, the problem still persist... but the error message looks different this time:
<div><br>
</div>
<div>
<div>$> cat *.error</div>
<div>Error in LAPW2</div>
<div>** testerror: Error in Parallel LAPW2</div>
</div>
<div><br>
</div>
<div>and the output on screen:</div>
<div>
<div>
<div>Warning: no access to tty (Bad file descriptor).</div>
<div>Thus no job control in this shell.</div>
<div>z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18</div>
<div>number of processors: 32</div>
<div> LAPW0 END</div>
<div>[16] Failed to dealloc pd (Device or resource busy)</div>
<div>[0] Failed to dealloc pd (Device or resource busy)</div>
<div>[17] Failed to dealloc pd (Device or resource busy)</div>
<div>[2] Failed to dealloc pd (Device or resource busy)</div>
<div>[18] Failed to dealloc pd (Device or resource busy)</div>
<div>[1] Failed to dealloc pd (Device or resource busy)</div>
<div> LAPW1 END</div>
<div>LAPW2 - FERMI; weighs written</div>
<div>[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated with signal 9 -> abort job</div>
<div>[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI process died?</div>
<div>[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?</div>
<div>[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 aborted: Error while reading a PMI socket (4)</div>
<div>[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?</div>
<div>[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?</div>
<div>[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?</div>
<div>cp: cannot stat `.in.tmp': No such file or directory</div>
<div><br>
</div>
<div>> stop error</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div>------------------------------------------------------------------------------------------------------------</div>
<div>
<p><span lang="EN-US">Try setting</span></p>
<p><span lang="EN-US">setenv MPI_REMOTE 0</span></p>
<p><span lang="EN-US">in parallel options.</span></p>
<p><span lang="EN-US"> </span></p>
<p><span lang="EN-US">Am 29.04.2015 um 09:44 schrieb lung Fermin:</span></p>
<p><span lang="EN-US">> Thanks for your comment, Prof. Marks.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Each node on the cluster has 32GB memory and each core (16 in total)
</span></p>
<p><span lang="EN-US">> on the node is limited to 2GB of memory usage. For the current system,
</span></p>
<p><span lang="EN-US">> I used RKMAX=6, and the smallest RMT=2.25.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> I have tested the calculation with single k point and mpi on 16 cores
</span></p>
<p><span lang="EN-US">> within a node. The matrix size from</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> $ cat *.nmat_only</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> is 29138</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Does this mean that the number of matrix elements is 29138 or (29138)^2?</span></p>
<p><span lang="EN-US">> In general, how shall I estimate the memory required for a calculation?</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> I have also checked the memory usage with "top" on the node. Each core
</span></p>
<p><span lang="EN-US">> has used up ~5% of the memory and this adds up to ~5*16% on the node.</span></p>
<p><span lang="EN-US">> Perhaps the problem is really caused by the overflow of memory.. I am
</span></p>
<p><span lang="EN-US">> now queuing on the cluster to test for the case of mpi over 32 cores
</span></p>
<p><span lang="EN-US">> (2 nodes).</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Thanks.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Regards,</span></p>
<p><span lang="EN-US">> Fermin</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> ----------------------------------------------------------------------</span></p>
<p><span lang="EN-US">> ------------------------------------------</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> As an addendum, the calculation may be too big for a single node. How
</span></p>
<p><span lang="EN-US">> much memory does the node have, what is the RKMAX, the smallest RMT &
</span></p>
<p><span lang="EN-US">> unit cell size? Maybe use in your machines file</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> 1:z1-2:16 z1-13:16</span></p>
<p><span lang="EN-US">> lapw0: z1-2:16 z1-13:16</span></p>
<p><span lang="EN-US">> granularity:1</span></p>
<p><span lang="EN-US">> extrafine:1</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Check the size using</span></p>
<p><span lang="EN-US">> x law1 -c -p -nmat_only</span></p>
<p><span lang="EN-US">> cat *.nmat</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> ___________________________</span></p>
<p><span lang="EN-US">> Professor Laurence Marks</span></p>
<p><span lang="EN-US">> Department of Materials Science and Engineering Northwestern
</span></p>
<p><span lang="EN-US">> University <a href="http://www.numis.northwestern.edu" target="_blank">
www.numis.northwestern.edu</a> </span></p>
<p><span lang="EN-US">> <<a href="http://www.numis.northwestern.edu" target="_blank">http://www.numis.northwestern.edu</a>></span></p>
<p><span lang="EN-US">> <a href="http://MURI4D.numis.northwestern.edu" target="_blank">MURI4D.numis.northwestern.edu</a> <<a href="http://MURI4D.numis.northwestern.edu" target="_blank">http://MURI4D.numis.northwestern.edu</a>></span></p>
<p><span lang="EN-US">> Co-Editor, Acta Cryst A</span></p>
<p><span lang="EN-US">> "Research is to see what everybody else has seen, and to think what
</span></p>
<p><span lang="EN-US">> nobody else has thought"</span></p>
<p><span lang="EN-US">> Albert Szent-Gyorgi</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> On Apr 28, 2015 10:45 PM, "Laurence Marks" <<a href="mailto:L-marks@northwestern.edu" target="_blank">L-marks@northwestern.edu</a>
</span></p>
<p><span lang="EN-US">> <<a href="mailto:L-marks@northwestern.edu" target="_blank">mailto:L-marks@northwestern.edu</a>>> wrote:</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Unfortunately it is hard to know what is going on. A google search on
</span></p>
<p><span lang="EN-US">> "Error while reading PMI socket." indicates that the message you have
</span></p>
<p><span lang="EN-US">> means it did not work, and is not specific. Some suggestions:</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> a) Try mpiexec (slightly different arguments). You just edit
</span></p>
<p><span lang="EN-US">> parallel_options.</span></p>
<p><span lang="EN-US">> <a href="https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager" target="_blank">
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager</a></span></p>
<p><span lang="EN-US">> b) Try an older version of mvapich2 if it is on the system.</span></p>
<p><span lang="EN-US">> c) Do you have to launch mpdboot for your system
</span></p>
<p><span lang="EN-US">> <a href="https://wiki.calculquebec.ca/w/MVAPICH2/en" target="_blank">
https://wiki.calculquebec.ca/w/MVAPICH2/en</a>?</span></p>
<p><span lang="EN-US">> d) Talk to a sys_admin, particularly the one who setup mvapich</span></p>
<p><span lang="EN-US">> e) Do "cat *.error", maybe something else went wrong or it is not
</span></p>
<p><span lang="EN-US">> mpi's fault but a user error.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> ___________________________</span></p>
<p><span lang="EN-US">> Professor Laurence Marks</span></p>
<p><span lang="EN-US">> Department of Materials Science and Engineering Northwestern
</span></p>
<p><span lang="EN-US">> University <a href="http://www.numis.northwestern.edu" target="_blank">
www.numis.northwestern.edu</a> </span></p>
<p><span lang="EN-US">> <<a href="http://www.numis.northwestern.edu" target="_blank">http://www.numis.northwestern.edu</a>></span></p>
<p><span lang="EN-US">> <a href="http://MURI4D.numis.northwestern.edu" target="_blank">MURI4D.numis.northwestern.edu</a> <<a href="http://MURI4D.numis.northwestern.edu" target="_blank">http://MURI4D.numis.northwestern.edu</a>></span></p>
<p><span lang="EN-US">> Co-Editor, Acta Cryst A</span></p>
<p><span lang="EN-US">> "Research is to see what everybody else has seen, and to think what
</span></p>
<p><span lang="EN-US">> nobody else has thought"</span></p>
<p><span lang="EN-US">> Albert Szent-Gyorgi</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> On Apr 28, 2015 10:17 PM, "lung Fermin" <<a href="mailto:ferminlung@gmail.com" target="_blank">ferminlung@gmail.com</a>
</span></p>
<p><span lang="EN-US">> <<a href="mailto:ferminlung@gmail.com" target="_blank">mailto:ferminlung@gmail.com</a>>> wrote:</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Thanks for Prof. Marks' comment.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> 1. In the previous email, I have missed to copy the line</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_
</span></p>
<p><span lang="EN-US">> -hostfile _HOSTS_ _EXEC_"</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> It was in the parallel_option. Sorry about that.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> 2. I have checked that the running program was lapw1c_mpi. Besides,
</span></p>
<p><span lang="EN-US">> when the mpi calculation was done on a single node for some other
</span></p>
<p><span lang="EN-US">> system, the results are consistent with the literature. So I believe
</span></p>
<p><span lang="EN-US">> that the mpi code has been setup and compiled properly.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Would there be something wrong with my option in siteconfig..? Do I
</span></p>
<p><span lang="EN-US">> have to set some command to bind the job? Any other possible cause of the error?</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Any suggestions or comments would be appreciated. Thanks.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Regards,</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Fermin</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> ----------------------------------------------------------------------</span></p>
<p><span lang="EN-US">> ------------------------------</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> You appear to be missing the line</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> setenv WIEN_MPIRUN=...</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> This is setup when you run siteconfig, and provides the information on
</span></p>
<p><span lang="EN-US">> how mpi is run on your system.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> N.B., did you setup and compile the mpi code?</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> ___________________________</span></p>
<p><span lang="EN-US">> Professor Laurence Marks</span></p>
<p><span lang="EN-US">> Department of Materials Science and Engineering Northwestern
</span></p>
<p><span lang="EN-US">> University <a href="http://www.numis.northwestern.edu" target="_blank">
www.numis.northwestern.edu</a> </span></p>
<p><span lang="EN-US">> <<a href="http://www.numis.northwestern.edu" target="_blank">http://www.numis.northwestern.edu</a>></span></p>
<p><span lang="EN-US">> <a href="http://MURI4D.numis.northwestern.edu" target="_blank">MURI4D.numis.northwestern.edu</a> <<a href="http://MURI4D.numis.northwestern.edu" target="_blank">http://MURI4D.numis.northwestern.edu</a>></span></p>
<p><span lang="EN-US">> Co-Editor, Acta Cryst A</span></p>
<p><span lang="EN-US">> "Research is to see what everybody else has seen, and to think what
</span></p>
<p><span lang="EN-US">> nobody else has thought"</span></p>
<p><span lang="EN-US">> Albert Szent-Gyorgi</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> On Apr 28, 2015 4:22 AM, "lung Fermin" <<a href="mailto:ferminlung@gmail.com" target="_blank">ferminlung@gmail.com</a>
</span></p>
<p><span lang="EN-US">> <<a href="mailto:ferminlung@gmail.com" target="_blank">mailto:ferminlung@gmail.com</a>>> wrote:</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Dear Wien2k community,</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> I am trying to perform calculation on a system of ~100 in-equivalent
</span></p>
<p><span lang="EN-US">> atoms using mpi+k point parallelization on a cluster. Everything goes
</span></p>
<p><span lang="EN-US">> fine when the program was run on a single node. However, if I perform
</span></p>
<p><span lang="EN-US">> the calculation across different nodes, the follow error occurs. How
</span></p>
<p><span lang="EN-US">> to solve this problem? I am a newbie to mpi programming, any help
</span></p>
<p><span lang="EN-US">> would be appreciated. Thanks.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> The error message (MVAPICH2 2.0a):</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> ----------------------------------------------------------------------</span></p>
<p><span lang="EN-US">> -----------------------------</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Warning: no access to tty (Bad file descriptor).</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> Thus no job control in this shell.</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2</span></p>
<p><span lang="EN-US">> z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
</span></p>
<p><span lang="EN-US">> z1</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> -13 z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> number of processors: 32</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> LAPW0 END</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node</span></p>
<p><span lang="EN-US">> z1-13 aborted: Error while reading a PMI socket (4)</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
</span></p>
<p><span lang="EN-US">> terminated with signal 9 -> abort job</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
</span></p>
<p><span lang="EN-US">> 8. MPI process died?</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
</span></p>
<p><span lang="EN-US">> MPI process died?</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
</span></p>
<p><span lang="EN-US">> 12. MPI process died?</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
</span></p>
<p><span lang="EN-US">> MPI process died?</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
</span></p>
<p><span lang="EN-US">> terminated with signal 9 -> abort job</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
</span></p>
<p><span lang="EN-US">> z1-2</span></p>
<p><span lang="EN-US">> aborted: MPI process error (1)</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> [cli_15]: aborting job:</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">>> stop error</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> ----------------------------------------------------------------------</span></p>
<p><span lang="EN-US">> --------------------------------</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> The .machines file:</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> #</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
</span></p>
<p><span lang="EN-US">> z1-2</span></p>
<p><span lang="EN-US">> z1-2 z1-2</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p><span lang="EN-US">> z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> granularity:1</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> extrafine:1</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> ----------------------------------------------------------------------</span></p>
<p><span lang="EN-US">> ----------------------------------</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> The parallel_options:</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> setenv TASKSET "no"</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> setenv USE_REMOTE 0</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> setenv MPI_REMOTE 1</span></p>
<p><span lang="EN-US">> </span></p>
<p><span lang="EN-US">> setenv WIEN_GRANULARITY 1</span></p>
<p><span lang="EN-US">> </span></p>
<p> <br>
</p>
</div>
</div>
</div>
</div>
</blockquote></div>