<p dir="ltr">As an addendum, the calculation may be too big for a single node. How much memory does the node have, what is the RKMAX, the smallest RMT & unit cell size? Maybe use in your machines file</p>
<p dir="ltr">1:z1-2:16 z1-13:16<br>
lapw0: z1-2:16 z1-13:16<br>
granularity:1<br>
extrafine:1</p>
<p dir="ltr">Check the size using <br>
x lapw1 -c -p -nmat_only<br>
cat *.nmat*<br><br></p>
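<p dir="ltr">As a rough sketch of the arithmetic (assuming the complex case, lapw1c_mpi, where the Hamiltonian and overlap matrices are each NMAT x NMAT double complex, i.e. 16 bytes per element), the two matrices alone need about 2*16*NMAT^2 bytes, and this must fit in the combined memory of the nodes running the mpi job:<br>
# e.g. NMAT = 50000 -> about 80 GB for H + S, before any workspace<br>
echo 50000 | awk '{printf "%.1f GB\n", 2*16*$1*$1/1e9}'</p>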
<p dir="ltr">___________________________<br>
Professor Laurence Marks<br>
Department of Materials Science and Engineering<br>
Northwestern University<br>
<a href="http://www.numis.northwestern.edu">www.numis.northwestern.edu</a><br>
<a href="http://MURI4D.numis.northwestern.edu">MURI4D.numis.northwestern.edu</a><br>
Co-Editor, Acta Cryst A<br>
"Research is to see what everybody else has seen, and to think what nobody else has thought"<br>
Albert Szent-Györgyi</p>
<div class="gmail_quote">On Apr 28, 2015 10:45 PM, "Laurence Marks" <<a href="mailto:L-marks@northwestern.edu">L-marks@northwestern.edu</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p dir="ltr">Unfortunately it is hard to know what is going on. A google search on "Error while reading PMI socket." indicates that the message you have means it did not work, and is not specific. Some suggestions:</p>
<p dir="ltr">a) Try mpiexec (slightly different arguments). You just edit parallel_options. <a href="https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager" target="_blank">https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager</a><br>
b) Try an older version of mvapich2 if it is on the system.<br>
c) Do you have to launch mpdboot for your system <a href="https://wiki.calculquebec.ca/w/MVAPICH2/en" target="_blank">https://wiki.calculquebec.ca/w/MVAPICH2/en</a>?<br>
d) Talk to a sys_admin, particularly the one who setup mvapich<br>
e) Do "cat *.error", maybe something else went wrong or it is not mpi's fault but a user error.</p>
<p dir="ltr">___________________________<br>
Professor Laurence Marks<br>
Department of Materials Science and Engineering<br>
Northwestern University<br>
<a href="http://www.numis.northwestern.edu" target="_blank">www.numis.northwestern.edu</a><br>
<a href="http://MURI4D.numis.northwestern.edu" target="_blank">MURI4D.numis.northwestern.edu</a><br>
Co-Editor, Acta Cryst A<br>
"Research is to see what everybody else has seen, and to think what nobody else has thought"<br>
Albert Szent-Györgyi</p>
<div class="gmail_quote">On Apr 28, 2015 10:17 PM, "lung Fermin" <<a href="mailto:ferminlung@gmail.com" target="_blank">ferminlung@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div dir="ltr">
<p>Thanks for Prof. Marks' comment.</p>
<p>1. In the previous email, I missed copying the line</p>
<p>setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"</p>
<div>It was in the parallel_options file. Sorry about that.</div>
<p>2. I have checked that the running program was lapw1c_mpi. Besides, when the mpi calculation was done on a single node for some other system, the results were consistent with the literature. So I believe that the mpi code has been set up and compiled properly.<br>
</p>
<p>Would there be something wrong with my option in siteconfig..? Do I have to set some command to bind the job? Any other possible cause of the error?</p>
<p>Any suggestions or comments would be appreciated. Thanks.</p>
<p>Regards,</p>
<p>Fermin</p>
<p>----------------------------------------------------------------------------------------------------<br>
</p>
<p><span lang="EN-US">You appear to be missing the line</span></p>
<p><span lang="EN-US">setenv WIEN_MPIRUN=...</span></p>
<p><span lang="EN-US">This is setup when you run siteconfig, and provides the information on how mpi is run on your system.</span></p>
<p><span lang="EN-US">N.B., did you setup and compile the mpi code?</span></p>
<p><span lang="EN-US">___________________________<br>
Professor Laurence Marks<br>
Department of Materials Science and Engineering<br>
Northwestern University<br>
<a href="http://www.numis.northwestern.edu" target="_blank">www.numis.northwestern.edu</a><br>
<a href="http://MURI4D.numis.northwestern.edu" target="_blank">MURI4D.numis.northwestern.edu</a><br>
Co-Editor, Acta Cryst A<br>
"Research is to see what everybody else has seen, and to think what nobody else has thought"<br>
Albert Szent-Györgyi</span></p>
<p class="MsoNormal"><span lang="EN-US">On Apr 28, 2015 4:22 AM, "lung Fermin" <<a href="mailto:ferminlung@gmail.com" target="_blank">ferminlung@gmail.com</a>> wrote:</span></p>
<p class="MsoNormal"><span lang="EN-US">Dear Wien2k community,</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">I am trying to perform calculation on a system of ~100 in-equivalent atoms using mpi+k point parallelization on a cluster. Everything goes fine when the program was run on a single node. However, if I perform the calculation
across different nodes, the follow error occurs. How to solve this problem? I am a newbie to mpi programming, any help would be appreciated. Thanks.</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">The error message (MVAPICH2 2.0a):</span></p>
<p class="MsoNormal"><span lang="EN-US">---------------------------------------------------------------------------------------------------</span></p>
<p class="MsoNormal"><span lang="EN-US">Warning: no access to tty (Bad file descriptor).</span></p>
<p class="MsoNormal"><span lang="EN-US">Thus no job control in this shell.</span></p>
<p class="MsoNormal"><span lang="EN-US">z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1</span></p>
<p class="MsoNormal"><span lang="EN-US">-13 z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p class="MsoNormal"><span lang="EN-US">number of processors: 32</span></p>
<p class="MsoNormal"><span lang="EN-US"> LAPW0 END</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13 aborted: Error while reading a PMI socket (4)</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546) terminated with signal 9 -> abort job</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12. MPI process died?</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454) terminated with signal 9 -> abort job</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2 aborted: MPI process error (1)</span></p>
<p class="MsoNormal"><span lang="EN-US">[cli_15]: aborting job:</span></p>
<p class="MsoNormal"><span lang="EN-US">application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">> stop error</span></p>
<p class="MsoNormal"><span lang="EN-US">------------------------------------------------------------------------------------------------------</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">The .machines file:</span></p>
<p class="MsoNormal"><span lang="EN-US">#</span></p>
<p class="MsoNormal"><span lang="EN-US">1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2</span></p>
<p class="MsoNormal"><span lang="EN-US">1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p class="MsoNormal"><span lang="EN-US">granularity:1</span></p>
<p class="MsoNormal"><span lang="EN-US">extrafine:1</span></p>
<p class="MsoNormal"><span lang="EN-US">--------------------------------------------------------------------------------------------------------</span></p>
<p class="MsoNormal"><span lang="EN-US">The parallel_options:</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">setenv TASKSET "no"</span></p>
<p class="MsoNormal"><span lang="EN-US">setenv USE_REMOTE 0</span></p>
<p class="MsoNormal"><span lang="EN-US">setenv MPI_REMOTE 1</span></p>
<p class="MsoNormal"><span lang="EN-US">setenv WIEN_GRANULARITY 1</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">--------------------------------------------------------------------------------------------------------</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">Thanks.</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">Regards,</span></p>
<p class="MsoNormal"><span lang="EN-US">Fermin</span></p>
</div>
</div>
</blockquote></div>
</blockquote></div>