<div dir="ltr"><p>Thanks for your comment, Prof. Marks.</p><p>Each node on the cluster has 32GB of memory, and each of the 16 cores on a node is limited to 2GB of memory usage. For the current system I used RKMAX=6, and the smallest RMT is 2.25.</p><p>I have tested the calculation with a single k point and mpi on 16 cores within a node. The matrix size from </p><p>$ cat *.nmat_only</p><p>is 29138.</p><p>Does this mean that the number of matrix elements is 29138 or (29138)^2? In general, how should I estimate the memory required for a calculation? </p><div>I have also checked the memory usage with "top" on the node. Each core used ~5% of the node's memory, which adds up to ~80% across the 16 cores. Perhaps the problem really is caused by running out of memory. I am now queuing on the cluster to test the case of mpi over 32 cores (2 nodes).</div><div><br></div><div>Thanks.</div><div><br></div><div>Regards,</div><div>Fermin</div><div><br></div><p><span lang="EN-US">----------------------------------------------------------------------------------------------------------------</span></p><p><span lang="EN-US">As an addendum, the calculation may be too big for a single
node. How much memory does the node have, and what are the RKMAX, the smallest RMT
&amp; the unit cell size? Maybe use in your machines file</span></p>
<p><span lang="EN-US">1:z1-2:16 z1-13:16<br>
lapw0: z1-2:16 z1-13:16<br>
granularity:1<br>
extrafine:1</span></p>
<p style="margin-bottom:12pt"><span lang="EN-US">Check the size using <br>
x lapw1 -c -p -nmat_only<br>
cat *.nmat_only</span></p>
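<p>[To put numbers on that check: the value in case.nmat_only is the matrix dimension NMAT, not the element count, so each dense matrix holds NMAT^2 complex numbers. A rough back-of-envelope sketch, assuming lapw1c keeps two dense complex*16 matrices (Hamiltonian and overlap) and ignoring eigenvector storage and workspace, which come on top:]</p>

```python
# Back-of-envelope lapw1 memory estimate -- a sketch, not an exact figure.
# Assumption: NMAT from case.nmat_only is the matrix DIMENSION, and the
# complex version keeps two dense complex*16 (16-byte) matrices, H and S.
nmat = 29138                 # from: cat *.nmat_only
bytes_per_elem = 16          # double-precision complex
n_matrices = 2               # Hamiltonian H and overlap S

total_gib = n_matrices * nmat**2 * bytes_per_elem / 1024**3
print(f"H+S together: ~{total_gib:.1f} GiB")   # ~25.3 GiB

# With ScaLAPACK's block-cyclic distribution, each of the NP mpi processes
# holds roughly 1/NP of each matrix (plus workspace on top):
for np_procs in (16, 32):
    print(f"NP={np_procs}: ~{total_gib / np_procs:.2f} GiB per process")
```

<p>[On these assumptions, 16 processes with a 2GB-per-core limit would be right at the edge once workspace is added, which is consistent with the run dying on one node but being worth retrying over 32 cores.]</p>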
<p><span lang="EN-US">___________________________<br>
Professor Laurence Marks<br>
Department of Materials Science and Engineering<br>
Northwestern University<br>
<a href="http://www.numis.northwestern.edu">www.numis.northwestern.edu</a><br>
<a href="http://MURI4D.numis.northwestern.edu">MURI4D.numis.northwestern.edu</a><br>
Co-Editor, Acta Cryst A<br>
"Research is to see what everybody else has seen, and to think what nobody
else has thought"<br>
Albert Szent-Györgyi</span></p>
<p class="MsoNormal"><span lang="EN-US">On Apr 28, 2015 10:45 PM, "Laurence
Marks" <<a href="mailto:L-marks@northwestern.edu">L-marks@northwestern.edu</a>>
wrote:</span></p>
<p><span lang="EN-US">Unfortunately it is hard to know what is going on. A Google
search on "Error while reading PMI socket." indicates that this
message just means the run failed; it is not specific. Some suggestions:</span></p>
<p><span lang="EN-US">a) Try mpiexec (slightly different arguments). You just
edit parallel_options. <a href="https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager" target="_blank">https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager</a><br>
b) Try an older version of mvapich2 if it is on the system.<br>
c) Do you have to launch mpdboot for your system <a href="https://wiki.calculquebec.ca/w/MVAPICH2/en" target="_blank">https://wiki.calculquebec.ca/w/MVAPICH2/en</a>?<br>
d) Talk to a sys_admin, particularly the one who set up mvapich<br>
e) Do "cat *.error", maybe something else went wrong or it is not
mpi's fault but a user error.</span></p>
<p class="MsoNormal"><span lang="EN-US">On Apr 28, 2015 10:17 PM, "lung
Fermin" <<a href="mailto:ferminlung@gmail.com" target="_blank">ferminlung@gmail.com</a>>
wrote:</span></p>
<p><span lang="EN-US">Thanks for Prof. Marks' comment.</span></p>
<p><span lang="EN-US">1. In the previous email, I missed copying the line</span></p>
<p><span lang="EN-US">setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun
-np _NP_ -hostfile _HOSTS_ _EXEC_"</span></p>
<p class="MsoNormal"><span lang="EN-US">It was in parallel_options. Sorry about
that.</span></p>
<p><span lang="EN-US">2. I have checked that the running program was lapw1c_mpi.
Besides, when the mpi calculation was done on a single node for some other
system, the results were consistent with the literature. So I believe that the
mpi code has been set up and compiled properly. </span></p>
<p><span lang="EN-US">Could there be something wrong with my options in
siteconfig? Do I have to set some command to bind the job? Are there any other
possible causes of the error?</span></p>
<p><span lang="EN-US">Any suggestions or comments would be appreciated. Thanks.</span></p>
<p><span lang="EN-US"> </span></p>
<p><span lang="EN-US">Regards,</span></p>
<p><span lang="EN-US">Fermin</span></p>
<p><span lang="EN-US">----------------------------------------------------------------------------------------------------</span></p>
<p><span lang="EN-US">You appear to be missing the line</span></p>
<p><span lang="EN-US">setenv WIEN_MPIRUN=...</span></p>
<p><span lang="EN-US">This is set up when you run siteconfig, and provides the
information on how mpi is run on your system.</span></p>
<p><span lang="EN-US">N.B., did you set up and compile the mpi code?</span></p>
<p class="MsoNormal"><span lang="EN-US">On Apr 28, 2015 4:22 AM, "lung Fermin" <<a href="mailto:ferminlung@gmail.com" target="_blank">ferminlung@gmail.com</a>>
wrote:</span></p>
<p class="MsoNormal"><span lang="EN-US">Dear Wien2k community,</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">I am trying to perform a calculation on a system of ~100 inequivalent
atoms using mpi + k-point parallelization on a cluster. Everything goes fine when
the program is run on a single node. However, when I run the calculation
across different nodes, the following error occurs. How can I solve this problem? I
am a newbie to mpi programming, so any help would be appreciated. Thanks.</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">The error message (MVAPICH2 2.0a):</span></p>
<p class="MsoNormal"><span lang="EN-US">---------------------------------------------------------------------------------------------------</span></p>
<p class="MsoNormal"><span lang="EN-US">Warning: no access to tty (Bad file descriptor).</span></p>
<p class="MsoNormal"><span lang="EN-US">Thus no job control in this shell.</span></p>
<p class="MsoNormal"><span lang="EN-US">z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-2 z1-2 z1-2 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1</span></p>
<p class="MsoNormal"><span lang="EN-US">-13 z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p class="MsoNormal"><span lang="EN-US">number of processors: 32</span></p>
<p class="MsoNormal"><span lang="EN-US"> LAPW0 END</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
z1-13 aborted: Error while reading a PMI socket (4)</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
terminated with signal 9 -> abort job</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-13:mpispawn_0][readline] Unexpected End-Of-File on file
descriptor 8. MPI process died?</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
MPI process died?</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpispawn_0][readline] Unexpected End-Of-File on file
descriptor 12. MPI process died?</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
MPI process died?</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
terminated with signal 9 -> abort job</span></p>
<p class="MsoNormal"><span lang="EN-US">[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
z1-2 aborted: MPI process error (1)</span></p>
<p class="MsoNormal"><span lang="EN-US">[cli_15]: aborting job:</span></p>
<p class="MsoNormal"><span lang="EN-US">application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">> stop error</span></p>
<p class="MsoNormal"><span lang="EN-US">------------------------------------------------------------------------------------------------------</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">The .machines file:</span></p>
<p class="MsoNormal"><span lang="EN-US">#</span></p>
<p class="MsoNormal"><span lang="EN-US">1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-2 z1-2 z1-2</span></p>
<p class="MsoNormal"><span lang="EN-US">1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p class="MsoNormal"><span lang="EN-US">granularity:1</span></p>
<p class="MsoNormal"><span lang="EN-US">extrafine:1</span></p>
<p class="MsoNormal"><span lang="EN-US">--------------------------------------------------------------------------------------------------------</span></p>
<p class="MsoNormal"><span lang="EN-US">The parallel_options:</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">setenv TASKSET "no"</span></p>
<p class="MsoNormal"><span lang="EN-US">setenv USE_REMOTE 0</span></p>
<p class="MsoNormal"><span lang="EN-US">setenv MPI_REMOTE 1</span></p>
<p class="MsoNormal"><span lang="EN-US">setenv WIEN_GRANULARITY 1</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">--------------------------------------------------------------------------------------------------------</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">Thanks.</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">Regards,</span></p>
<p class="MsoNormal"><span lang="EN-US">Fermin</span></p></div>