<div dir="ltr">I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for distributing the mpi job. However, the problem still persist... but the error message looks different this time:<div><br></div><div><div>$> cat *.error</div><div>Error in LAPW2</div><div>** testerror: Error in Parallel LAPW2</div></div><div><br></div><div>and the output on screen:</div><div><div><div>Warning: no access to tty (Bad file descriptor).</div><div>Thus no job control in this shell.</div><div>z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-1</div><div>8 z1-18 z1-18</div><div>number of processors: 32</div><div> LAPW0 END</div><div>[16] Failed to dealloc pd (Device or resource busy)</div><div>[0] Failed to dealloc pd (Device or resource busy)</div><div>[17] Failed to dealloc pd (Device or resource busy)</div><div>[2] Failed to dealloc pd (Device or resource busy)</div><div>[18] Failed to dealloc pd (Device or resource busy)</div><div>[1] Failed to dealloc pd (Device or resource busy)</div><div> LAPW1 END</div><div>LAPW2 - FERMI; weighs written</div><div>[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated with signal 9 -> abort job</div><div>[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI process died?</div><div>[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?</div><div>[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 aborted: Error while reading a PMI socket (4)</div><div>[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?</div><div>[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?</div><div>[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?</div><div>cp: cannot stat `.in.tmp': No such file or directory</div><div><br></div><div>> stop error</div></div><div><br></div><div><br></div><div>------------------------------------------------------------------------------------------------------------</div><div><p class=""><span lang="EN-US">Try setting</span></p>
<p class=""><span lang="EN-US">setenv MPI_REMOTE 0</span></p>
<p class=""><span lang="EN-US">in parallel options.</span></p>
<p class=""><span lang="EN-US"> </span></p>
<p class=""><span lang="EN-US">Am 29.04.2015 um 09:44 schrieb lung
Fermin:</span></p>
<p class=""><span lang="EN-US">> Thanks for your comment, Prof.
Marks.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Each node on the cluster has 32GB
memory and each core (16 in total) </span></p>
<p class=""><span lang="EN-US">> on the node is limited to 2GB of
memory usage. For the current system, </span></p>
<p class=""><span lang="EN-US">> I used RKMAX=6, and the smallest RMT=2.25.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> I have tested the calculation with
single k point and mpi on 16 cores </span></p>
<p class=""><span lang="EN-US">> within a node. The matrix size from</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> $ cat *.nmat_only</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> is 29138</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Does this mean that the number of
matrix elements is 29138 or (29138)^2?</span></p>
<p class=""><span lang="EN-US">> In general, how shall I estimate
the memory required for a calculation?</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> I have also checked the memory
usage with "top" on the node. Each core </span></p>
<p class=""><span lang="EN-US">> has used up ~5% of the memory and
this adds up to ~5*16% on the node.</span></p>
<p class=""><span lang="EN-US">> Perhaps the problem is really
caused by the overflow of memory.. I am </span></p>
<p class=""><span lang="EN-US">> now queuing on the cluster to test
for the case of mpi over 32 cores </span></p>
<p class=""><span lang="EN-US">> (2 nodes).</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Thanks.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Regards,</span></p>
<p class=""><span lang="EN-US">> Fermin</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
----------------------------------------------------------------------</span></p>
<p class=""><span lang="EN-US">>
------------------------------------------</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> As an addendum, the calculation may
be too big for a single node. How </span></p>
<p class=""><span lang="EN-US">> much memory does the node have,
what is the RKMAX, the smallest RMT & </span></p>
<p class=""><span lang="EN-US">> unit cell size? Maybe use in your
machines file</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> 1:z1-2:16 z1-13:16</span></p>
<p class=""><span lang="EN-US">> lapw0: z1-2:16 z1-13:16</span></p>
<p class=""><span lang="EN-US">> granularity:1</span></p>
<p class=""><span lang="EN-US">> extrafine:1</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Check the size using</span></p>
<p class=""><span lang="EN-US">> x law1 -c -p -nmat_only</span></p>
<p class=""><span lang="EN-US">> cat *.nmat</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> ___________________________</span></p>
<p class=""><span lang="EN-US">> Professor Laurence Marks</span></p>
<p class=""><span lang="EN-US">> Department of Materials Science and
Engineering Northwestern </span></p>
<p class=""><span lang="EN-US">> University <a href="http://www.numis.northwestern.edu">www.numis.northwestern.edu</a> </span></p>
<p class=""><span lang="EN-US">> <<a href="http://www.numis.northwestern.edu">http://www.numis.northwestern.edu</a>></span></p>
<p class=""><span lang="EN-US">> <a href="http://MURI4D.numis.northwestern.edu">MURI4D.numis.northwestern.edu</a> <<a href="http://MURI4D.numis.northwestern.edu">http://MURI4D.numis.northwestern.edu</a>></span></p>
<p class=""><span lang="EN-US">> Co-Editor, Acta Cryst A</span></p>
<p class=""><span lang="EN-US">> "Research is to see what
everybody else has seen, and to think what </span></p>
<p class=""><span lang="EN-US">> nobody else has thought"</span></p>
<p class=""><span lang="EN-US">> Albert Szent-Gyorgi</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> On Apr 28, 2015 10:45 PM,
"Laurence Marks" <<a href="mailto:L-marks@northwestern.edu">L-marks@northwestern.edu</a> </span></p>
<p class=""><span lang="EN-US">> <<a href="mailto:L-marks@northwestern.edu">mailto:L-marks@northwestern.edu</a>>>
wrote:</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Unfortunately it is hard to know
what is going on. A google search on </span></p>
<p class=""><span lang="EN-US">> "Error while reading PMI
socket." indicates that the message you have </span></p>
<p class=""><span lang="EN-US">> means it did not work, and is not
specific. Some suggestions:</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> a) Try mpiexec (slightly different
arguments). You just edit </span></p>
<p class=""><span lang="EN-US">> parallel_options.</span></p>
<p class=""><span lang="EN-US">> <a href="https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager">https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager</a></span></p>
<p class=""><span lang="EN-US">> b) Try an older version of mvapich2
if it is on the system.</span></p>
<p class=""><span lang="EN-US">> c) Do you have to launch mpdboot
for your system </span></p>
<p class=""><span lang="EN-US">> <a href="https://wiki.calculquebec.ca/w/MVAPICH2/en">https://wiki.calculquebec.ca/w/MVAPICH2/en</a>?</span></p>
<p class=""><span lang="EN-US">> d) Talk to a sys_admin,
particularly the one who setup mvapich</span></p>
<p class=""><span lang="EN-US">> e) Do "cat *.error",
maybe something else went wrong or it is not </span></p>
<p class=""><span lang="EN-US">> mpi's fault but a user error.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> ___________________________</span></p>
<p class=""><span lang="EN-US">> Professor Laurence Marks</span></p>
<p class=""><span lang="EN-US">> Department of Materials Science and
Engineering Northwestern </span></p>
<p class=""><span lang="EN-US">> University <a href="http://www.numis.northwestern.edu">www.numis.northwestern.edu</a> </span></p>
<p class=""><span lang="EN-US">> <<a href="http://www.numis.northwestern.edu">http://www.numis.northwestern.edu</a>></span></p>
<p class=""><span lang="EN-US">> <a href="http://MURI4D.numis.northwestern.edu">MURI4D.numis.northwestern.edu</a> <<a href="http://MURI4D.numis.northwestern.edu">http://MURI4D.numis.northwestern.edu</a>></span></p>
<p class=""><span lang="EN-US">> Co-Editor, Acta Cryst A</span></p>
<p class=""><span lang="EN-US">> "Research is to see what everybody
else has seen, and to think what </span></p>
<p class=""><span lang="EN-US">> nobody else has thought"</span></p>
<p class=""><span lang="EN-US">> Albert Szent-Gyorgi</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> On Apr 28, 2015 10:17 PM,
"lung Fermin" <<a href="mailto:ferminlung@gmail.com">ferminlung@gmail.com</a> </span></p>
<p class=""><span lang="EN-US">> <<a href="mailto:ferminlung@gmail.com">mailto:ferminlung@gmail.com</a>>>
wrote:</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Thanks for Prof. Marks' comment.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> 1. In the previous email, I have
missed to copy the line</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> setenv WIEN_MPIRUN
"/usr/local/mvapich2-icc/bin/mpirun -np _NP_ </span></p>
<p class=""><span lang="EN-US">> -hostfile _HOSTS_ _EXEC_"</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> It was in the parallel_option.
Sorry about that.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> 2. I have checked that the running
program was lapw1c_mpi. Besides, </span></p>
<p class=""><span lang="EN-US">> when the mpi calculation was done
on a single node for some other </span></p>
<p class=""><span lang="EN-US">> system, the results are consistent
with the literature. So I believe </span></p>
<p class=""><span lang="EN-US">> that the mpi code has been setup
and compiled properly.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Would there be something wrong with
my option in siteconfig..? Do I </span></p>
<p class=""><span lang="EN-US">> have to set some command to bind
the job? Any other possible cause of the error?</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Any suggestions or comments would
be appreciated. Thanks.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Regards,</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Fermin</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
----------------------------------------------------------------------</span></p>
<p class=""><span lang="EN-US">> ------------------------------</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> You appear to be missing the line</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> setenv WIEN_MPIRUN=...</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> This is setup when you run
siteconfig, and provides the information on </span></p>
<p class=""><span lang="EN-US">> how mpi is run on your system.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> N.B., did you setup and compile the
mpi code?</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> ___________________________</span></p>
<p class=""><span lang="EN-US">> Professor Laurence Marks</span></p>
<p class=""><span lang="EN-US">> Department of Materials Science and
Engineering Northwestern </span></p>
<p class=""><span lang="EN-US">> University <a href="http://www.numis.northwestern.edu">www.numis.northwestern.edu</a> </span></p>
<p class=""><span lang="EN-US">> <<a href="http://www.numis.northwestern.edu">http://www.numis.northwestern.edu</a>></span></p>
<p class=""><span lang="EN-US">> <a href="http://MURI4D.numis.northwestern.edu">MURI4D.numis.northwestern.edu</a> <<a href="http://MURI4D.numis.northwestern.edu">http://MURI4D.numis.northwestern.edu</a>></span></p>
<p class=""><span lang="EN-US">> Co-Editor, Acta Cryst A</span></p>
<p class=""><span lang="EN-US">> "Research is to see what
everybody else has seen, and to think what </span></p>
<p class=""><span lang="EN-US">> nobody else has thought"</span></p>
<p class=""><span lang="EN-US">> Albert Szent-Gyorgi</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> On Apr 28, 2015 4:22 AM, "lung
Fermin" <<a href="mailto:ferminlung@gmail.com">ferminlung@gmail.com</a> </span></p>
<p class=""><span lang="EN-US">> <<a href="mailto:ferminlung@gmail.com">mailto:ferminlung@gmail.com</a>>>
wrote:</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Dear Wien2k community,</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> I am trying to perform calculation
on a system of ~100 in-equivalent </span></p>
<p class=""><span lang="EN-US">> atoms using mpi+k point
parallelization on a cluster. Everything goes </span></p>
<p class=""><span lang="EN-US">> fine when the program was run on a
single node. However, if I perform </span></p>
<p class=""><span lang="EN-US">> the calculation across different
nodes, the follow error occurs. How </span></p>
<p class=""><span lang="EN-US">> to solve this problem? I am a
newbie to mpi programming, any help </span></p>
<p class=""><span lang="EN-US">> would be appreciated. Thanks.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> The error message (MVAPICH2 2.0a):</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
----------------------------------------------------------------------</span></p>
<p class=""><span lang="EN-US">> -----------------------------</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Warning: no access to tty (Bad file
descriptor).</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> Thus no job control in this shell.</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2</span></p>
<p class=""><span lang="EN-US">> z1-2 z1-2 z1-13 z1-13 z1-13 z1-13
z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 </span></p>
<p class=""><span lang="EN-US">> z1</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> -13 z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> number of processors: 32</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
LAPW0 END</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node</span></p>
<p class=""><span lang="EN-US">> z1-13 aborted: Error while reading
a PMI socket (4)</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> [z1-13:mpispawn_0][child_handler]
MPI process (rank: 11, pid: 8546) </span></p>
<p class=""><span lang="EN-US">> terminated with signal 9 ->
abort job</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> [z1-13:mpispawn_0][readline]
Unexpected End-Of-File on file descriptor </span></p>
<p class=""><span lang="EN-US">> 8. MPI process died?</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
[z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. </span></p>
<p class=""><span lang="EN-US">> MPI process died?</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> [z1-2:mpispawn_0][readline]
Unexpected End-Of-File on file descriptor </span></p>
<p class=""><span lang="EN-US">> 12. MPI process died?</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> [z1-2:mpispawn_0][mtpmi_processops]
Error while reading PMI socket. </span></p>
<p class=""><span lang="EN-US">> MPI process died?</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> [z1-2:mpispawn_0][child_handler]
MPI process (rank: 0, pid: 35454) </span></p>
<p class=""><span lang="EN-US">> terminated with signal 9 ->
abort job</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node </span></p>
<p class=""><span lang="EN-US">> z1-2</span></p>
<p class=""><span lang="EN-US">> aborted: MPI process error (1)</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> [cli_15]: aborting job:</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> application called
MPI_Abort(MPI_COMM_WORLD, 0) - process 15</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>> stop error</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
----------------------------------------------------------------------</span></p>
<p class=""><span lang="EN-US">> --------------------------------</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> The .machines file:</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> #</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 </span></p>
<p class=""><span lang="EN-US">> z1-2</span></p>
<p class=""><span lang="EN-US">> z1-2 z1-2</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> 1:z1-13 z1-13 z1-13 z1-13 z1-13
z1-13 z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p class=""><span lang="EN-US">> z1-13 z1-13 z1-13 z1-13 z1-13</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> granularity:1</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> extrafine:1</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">>
----------------------------------------------------------------------</span></p>
<p class=""><span lang="EN-US">> ----------------------------------</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> The parallel_options:</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> setenv TASKSET "no"</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> setenv USE_REMOTE 0</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> setenv MPI_REMOTE 1</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""><span lang="EN-US">> setenv WIEN_GRANULARITY 1</span></p>
<p class=""><span lang="EN-US">> </span></p>
<p class=""> <br></p></div></div></div>