<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Ok, now it is clear that there are no
additional error messages. Unfortunately, I cannot tell
specifically what went wrong from those error messages.<br>
<br>
You might try replacing mpirun with mpirun_rsh. As you can see at<br>
<br>
<a class="moz-txt-link-freetext" href="http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-May/004402.html">http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-May/004402.html</a><br>
<br>
they replaced mpirun with mpirun_rsh, and it seems that they found
a problem with passwordless ssh. <br>
<br>
In your parallel_options file, you might also want to change
"setenv USE_REMOTE 0" to "setenv USE_REMOTE 1", and then try both 0
and 1 for MPI_REMOTE to check whether any of these configurations
works with mpirun.<br>
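<br>
As a rough sketch only, assuming your parallel_options follows the usual
csh layout of a WIEN2k installation, the edited lines could look like the
following. The commented-out WIEN_MPIRUN line is my guess at how
mpirun_rsh would be called; check the option names against your MVAPICH2
installation before using it.<br>
<pre>
# parallel_options (csh syntax): a sketch, not a copy of your actual file
setenv USE_REMOTE 1    # changed from 0
setenv MPI_REMOTE 0    # run once with 0, then once with 1

# Assumed launcher line if mpirun is replaced by mpirun_rsh.
# _NP_, _HOSTS_ and _EXEC_ are the placeholders WIEN2k fills in itself.
# setenv WIEN_MPIRUN "mpirun_rsh -np _NP_ -hostfile _HOSTS_ _EXEC_"
</pre>
Change only one setting per run, so that a difference in behavior can be
traced back to a single change.<br>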
<br>
On 5/6/2015 11:23 AM, lung Fermin wrote:<br>
</div>
<blockquote
cite="mid:CAFZG4C7WcGmZ9_Z1Tg3+1UHxwucg+O5+rwy-D_T5S7nH_iY2-w@mail.gmail.com"
type="cite">
<div dir="ltr">
<div><font color="#500050"><span style="font-size:14px">Thanks
for the reply. Please see below.</span></font></div>
<div><span class="im" style="font-size:14px"><br>
</span></div>
<span class="im" style="font-size:14px">
<div><br>
</div>
</span>
<div bgcolor="#FFFFFF" text="#000000" style="font-size:14px">>As
I asked before, did you give us all the error information in
the case.dayfile and from standard output? It is not entirely
clear in your previous posts, but it looks to me that you
might have only provided information from the case.dayfile and
the error files (cat *.error), but maybe not from the standard
output. Are you still using the PBS script in your old post
at <a moz-do-not-send="true"
href="http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html"
target="_blank">http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html</a> ?
In the script, I can see that the standard output is set to be
written to a file called wien2k_output.<br>
<br>
<br>
Sorry for the confusion. Yes, I am still using the PBS script from
the link above. What I posted before was from the standard output
(wien2k). When using 2 nodes (32 cores in total) for one k point,
the standard output gives</div>
<div bgcolor="#FFFFFF" text="#000000" style="">
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">----------------------------</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">Warning: no access to tty (Bad file
descriptor).</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">Thus no job control in this shell.</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">number of processors: 32</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px"> LAPW0 END</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[16] Failed to dealloc pd (Device
or resource busy)</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[0] Failed to dealloc pd (Device or
resource busy)</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[17] Failed to dealloc pd (Device
or resource busy)</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[2] Failed to dealloc pd (Device or
resource busy)</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[18] Failed to dealloc pd (Device
or resource busy)</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[1] Failed to dealloc pd (Device or
resource busy)</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px"> LAPW1 END</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">LAPW2 - FERMI; weighs written</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[z1-17:mpispawn_0][child_handler]
MPI process (rank: 0, pid: 28291) terminated with signal 9
-> abort job</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[z1-17:mpispawn_0][readline]
Unexpected End-Of-File on file descriptor 9. MPI process
died?</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[z1-17:mpispawn_0][mtpmi_processops]
Error while reading PMI socket. MPI process died?</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[z1-17:mpirun_rsh][process_mpispawn_connection]
mpispawn_0 from node z1-17 aborted: Error while reading a
PMI socket (4)</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[z1-18:mpispawn_1][read_size]
Unexpected End-Of-File on file descriptor 21. MPI process
died?</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[z1-18:mpispawn_1][read_size]
Unexpected End-Of-File on file descriptor 21. MPI process
died?</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">[z1-18:mpispawn_1][handle_mt_peer]
Error while reading PMI socket. MPI process died?</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">cp: cannot stat `.in.tmp': No such
file or directory</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px"><br>
</span></div>
<div bgcolor="#FFFFFF" text="#000000" style=""><span
style="font-size:14px">> stop error</span></div>
<div style="font-size:14px">-----------------------------</div>
<div style="font-size:14px">And the .dayfile reads:</div>
<div style="font-size:14px"><br>
</div>
<div style="">
<div style=""><span style="font-size:14px">on z1-17 with PID
29439</span></div>
<div style=""><span style="font-size:14px">using WIEN2k_14.2
(Release 15/10/2014) </span></div>
<div style=""><span style="font-size:14px"><br>
</span></div>
<div style=""><span style="font-size:14px"> start
(Thu Apr 30 17:36:59 2015) with lapw0 (40/99 to go)</span></div>
<div style=""><span style="font-size:14px"><br>
</span></div>
<div style=""><span style="font-size:14px"> cycle 1
(Thu Apr 30 17:36:59 2015) (40/99 to go)</span></div>
<div style=""><span style="font-size:14px"><br>
</span></div>
<div style=""><span style="font-size:14px">> lapw0 -p
(17:36:59) starting parallel lapw0 at Thu Apr 30 </span><span
style="font-size:14px">17:36:59 2015</span></div>
<div style=""><span style="font-size:14px">--------
.machine0 : 32 processors</span></div>
<div style=""><span style="font-size:14px">904.074u 8.710s
1:01.54 1483.2% 0+0k 239608+78400io 105pf+0w</span></div>
<div style=""><span style="font-size:14px">> lapw1 -p
-c (17:38:01) starting parallel lapw1 at Thu Apr
30 17:38:01 2015</span></div>
<div style=""><span style="font-size:14px">-> starting
parallel LAPW1 jobs at Thu Apr 30 17:38:01 2015</span></div>
<div style=""><span style="font-size:14px">running LAPW1 in
parallel mode (using .machines)</span></div>
<div style=""><span style="font-size:14px">1
number_of_parallel_jobs</span></div>
<div style=""><span style="font-size:14px"> z1-17 z1-17
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18
z1-18 z1-18 z1-18</span></div>
<div style=""><span style="font-size:14px"> z1-18 z1-18
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18(8) 469689.261u
1680.003s 8:12:29.52 1595.1% 0+0k 204560+31265944io
366pf+0w</span></div>
<div style=""><span style="font-size:14px"> Summary of
lapw1para:</span></div>
<div style=""><span style="font-size:14px"> z1-17
k=0 user=0 wallclock=0</span></div>
<div style=""><span style="font-size:14px">469788.683u
1726.356s 8:12:31.33 1595.5% 0+0k
206128+31266512io 379pf+0w</span></div>
<div style=""><span style="font-size:14px">> lapw2 -p
-c (01:50:32) running LAPW2 in parallel mode</span></div>
<div style=""><span style="font-size:14px"> z1-17
0.034u 0.040s 1:35.16 0.0% 0+0k 10696+0io 80pf+0w</span></div>
<div style=""><span style="font-size:14px"> Summary of
lapw2para:</span></div>
<div style=""><span style="font-size:14px"> z1-17
user=0.034 wallclock=95.16</span></div>
<div style=""><span style="font-size:14px">** LAPW2
crashed!</span></div>
<div style=""><span style="font-size:14px">4.645u 0.458s
1:42.01 4.9% 0+0k 74792+45008io 133pf+0w</span></div>
<div style=""><span style="font-size:14px">error: command
/home/stretch/flung/DFT/WIEN2k/lapw2cpara -c lapw2.def
failed</span></div>
<div style=""><span style="font-size:14px"><br>
</span></div>
<div style=""><span style="font-size:14px">> stop error</span></div>
</div>
<div style="font-size:14px"><br>
</div>
</div>
<div bgcolor="#FFFFFF" text="#000000" style="font-size:14px">-----------------------------<br>
<br>
>When it runs fine on a single node, does it always use the
same node (say z1-17) or does it run fine on other nodes (like
z1-18)?<br>
<br>
Not really. The nodes were assigned randomly.<br>
</div>
</div>
</blockquote>
</body>
</html>