    <div class="moz-cite-prefix">Ok, now it is clear that there is no
      additional error messages.  Unfortunately, I cannot tell
      specifically what went wrong from those error messages.<br>
      <br>
      You might try replacing mpirun with mpirun_rsh.  As you can see at<br>
      <br>
<a class="moz-txt-link-freetext" href="http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-May/004402.html">http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-May/004402.html</a><br>
      <br>
      they replaced mpirun with mpirun_rsh, and it seems that they found
      a problem with passwordless ssh.  <br>
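
For reference, a minimal sketch of what that change could look like in
$WIENROOT/parallel_options, assuming MVAPICH2's mpirun_rsh and the usual
WIEN2k placeholders (_NP_, _HOSTS_, _EXEC_); the exact launcher line and
flags may differ on your installation:

    # typical default launcher line in parallel_options
    setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
    # hypothetical replacement using MVAPICH2's mpirun_rsh
    setenv WIEN_MPIRUN "mpirun_rsh -np _NP_ -hostfile _HOSTS_ _EXEC_"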

In your parallel_options file, you might also want to change
"setenv USE_REMOTE 0" to "setenv USE_REMOTE 1", and then try both 0 and
1 for MPI_REMOTE, to check whether any of these other configurations
work while using mpirun.
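
As a sketch, the combination to test first would look like this in
parallel_options (USE_REMOTE and MPI_REMOTE are the variable names your
file already uses, as set by siteconfig):

    setenv USE_REMOTE 1
    setenv MPI_REMOTE 0    # then repeat the test with MPI_REMOTE set to 1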

On 5/6/2015 11:23 AM, lung Fermin wrote:

Thanks for the reply. Please see below.

        <div bgcolor="#FFFFFF" text="#000000" style="font-size:14px">>As
          I asked before, did you give us all the error information in
          the case.dayfile and from standard output?  It is not entirely
          clear in your previous posts, but it looks to me that you
          might have only provided information from the case.dayfile and
          the error files (cat *.error), but maybe not from the standard
          output.  Are you still using the PBS script in your old post
          at <a moz-do-not-send="true"
href="http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html"
            target="_blank">http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html</a> ? 
          In the script, I can see that the standard output is set to be
          written to a file called wien2k_output.<br>
          <br>
          <br>
          Sorry for the confusion. Yes, I still use the PBS script in
          the above link. The posts before are from the standard outputs
          (wien2k). When using 2 nodes with 32 cores for one k point,
          the standard output gives</div>
        <div bgcolor="#FFFFFF" text="#000000" style="">
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">----------------------------</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">Warning: no access to tty (Bad file
              descriptor).</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">Thus no job control in this shell.</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
              z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
              z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-1</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">8 z1-18 z1-18 z1-18 z1-18 z1-18
              z1-18 z1-18 z1-18</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">number of processors: 32</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px"> LAPW0 END</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[16] Failed to dealloc pd (Device
              or resource busy)</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[0] Failed to dealloc pd (Device or
              resource busy)</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[17] Failed to dealloc pd (Device
              or resource busy)</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[2] Failed to dealloc pd (Device or
              resource busy)</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[18] Failed to dealloc pd (Device
              or resource busy)</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[1] Failed to dealloc pd (Device or
              resource busy)</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px"> LAPW1 END</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">LAPW2 - FERMI; weighs written</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[z1-17:mpispawn_0][child_handler]
              MPI process (rank: 0, pid: 28291) terminated with signal 9
              -> abort job</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[z1-17:mpispawn_0][readline]
              Unexpected End-Of-File on file descriptor 9. MPI process
              died?</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[z1-17:mpispawn_0][mtpmi_processops]
              Error while reading PMI socket. MPI process died?</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[z1-17:mpirun_rsh][process_mpispawn_connection]
              mpispawn_0 from node z1-17 aborted: Error while reading a
              PMI socket (4)</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[z1-18:mpispawn_1][read_size]
              Unexpected End-Of-File on file descriptor 21. MPI process
              died?</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[z1-18:mpispawn_1][read_size]
              Unexpected End-Of-File on file descriptor 21. MPI process
              died?</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">[z1-18:mpispawn_1][handle_mt_peer]
              Error while reading PMI socket. MPI process died?</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">cp: cannot stat `.in.tmp': No such
              file or directory</span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px"><br>
            </span></div>
          <div bgcolor="#FFFFFF" text="#000000" style=""><span
              style="font-size:14px">>   stop error</span></div>
          <div style="font-size:14px">-----------------------------</div>
          <div style="font-size:14px">And the .dayfile reads:</div>
          <div style="font-size:14px"><br>
          </div>
          <div style="">
            <div style=""><span style="font-size:14px">on z1-17 with PID
                29439</span></div>
            <div style=""><span style="font-size:14px">using WIEN2k_14.2
                (Release 15/10/2014) </span></div>
            <div style=""><span style="font-size:14px"><br>
              </span></div>
            <div style=""><span style="font-size:14px">    start      
                (Thu Apr 30 17:36:59 2015) with lapw0 (40/99 to go)</span></div>
            <div style=""><span style="font-size:14px"><br>
              </span></div>
            <div style=""><span style="font-size:14px">    cycle 1    
                (Thu Apr 30 17:36:59 2015)  (40/99 to go)</span></div>
            <div style=""><span style="font-size:14px"><br>
              </span></div>
            <div style=""><span style="font-size:14px">>   lapw0 -p  
                 (17:36:59) starting parallel lapw0 at Thu Apr 30    </span><span
                style="font-size:14px">17:36:59  2015</span></div>
            <div style=""><span style="font-size:14px">--------
                .machine0 : 32 processors</span></div>
            <div style=""><span style="font-size:14px">904.074u 8.710s
                1:01.54 1483.2% 0+0k 239608+78400io 105pf+0w</span></div>
            <div style=""><span style="font-size:14px">>   lapw1  -p
                  -c      (17:38:01) starting parallel lapw1 at Thu Apr
                30 17:38:01 2015</span></div>
            <div style=""><span style="font-size:14px">->  starting
                parallel LAPW1 jobs at Thu Apr 30 17:38:01 2015</span></div>
            <div style=""><span style="font-size:14px">running LAPW1 in
                parallel mode (using .machines)</span></div>
            <div style=""><span style="font-size:14px">1
                number_of_parallel_jobs</span></div>
            <div style=""><span style="font-size:14px">     z1-17 z1-17
                z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
                z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18
                z1-18 z1-18 z1-18</span></div>
            <div style=""><span style="font-size:14px"> z1-18 z1-18
                z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18(8) 469689.261u
                1680.003s 8:12:29.52 1595.1%      0+0k 204560+31265944io
                366pf+0w</span></div>
            <div style=""><span style="font-size:14px">   Summary of
                lapw1para:</span></div>
            <div style=""><span style="font-size:14px">   z1-17        
                k=0     user=0  wallclock=0</span></div>
            <div style=""><span style="font-size:14px">469788.683u
                1726.356s 8:12:31.33 1595.5%        0+0k
                206128+31266512io 379pf+0w</span></div>
            <div style=""><span style="font-size:14px">>   lapw2 -p  
                -c       (01:50:32) running LAPW2 in parallel mode</span></div>
            <div style=""><span style="font-size:14px">      z1-17
                0.034u 0.040s 1:35.16 0.0% 0+0k 10696+0io 80pf+0w</span></div>
            <div style=""><span style="font-size:14px">   Summary of
                lapw2para:</span></div>
            <div style=""><span style="font-size:14px">   z1-17        
                user=0.034      wallclock=95.16</span></div>
            <div style=""><span style="font-size:14px">**  LAPW2
                crashed!</span></div>
            <div style=""><span style="font-size:14px">4.645u 0.458s
                1:42.01 4.9%      0+0k 74792+45008io 133pf+0w</span></div>
            <div style=""><span style="font-size:14px">error: command  
                /home/stretch/flung/DFT/WIEN2k/lapw2cpara -c lapw2.def  
                failed</span></div>
            <div style=""><span style="font-size:14px"><br>
              </span></div>
            <div style=""><span style="font-size:14px">>   stop error</span></div>
          </div>
          <div style="font-size:14px"><br>
          </div>
        </div>
        <div bgcolor="#FFFFFF" text="#000000" style="font-size:14px">-----------------------------<br>
          <br>
          >When it runs fine on a single node, does it always use the
          same node (say z1-17) or does it run fine on other nodes (like
          z1-18)?<br>
          <br>
          Not really. The nodes were assigned randomly.<br>
        </div>