Hi Professor Blaha,

Thanks for the clarification - it hadn't been clear to me that there is no
k-point parallelization via mpi. I am using a suitable (local-disk) $SCRATCH.
I think I have actually discovered what the problem could be. It is apparently
connected to how ssh is configured on the system. During one of the runs I
monitored the ssh connections from the master node and saw one of them hang. I
then found that other users had had similar problems in different contexts,
and I circumvented most of them by setting up a ~/.ssh/config file with the
following lines:
    ConnectionAttempts 300
    ConnectTimeout 3
    TCPKeepAlive yes
    ServerAliveInterval 15
    ServerAliveCountMax 20

These settings keep the ssh connection alive and retry many times with a short
timeout. With them in place I managed to finish a whole calculation over 15
processors that had previously died at some point in the second or third scf
cycle. I am posting this in case someone else runs into the same problem.
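In case it helps anyone debugging the same thing: the options can also be
tested one-off on the command line before touching ~/.ssh/config (node01 is
just a placeholder for one of your compute nodes):

    # keepalive probes every 15 s; give up only after 20 missed replies
    ssh -o TCPKeepAlive=yes -o ServerAliveInterval=15 \
        -o ServerAliveCountMax=20 node01 'sleep 600; echo still alive'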
Thanks for your attention! Now I can start bugging people with more relevant
issues :)

Marcos

On Thu, Aug 5, 2010 at 9:03 AM, Peter Blaha <pblaha@theochem.tuwien.ac.at> wrote:
> Please read the UG (section about parallelization).
> There is no k-parallelization using mpi.
>
> PS: Did you set a local SCRATCH directory? Using a suitable $SCRATCH, all
> big files should go to a local disk.
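> For instance (the path is site-specific, so take this as a sketch; any
> filesystem local to each node will do):
>
>   export SCRATCH=/tmp/$USER   # in ~/.bashrc, so every node picks it up
>   mkdir -p $SCRATCH           # the directory must exist on each node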
> PPS: Check the input sections of lapw0, lapw1 and lapw2 for the switch that
> further reduces the size of the outputX files.
> If this does not help, your cluster is an "unusable" machine.
> PPPS: If k-point parallelization does not work, most likely mpi will not work
> either, because there too you need to be able to read and write files reliably.
>
> Marcos Veríssimo Alves schrieb:
>> Hi all,
>>
>> Setting up the Wien2k .machines file for a parallel run using mpi is not
>> very clear to me. I have searched the list without reaching any conclusion,
>> so I am asking for your help. I'll state my problem as concisely and
>> precisely as I can.
>>
>> I am still having problems running Wien2k in parallel over k-points (that
>> is, using ssh/rsh) because our cluster's AFS seems to be really unstable.
>> So I am going to try to compile Wien2k with mvapich, since part of the
>> cluster is interconnected with InfiniBand.
>>
>> Now, the InfiniBand part of the cluster consists of 16 identical machines
>> (call them machine1...machine16) with 4 cpus each. I would like to run
>> Wien2k in parallel over k-points, but using mvapich instead of ssh. The
>> machines are assigned by a queuing system, and I have already written a
>> script that reads the machine file the queuing system provides, determines
>> which machines were assigned, and counts how many processors of each
>> machine take part in the calculation (a sketch follows below). My number of
>> k-points is not a multiple of the number of cpus assigned, so I'd like to
>> assign one k-point per processor and have the remaining k-points either
>> distributed fine-grained or assigned individually.
>>
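>> A minimal sketch of that script, assuming the queuing system writes one
>> hostname per allocated slot to a file (as e.g. PBS's $PBS_NODEFILE does;
>> adapt to whatever your system provides):
>>
>>   #!/bin/sh
>>   # Tally how many slots each assigned machine received.
>>   sort "$PBS_NODEFILE" | uniq -c | while read nproc host; do
>>       echo "$host: $nproc processors"
>>   done
>>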
>> To be more precise, suppose I have 32 k-points and the maximum number of
>> processors I could get was 9 (because all the others were busy with other
>> users' processes). Suppose the machine file assigned by the queuing system
>> was:
>>
>> machine1   (machine1: one processor)
>> machine2
>> machine2
>> machine2   (machine2: three processors)
>> machine3   (machine3: one processor)
>> machine4
>> machine4   (machine4: two processors)
>> machine5
>> machine5   (machine5: two processors)
>>
>> My question is: if all processors have the same speed, would the following
>> .machines file be valid for running **only with mpi** (no sending of
>> processes over ssh whatsoever)?
>>
>> #
>> # Hypothetical granularity:1
>> extrafine:1
>> 1:machine1:4 machine2:12 machine4:2 machine3:3 machine4:6 machine5:6
>>
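>> For reference, the plain k-point-parallel form I have been using over ssh
>> (one job line per processor slot, as I read the UG - please correct me if
>> this is already wrong) looks like:
>>
>>   granularity:1
>>   1:machine1
>>   1:machine2
>>   1:machine2
>>   1:machine2
>>   1:machine3
>>   1:machine4
>>   1:machine4
>>   1:machine5
>>   1:machine5
>>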
>> I am sorry to ask such a basic question, but I couldn't find enlightenment
>> on the list, and I find the example in the manual very confusing. Thank you
>> for any advice you can offer in this respect.
>>
>> Best regards,
>>
>> Marcos
>
> --
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-15671                 FAX: +43-1-58801-15698
> Email: blaha@theochem.tuwien.ac.at       WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------