[Wien] Doubt in mpi running of Wien2K

Marcos Veríssimo Alves marcos.verissimo.alves at gmail.com
Thu Aug 5 09:20:55 CEST 2010


Hi Professor Blaha,

Thanks for the clarification - it wasn't clear to me that k-point
parallelization using mpi doesn't exist. I am using a suitable $SCRATCH,
local to the nodes' disks.
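
(Concretely - the path and shell here are just illustrative - the job script
can set something like

    setenv SCRATCH /tmp/$USER    # csh syntax; a node-local directory

so that the large case files stay on each node's own disk.)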

I think I have actually discovered what the problem could be. It is
apparently connected to how ssh is configured on the system. During one of
the runs I kept monitoring the ssh connections from the master node and saw
that one of them had hung. I then found that other users had had similar
problems in different contexts, and I circumvented most of them by setting
up a ~/.ssh/config file with the following lines:

    # retry the connection up to 300 times...
    ConnectionAttempts 300
    # ...allowing each attempt only 3 seconds to connect
    ConnectTimeout 3
    # enable TCP-level keepalives
    TCPKeepAlive yes
    # send an ssh-level keepalive probe every 15 seconds...
    ServerAliveInterval 15
    # ...and drop the connection only after 20 unanswered probes
    ServerAliveCountMax 20

These settings keep the ssh connections alive and retry many times with a
short timeout per attempt. With them I managed to finish a whole calculation
over 15 processors that previously would die at some point in the second or
third scf cycle. I am posting this in case someone runs into the same problem.
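
In case it helps with diagnosing this kind of trouble, here is a minimal
sketch (not part of my actual setup; the nodefile name is hypothetical) that
probes each assigned node once over ssh before a run is started:

    # probe_ssh.py - check that every node assigned by the queue answers over ssh
    import os
    import subprocess

    DEVNULL = open(os.devnull, "w")

    with open("machines.queue") as f:  # hypothetical nodefile written by the queue
        hosts = sorted(set(line.strip() for line in f if line.strip()))

    for host in hosts:
        # BatchMode avoids hanging on a password prompt; short connect timeout
        rc = subprocess.call(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"],
            stdout=DEVNULL, stderr=DEVNULL)
        print("%s: %s" % (host, "ok" if rc == 0 else "FAILED"))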

Thanks for your attention! Now I can start bugging people with more relevant
issues :)

Marcos
On Thu, Aug 5, 2010 at 9:03 AM, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:

> Please read the UG (section about parallelization).
> There is no k-parallelization using mpi.
>
> PS: Did you set a local SCRATCH directory? With a suitable $SCRATCH, all
> big files should go to a local disk.
> PPS: Check the input sections of lapw0, lapw1 and lapw2 for the switch to
> further reduce the size of the outputX files.
> If this does not help, your cluster is an "unusable" machine.
> PPPS: If the k-point parallel mode does not work, most likely mpi will not
> work either, because in that case you also need to be able to write/read
> files reliably.
>
> Marcos Veríssimo Alves wrote:
>
>> Hi all,
>>
>> Setting up the .machines file of Wien2K for a parallel run using mpi is
>> not very clear to me. I have searched the list without reaching any
>> conclusions, so I am asking for your help. I'll state my problem as
>> concisely and precisely as I can.
>>
>> I am still having problems running Wien2K in parallel over k-points
>> (that is, using ssh/rsh) because our cluster's AFS seems to be really
>> unstable. So I am going to try to compile Wien2K with mvapich, since part
>> of the cluster is interconnected with InfiniBand.
>>
>> Now, the InfiniBand part of the cluster is composed of 16 identical
>> machines (let's call them machine1...machine16) with 4 cpus each. I would
>> like to run Wien2K in parallel over k-points, but using mvapich instead of
>> ssh. The machines are assigned by a queuing system, and I have already
>> written a script which reads the queue's machines file and determines which
>> machines were assigned and how many processors of each participate in the
>> calculation (a sketch follows below). The number of k-points is not a
>> multiple of the number of cpus assigned, so I'd like to assign one k-point
>> per processor; the remaining k-points could either be done fine-grained or
>> assigned individually.
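>>
>> (For illustration only - this is not the actual script, just a stripped-down
>> sketch of the idea; it assumes the queue writes one hostname per line to the
>> file named in $PBS_NODEFILE, which is hypothetical here:)
>>
>>     # count how many processors of each machine the queue assigned (sketch)
>>     import os
>>     from collections import Counter
>>
>>     nodefile = os.environ["PBS_NODEFILE"]  # hypothetical; depends on the queue
>>     with open(nodefile) as f:
>>         counts = Counter(line.strip() for line in f if line.strip())
>>
>>     for machine in sorted(counts):
>>         print("%s: %d processor(s)" % (machine, counts[machine]))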
>>
>> To be more precise, suppose I have 32 k-points and the maximum number of
>> processors I got was 9 (because all the others were busy with other users'
>> processes). Supposing that the file with the machines assigned by the
>> queuing system was:
>>
>> machine1            (machine1: one processor)
>> machine2
>> machine2
>> machine2            (machine2: three processors)
>> machine3            (machine3: one processor)
>> machine4
>> machine4            (machine4: two processors)
>> machine5
>> machine5            (machine5: two processors)
>>
>> My question is: if all processors have the same speed, would the following
>> .machines file be valid for running processes **only with mpi** (no sending
>> of processes over ssh whatsoever)?
>>
>>
>> #
>> # Hypothetical granularity:1
>> extrafine:1
>> 1:machine1:4 machine2:12 machine4:2 machine3:3 machine4:6 machine5:6
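>>
>> (For comparison, if I read the UG examples right, plain k-point
>> parallelization over ssh for these nine slots would use one "1:host" line
>> per job, i.e.:
>>
>>     1:machine1
>>     1:machine2
>>     1:machine2
>>     1:machine2
>>     1:machine3
>>     1:machine4
>>     1:machine4
>>     1:machine5
>>     1:machine5
>>     granularity:1
>>     extrafine:1
>>
>> but what I am after is the mpi variant of this.)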
>>
>> I am sorry to ask a question which must be extremely basic, but I
>> couldn't find any enlightenment in the list, and I find the example in the
>> manual very confusing... I thank you for any advice you can give me in
>> that respect.
>>
>> Best regards,
>>
>> Marcos
>>
>>
>
> --
>
>                                      P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
> Email: blaha at theochem.tuwien.ac.at    WWW:
> http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------
>