[Wien] Why is "sleep $delay" commented out in lapw1para_lapw?

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue Apr 7 07:51:26 CEST 2015


The answer to your query is not so easy.

Consider the following:

Some clusters have NO problem with filesystem timing and can handle multiple
logins (ssh) "at the same time"; others react more slowly and need a delay (of up to one second).

At the moment our clusters do not have any problems, so I commented this line out (for testing),
but apparently forgot to do the same in lapw2para.

It also depends a lot on the types of jobs you submit:

If you have only a few k-points (e.g. 10) and each lapw1 step takes 10 minutes anyway, even
a delay of 1 second does not really matter.

If you have many more k-parallel jobs (100 or more) and each lapw1 step takes only 1 second,
even a delay of 0.1 sec can be unacceptable.
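
As a rough illustration of these two cases (a sketch only, not part of WIEN2k): assume the k-parallel jobs are launched one after another with one sleep per job, so the added launch time is roughly the number of jobs times the delay. The function name and this simple model are assumptions made for the example.

```python
def launch_overhead(n_jobs, delay_s, step_time_s):
    """Estimate the time added by sleeping once per launched job.

    Assumes (for illustration) sequential launches with one sleep each.
    Returns (added_seconds, added_seconds / step_time_s).
    """
    added = n_jobs * delay_s
    return added, added / step_time_s


# Few k-points, long lapw1 steps: 10 jobs, 1 s delay, 10 min per step.
few_long = launch_overhead(10, 1.0, 600.0)

# Many k-parallel jobs, short steps: 100 jobs, 0.1 s delay, 1 s per step.
many_short = launch_overhead(100, 0.1, 1.0)

print("few/long:   +%.1f s, %.1f%% of a step" % (few_long[0], 100 * few_long[1]))
print("many/short: +%.1f s, %.0fx a step" % (many_short[0], many_short[1]))
```

Under this model both cases add about 10 seconds, but that is under 2% of a 10-minute step and about ten times a 1-second step.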

For your case: increase the delay until your runs are stable.
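
One way to organise that search, as a minimal sketch: try candidate delays in increasing order and keep the first one for which the runs are stable. The function `first_stable_delay` and the `is_stable` callback are hypothetical; in practice the check would be a test run on your own cluster with that value of `delay`, and the threshold in the stand-in check below is invented purely for illustration.

```python
from typing import Callable, Iterable, Optional


def first_stable_delay(candidates: Iterable[float],
                       is_stable: Callable[[float], bool]) -> Optional[float]:
    """Return the smallest candidate delay for which is_stable(delay) is True,
    or None if no candidate passes."""
    for delay in sorted(candidates):
        if is_stable(delay):
            return delay
    return None


# Stand-in for a real test run (hypothetical): pretend this cluster
# needs at least 0.25 s between launches.
def fake_cluster_check(delay: float) -> bool:
    return delay >= 0.25


chosen = first_stable_delay([0.1, 0.25, 0.5, 1.0], fake_cluster_check)
print("smallest stable delay:", chosen)
```

Starting from the smallest value keeps the launch overhead as low as the cluster allows.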

On 06.04.2015 at 21:37, David Olmsted wrote:
> Laurence,
>    Thank you for the response.  As I mentioned in my first try at this issue last week, I have put the "sleep $delay" back in, and it does seem to have helped.  I think I have fewer failures when the lapw1 processes are being started.  But I still have some, so I am not certain.  I am still working with the cluster's consultants on this.
>
>    Nonetheless, the other scripts do have "sleep $delay", and at the top of lapw1para_lapw it does say
> #In this section use 0 to turn of an option, 1 to turn it on,
> #respectively choose a value
>
> set useremote   = 1             # using remote shell to launch processes
> set mpiremote   = 1             # using remote shell to launch mpi
> set delay       = 0.1           # delay launching of processes by n seconds
> set sleepy      = 1.0           # additional sleep before checking
> set debug       = 0             # verbosity of debugging output
> set taskset0
> set taskset=no
>
>     Given those two things, it seems to me that it would be more appropriate for the delay to actually exist in lapw1para_lapw.  But not my call.
>
>     Thank you for your help.
>
> Cheers,
>    David
>
>
> -----Original Message-----
> From: wien-bounces at zeus.theochem.tuwien.ac.at [mailto:wien-bounces at zeus.theochem.tuwien.ac.at] On Behalf Of Laurence Marks
> Sent: Monday, April 06, 2015 12:14 PM
> To: A Mailing list for WIEN2k users
> Subject: Re: [Wien] Why is "sleep $delay" commented out in lapw1para_lapw?
>
> Dear David,
>
> I think the answer to your question "why" is "because".
>
> Often for things like this it is some combination of "seat of the pants" gut instinct and KISS. I am not certain why I used 0.25 in my version, and I think I have recently reduced it to 0.1. I will admit that I never tested in great detail whether 0.25 was better or worse; it really will depend heavily upon the cluster.
>
> Similarly I suspect the delay between launching ssh was probably removed as it did not seem to matter. My suggestion would be to put it back and see if it helps.
>
> I agree that it would be better to have this (and various other
> things) set in parallel_options.
>
> Not the most clear answer, sorry.
>
> On Mon, Apr 6, 2015 at 11:32 AM, David Olmsted <olmsted at berkeley.edu> wrote:
>> Hi,
>>
>>     There has been no response to my suggestion that in lapw1para_lapw, the
>> line “#    sleep $delay” be changed to “sleep $delay”.  Perhaps I have not
>> given enough information.  In the userguide there is no mention of “delay”.
>> In the archive I find nothing explaining why the line is commented
>> out.  (Or even explaining that it is commented out.)  In
>> lapw2para_lapw, for example, the “sleep $delay” line is actually in
>> use, rather than commented out.  The same is true in some of the other
>> scripts.  Why the difference in lapw1para_lapw?
>>
>>
>>
>> I am using version 14.2 on a large Linux cluster with TORQUE.  I was using
>> a revised version of a parallel_options file from a post by Laurence Marks
>> which included “set delay   = 0.25”, and was surprised to discover this did
>> not actually take effect in lapw1para_lapw.
>>
>>
>>
>> Thanks,
>>
>> David
>>
>>
>>
>> David Olmsted
>>
>> Assistant Research Engineer
>>
>> Materials Science and Engineering
>>
>> 210 Hearst Memorial Mining Building
>>
>> University of California
>>
>> Berkeley, CA 94720-1760
>>
>>
>
>
>
> --
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> Corrosion in 4D: MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what nobody else has thought"
> Albert Szent-Gyorgi
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>

-- 
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------

