[Wien] PBS

Florent Boucher Florent.Boucher at cnrs-imn.fr
Fri Jan 6 16:35:45 CET 2012


Dear Laurence,
your last lines are exactly what we need!
Thank you for this.
> set remote = "/bin/csh $WIENROOT/pbsh"
>
> $WIENROOT/pbsh is just
> mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "
I will try it, but I am pretty sure that it will work fine.
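
For reference, here is a minimal dry-run sketch of how the wrapper maps its two arguments onto the mpirun command line. It is written in POSIX sh rather than csh so it can be tried outside WIEN2k, and it only prints the command instead of launching anything. The function name pbsh_cmdline, the host name and the command are hypothetical. The assumption, suggested by the $1 and $2 in the one-liner, is that the host comes first and the command string second.

```shell
#!/bin/sh
# Dry-run sketch of the pbsh wrapper quoted above (illustration only).
# pbsh_cmdline is a hypothetical helper: it prints the mpirun command
# line that the one-liner would execute, without running it.
# Assumption: $1 is the target host, $2 is the command string.
pbsh_cmdline() {
    host=$1
    cmd=$2
    printf '%s\n' "mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $host /bin/csh -c \" $cmd \""
}

# Hypothetical host name and command, for illustration only.
pbsh_cmdline node01 "cd /tmp; hostname"
```

Since the remote command is started by mpirun rather than by ssh, it should be among the tasks the queuing system knows about, which is the point made in the quoted message below.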
Regards
Florent

On 05/01/2012 20:16, Laurence Marks wrote:
> I gave a slightly jetlagged response -- for certain WIEN2k style works
> fine with all queuing systems.
>
> But...it may not fit how the queuing system has been designed and
> admins may not be accommodating. My understanding (second hand) is that
> torque is designed to work well with openmpi for accounting, and by
> default knows nothing about tasks created by ssh. When the user's time
> has elapsed it will terminate those tasks it knows about (the main one
> plus anything using mpirun) and ignore anything else. Hence for
> clusters where killing an ssh on node A does not propagate a kill to
> children on node B (which depends upon the ssh) one is left with
> processes that can run forever. There is something called an epilog
> script which maybe can do this, but it would need WIEN2k to create one
> every time it launches a set of tasks. Possible, but not trivial.
>
> Note: this is not just a WIEN2k problem. One of the admins at NU's
> large cluster is a friend, and he tells me that every now and then he
> goes around and tries to clean up tasks left running like this on
> nodes from all sorts of software. Sometimes he has to reboot nodes
> since if torque believes there is nothing running on a node it will
> merrily create more tasks on it which can lead to heavy
> oversubscription and hang the node.
>
> And...just to make life more fun, torque knows nothing about MKL
> threading so on an 8-core node can easily start 8 different non-mpi
> jobs and if they all want 8 threads...
>
> Probably too long a response. Below is the parallel_options file that
> I use on a system with moab (similar, perhaps worse than pbs) where I
> try and be a "gentleman" and set the mkl threading as well as use
> mpirun to launch tasks.
>
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
> set a=`grep -e "1:" .machines | grep -v lapw0 | head -1 | cut -f 3 -d: | cut -c 1-2`
> setenv MKL_NUM_THREADS $a
> setenv OMP_NUM_THREADS $a
> setenv MKL_DYNAMIC FALSE
> if (-e local_options ) source local_options
> set remote = "/bin/csh $WIENROOT/pbsh"
> set delay   = 0.25
>
> $WIENROOT/pbsh is just
> mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "
>
> With this at least I don't create problems (hopefully).
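
The thread-count line in the parallel_options above can be checked on its own. The sketch below runs the same grep/head/cut pipeline on a sample .machines file. It is written in POSIX sh for convenience (the pipeline itself is unchanged from the csh version). The function name threads_from_machines, the host names and the counts are made up, and the sample assumes k-point lines of the form weight:host:count.

```shell
#!/bin/sh
# Sketch: apply the pipeline from the quoted parallel_options to a
# sample .machines file. threads_from_machines is a hypothetical name.
threads_from_machines() {
    # Take the first line containing "1:" that is not the lapw0 line,
    # extract its third colon-separated field, and keep at most the
    # first two characters of that field.
    grep -e "1:" "$1" | grep -v lapw0 | head -1 | cut -f 3 -d: | cut -c 1-2
}

# Hypothetical sample file; host names and counts are made up.
sample=$(mktemp)
cat > "$sample" <<'EOF'
lapw0: node01:8
1:node01:8
1:node02:8
granularity:1
EOF

threads_from_machines "$sample"   # prints 8
rm -f "$sample"
```

Note that the pattern "1:" also matches any line whose host name ends in 1, so the result relies on head -1 picking the intended line.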
>
> On Thu, Jan 5, 2012 at 7:19 AM, Peter Blaha
> <pblaha at theochem.tuwien.ac.at>  wrote:
>> It is NOT true that queuing systems cannot do the "WIEN2k style".
>>
>> We have two big clusters and run all three types of jobs on them:
>> i) only ssh (k-parallel), ii) only mpi-parallel (no ssh), and also
>> iii) of mixed type.
>>
>> And of course the administrators configured the "sun grid engine" so that it
>> makes sure that there are no processes running when a job finishes, and
>> eventually kills all processes of a batch job on all the assigned nodes after
>> it has finished.
>>
>> It's just a matter of whether the system programmers are willing (or able ??)
>> to reconfigure the queuing system...
>>
>> PS: If you are running mpi-parallel   use    setenv MPI_REMOTE 0 in
>> $WIENROOT/parallel_options and ssh will not be used anyway.
>>
>> On 05.01.2012 13:17, Laurence Marks wrote:
>>> As Florent said, this is a known issue with some (not all) versions of ssh,
>>> and it is also a torque bug. What you have to do is use mpirun instead of ssh
>>> to launch jobs which I think you can do by setting the MPI_REMOTE/USE_REMOTE
>>> switches. I think I posted how to do this sometime ago, so please search the
>>> mailing list. (I am in China and can provide more information next week when
>>> I return if this is not enough, which it probably is not.)
>>> N.B., in case anyone wonders with torque (PBS) you are not "supposed to"
>>> use ssh to communicate the way Wien2k does. They are not going to move on
>>> this so this is "Wien2k's fault". I've looked in to this quite a bit and
>>> there is no solution except to avoid ssh (or live with zombie processes).
>>> Indeed, torque has the weakness of leaving processes around if a code does
>>> anything more adventurous than just run a single mpirun -- so it goes.
>>> On Thu, Jan 5, 2012 at 3:22 AM, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:
>>>> I've never done this myself, but as far as I know one can define a
>>>> "prolog" script in all those queuing systems and this prolog script
>>>> should ssh to all assigned nodes and kill all remaining jobs of this
>>>> user.
>>>>
>>>> On 05.01.2012 10:17, Florent Boucher wrote:
>>>>> Dear Yundi,
>>>>> this is a known limitation of ssh and rsh, which do not pass the
>>>>> interrupt signal to the remote host.
>>>>> Under LSF I had a solution in the past. It was a specific rshlsf for
>>>>> doing this.
>>>>> Currently I use either SGE or PBS on two different clusters and the
>>>>> problem exists.
>>>>> You will see that you are not even able to suspend a running job.
>>>>> If someone has a solution, I would also appreciate it.
>>>>> Regards
>>>>> Florent
>>>>>
>>>>> On 04/01/2012 21:57, Yundi Quan wrote:
>>>>>> I'm working on a cluster using the torque queue system. I can directly
>>>>>> ssh to any node without using a password. When I use qdel (or
>>>>>> canceljob) jobid to terminate a running job, the job will be
>>>>>> terminated in the queue system. However, when I ssh to the nodes, the
>>>>>> jobs are still running. Does anyone know how to avoid this?
>>>
>>> --
>>> Professor Laurence Marks
>>> Department of Materials Science and Engineering
>>> Northwestern University
>>> www.numis.northwestern.edu 1-847-491-3996
>>> "Research is to see what everybody else has seen, and to think what
>>> nobody else has thought"
>>> Albert Szent-Gyorgi
>>
>> --
>>
>>                                       P.Blaha
>> --------------------------------------------------------------------------
>> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
>> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
>> Email: blaha at theochem.tuwien.ac.at    WWW:
>> http://info.tuwien.ac.at/theochem/
>> --------------------------------------------------------------------------
>>
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
>



-- 
  -------------------------------------------------------------------------
| Florent BOUCHER                    |                                    |
| Institut des Matériaux Jean Rouxel | Mailto:Florent.Boucher at cnrs-imn.fr |
| 2, rue de la Houssinière           | Phone: (33) 2 40 37 39 24          |
| BP 32229                           | Fax:   (33) 2 40 37 39 95          |
| 44322 NANTES CEDEX 3 (FRANCE)      | http://www.cnrs-imn.fr             |
  -------------------------------------------------------------------------


