[Wien] PBS

Laurence Marks L-marks at northwestern.edu
Thu Jan 5 13:17:24 CET 2012


As Florent said, this is a known issue with some (not all) versions of
ssh, and it is also a torque bug. What you have to do is launch jobs
with mpirun instead of ssh, which I think you can do by setting the
MPI_REMOTE/USE_REMOTE switches. I think I posted how to do this some
time ago, so please search the mailing list. (I am in China and can
provide more information when I return next week if this is not
enough, which it probably is not.)

N.B., in case anyone wonders: with torque (PBS) you are not "supposed
to" use ssh to communicate the way Wien2k does. The torque side is not
going to move on this, so this is "Wien2k's fault". I have looked into
this quite a bit and there is no solution except to avoid ssh (or live
with zombie processes). Indeed, torque has the weakness of leaving
processes around if a code does anything more adventurous than just
run a single mpirun -- so it goes.
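
For anyone who does end up living with leftover processes, the clean-up
Peter describes in the quoted message below could be sketched roughly as
follows. This is only an illustration, not something I have tested on a
real cluster. The assumptions are mine: the host list is read from
PBS_NODEFILE (so the script is run from inside the job script), the
leftover processes have "lapw" in their process name, and the helper
names unique_hosts and cleanup_commands are made up for this example.

```python
import os


def unique_hosts(nodefile_text):
    """Return the hosts named in a node file, in first-seen order, no repeats."""
    hosts = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def cleanup_commands(hosts, user, name_pattern="lapw"):
    """Build one 'ssh <host> pkill -u <user> <pattern>' argument list per host.

    name_pattern is an assumption: it is matched by pkill against the
    process name, so "lapw" is meant to catch lapw0, lapw1, lapw2 and so on.
    """
    return [["ssh", host, "pkill", "-u", user, name_pattern] for host in hosts]


if __name__ == "__main__":
    # PBS_NODEFILE is set by torque inside a job; outside a job this does nothing.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile and os.path.exists(nodefile):
        with open(nodefile) as handle:
            hosts = unique_hosts(handle.read())
        user = os.environ.get("USER", "")
        for command in cleanup_commands(hosts, user):
            # Dry run: only print the command. Hand the list to
            # subprocess.run() once you trust what it would kill.
            print(" ".join(command))
```

The script only prints the commands on purpose, since pkill with a
pattern that is too broad will also kill other jobs of the same user
on a shared node.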

On Thu, Jan 5, 2012 at 3:22 AM, Peter Blaha
<pblaha at theochem.tuwien.ac.at> wrote:
> I've never done this myself, but as far as I know one can define a
> "prolog" script in all those queuing systems and this prolog script
> should ssh to all assigned nodes and kill all remaining jobs of this user.
>
>
> On 05.01.2012 10:17, Florent Boucher wrote:
>
>> Dear Yundi,
>> this is a known limitation of ssh and rsh: they do not pass the interrupt
>> signal to the remote host.
>> Under LSF I had a solution in the past: a specific rshlsf that did
>> this.
>> Currently I use SGE and PBS on two different clusters, and the problem
>> exists on both.
>> You will see that you are not even able to suspend a running job.
>> If someone has a solution, I would also appreciate it.
>> Regards
>> Florent
>>
>> On 04/01/2012 21:57, Yundi Quan wrote:
>>>
>>> I'm working on a cluster that uses the torque queue system. I can ssh
>>> directly to any node without a password. When I use qdel (or canceljob)
>>> jobid to terminate a running job, the job is terminated in the queue
>>> system. However, when I ssh to the nodes, the job is still running
>>> there. Does anyone know how to avoid this?
>>>
>>>
>>>
>>> _______________________________________________
>>> Wien mailing list
>>> Wien at zeus.theochem.tuwien.ac.at
>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>>
>>
>> --
>> Florent BOUCHER
>> Institut des Matériaux Jean Rouxel
>> 2, rue de la Houssinière
>> BP 32229
>> 44322 NANTES CEDEX 3 (FRANCE)
>> Mailto:Florent.Boucher at cnrs-imn.fr
>> Phone: (33) 2 40 37 39 24
>> Fax:   (33) 2 40 37 39 95
>> http://www.cnrs-imn.fr
>>
>
>
> --
>
>                                      P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Györgyi

