[Wien] SLURM support "no ssh" for WIEN2k?

Laurence Marks L-marks at northwestern.edu
Thu Nov 12 14:49:14 CET 2015


Dear All,

As I am currently trying to get WIEN2k running on Stampede (also SLURM),
let me add a little clarification, without disagreeing with anything Peter
said.

A typical WIEN2k workflow is (very much simplified) an iterative loop
controlled by csh scripts:

1) A single serial, multithreaded, or mpi task
2) A number of multithreaded or mpi tasks running at the same time
3) A single multithreaded task

WIEN2k uses a file .machines, constructed by the user, to control these.
Peter likes to do this with csh scripts; I prefer a different approach using
an unsupported utility, Machines2W. Both need a list of the nodes available
to the user: with PBS/QSUB, for instance, this is in PBS_NODEFILE. There are
various ways to generate such a list with SLURM, one of which is sketched
below.
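
For example (a minimal sketch of one option, not an official WIEN2k recipe;
scontrol is part of SLURM, and the output filename is a placeholder), inside
an sbatch job one can expand SLURM's compact nodelist into one hostname per
line, analogous to PBS_NODEFILE:

   # SLURM_JOB_NODELIST is set by SLURM inside the job
   scontrol show hostnames $SLURM_JOB_NODELIST > nodefile.$SLURM_JOB_ID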

WIEN2k has no internal code that allows it to interrogate the batch control
system and check whether the .machines file is correct. Unfortunately, on
every OS I can think of it is possible to construct .machines incorrectly
and, as a consequence, oversubscribe nodes.
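
To illustrate (my own minimal sketch; the node names are placeholders), a
.machines file for pure k-point parallelism on two nodes, with one k-point
job per node, might look like:

   1:node1
   1:node2
   granularity:1
   extrafine:1

Each "1:host" line starts one k-point job on that host; asking for more
work per host than the cores actually allocated there is exactly how
oversubscription happens.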

In principle one can control at the OS level how parallel tasks are
executed, for instance by changing (for mpi) how mpirun bootstraps tasks.
This works fine if users are only ever going to launch a single program, e.g.

mpirun -np 64 program

in which case mpirun can simply be replaced with something else. However,
this can be inappropriate for WIEN2k and prevent it from working. I have
seen many cases on supercomputers around the world where specialized,
non-standard changes were made at the OS level and, as a consequence,
WIEN2k became inoperable. I strongly suggest care with such customizations.
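
A gentler alternative (my illustration, assuming Intel MPI; other MPI
implementations have analogous controls) is to select the launcher per job
through the environment instead of replacing mpirun system-wide:

   # tell Intel MPI's hydra launcher to start ranks via SLURM, not ssh
   setenv I_MPI_HYDRA_BOOTSTRAP slurm

This changes the bootstrap mechanism only for jobs that set it, leaving the
system installation untouched.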


On Thu, Nov 12, 2015 at 6:42 AM, Peter Blaha <pblaha at theochem.tuwien.ac.at>
wrote:

> Hi,
>
> WIEN2k has a usersguide, where the different parallelization modes are
> extensively described.
>
> On a cluster with a queuing system (like SLURM) it should not even be
> possible to access nodes (other than the frontend) via ssh outside of
> SLURM (on our SLURM machine, ssh is possible only to nodes which have been
> assigned to the user by salloc or an sbatch job); this way overloading can
> be prevented.
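
As a side note (mine, not Peter's; a sketch assuming a standard PAM setup):
this kind of ssh restriction is commonly implemented with SLURM's
pam_slurm_adopt module, via a line like the following in /etc/pam.d/sshd on
the compute nodes:

   # deny ssh logins on nodes where the user has no running SLURM job
   account    required    pam_slurm_adopt.so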
>
> We ALWAYS run our jobs using SLURM and typically submit them using
>
> sbatch slurm.job
>
> Now, you correctly mentioned that WIEN2k needs a ".machines" file and
> thus "slurm.job" has to create it on the fly.
>
> I've provided an example script (which you may need to adapt for user or
> resource specifications) at
>
> http://www.wien2k.at/reg_user/faq/pbs.html
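
To give the flavor of such a script (a minimal sketch only, not the script
from the FAQ page; the node counts, time limit, and run_lapw switches are
placeholders to adapt):

   #!/bin/csh -f
   #SBATCH -N 2                   # number of nodes (placeholder)
   #SBATCH --ntasks-per-node=16   # cores per node (placeholder)
   #SBATCH -t 24:00:00            # time limit (placeholder)

   # build .machines on the fly: one k-point job per assigned node
   rm -f .machines
   foreach n (`scontrol show hostnames $SLURM_JOB_NODELIST`)
      echo "1:$n" >> .machines
   end
   echo "granularity:1" >> .machines
   echo "extrafine:1"   >> .machines

   run_lapw -p    # k-point parallel SCF cycle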
>
> One more hint: in the file $WIENROOT/parallel_options one specifies
> whether k-point parallel (USE_REMOTE) and mpi-parallel (MPI_REMOTE) jobs
> are started using ssh (1) or not (0):
>
> setenv USE_REMOTE 1    (set to 1 or 0)
> setenv MPI_REMOTE 0
>
> With modern mpi versions, always use MPI_REMOTE=0.
> Usually k-point parallelism is meaningful only for up to 8 (then set
> OMP_NUM_THREADS=2) or 16 cores; otherwise the overhead is too big. In
> such cases one would use only ONE node, and it runs as a "shared-memory
> machine" (USE_REMOTE=0), without mpi at all.
> For medium-sized cases with a few k-points and larger matrices, a "mixed"
> k-parallel and mpi-parallel setup is best, and is used in the slurm.job
> example above.
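
For illustration (again my sketch, not Peter's; hostnames and core counts
are placeholders), a "mixed" .machines running two k-point jobs, each as a
16-process mpi job on its own node, could look like:

   1:node1:16
   1:node2:16
   granularity:1
   extrafine:1

Here "1:host:16" starts one k-point job as a 16-way mpi job on that host.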
>
>
> PS: I'm sending this also to our WIEN2k mailing list, because this is of
> general interest and I don't want to write the same email over and over
> again.
> PPS: In general, please use the mailing list (see www.wien2k.at), as I
> normally do not answer questions sent directly to me.
>
> Best regards
> Peter Blaha
>
> On 11/11/2015 07:08 PM, Robb III, George B. wrote:
> > Hi Dr. Schwartz / Dr. Blaha-
> >
> > We have noticed on our SLURM-based <http://schedmd.com/#index> research
> > cluster that the WIEN2k suite commands use a .machines file to spawn
> > ssh sessions to individual nodes rather than going through a scheduler.
> >
> > We have SLURM configured to control cluster resource allocations and
> > have collisions of resources when ssh processes are called from the
> > WIEN2k suite.
> >
> > e.g. SLURM controls node2 and has allocated 95% of its resources to
> > jobs, but when a WIEN2k process is launched from the head node it will
> > ssh to node2 (because of the 5% free resources) and spawn additional
> > processes on node2 that SLURM does not manage.  Node2 is now
> > oversubscribed.
> >
> > Does the WIEN2k suite have an administrative guide or documentation?
> > Does the WIEN2k suite have a recommended cluster manager (i.e. PBS,
> > SLURM, LSF, etc.)?
> >
> > Thanks again for any assistance; we are looking forward to having the
> > labs on our campus use the WIEN2k suite.
> >
> > Thanks,
> >
> > George B. Robb III
> > Systems Administrator
> > Research Computing Support Services - (RCSS)
> > University of Missouri System
>
> --
>
>                                        P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
> WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
> --------------------------------------------------------------------------



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
Corrosion in 4D: MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi