[Wien] .machines for several nodes
Peter Blaha
pblaha at theochem.tuwien.ac.at
Mon Oct 12 11:58:22 CEST 2020
Yes, this is ok when you have nodes with 16 cores !!!
(Only the lapw0 line could use :16 instead of :8 if you have 96 atoms,
but most likely this is fairly negligible.)
Yes, the QTL calculation in lapw2 is also affected by the
parallelization, but it reads from a .processes file, which is created
by lapw1.
If you run x lapw2 -p -qtl in an extra job, you should add the
following line to create a "correct" .processes file:
x lapw1 -p -d >&/dev/null    # Create .processes (necessary for standalone lapw2)
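
For example, a standalone QTL job could look like this (a minimal
sketch; it assumes the SCF run has finished and that the same .machines
file is still present in the case directory):

x lapw1 -p -d >&/dev/null    # rebuild .processes without running lapw1
x lapw2 -p -qtl              # k-point-parallel QTL calculation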
On 10/12/20 11:45 AM, Christian Søndergaard Pedersen wrote:
> This went a long way towards clearing up my confusion, thanks again. I
> will try starting an MPI-parallel calculation on 4 nodes with 16 cores
> each, using the following .machines file:
>
> 1:g008:16
> 1:g021:16
> 1:g025:16
> 1:g028:16
> lapw0: g008:8 g021:8 g025:8 g028:8
>
> dstart: g008:8 g021:8 g025:8 g028:8
>
>
> ... and see how it performs. If the matrix sizes are small, I understand
> that I could also have each node work on 2 (or more) k-points at the
> same time, by specifying:
>
>
> 1:g008:8
> 1:g008:8
> 1:g021:8
> 1:g021:8
> 1:g025:8
> 1:g025:8
> 1:g028:8
> 1:g028:8
>
> so that for instance g008 will work on 2 k-points using 8 cores for each
> k-point, am I right? And a (hopefully) final question: since qtl
> according to the manual runs k-point parallel, is it also affected by
> the parallelization scheme specified for lapw1 and lapw2 (unless I
> deliberately change it)?
>
>
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Ruh,
> Thomas <thomas.ruh at tuwien.ac.at>
> *Sent:* 12 October 2020 10:59:09
> *To:* A Mailing list for WIEN2k users
> *Subject:* Re: [Wien] .machines for several nodes
>
> I am afraid there is still some confusion.
>
>
> First about /lapw1/:
>
> Sorry for my unclear statement - I meant that you need one line per
> k-parallel job, in the sense that #lines k-points are run simultaneously,
> i.e. if you specify this part of the .machines file like this:
>
>
> 1:g008:16
>
> 1:g021:16
>
> 1:g025:16
>
> 1:g028:16
>
>
> your k-point list will be split into 4 parts of 56 k-points each [1],
> which will be processed step by step. Node g008 will work on its first
> k-point, while node g021 will do the same for its first k-point, and so on.
>
> You need the ":16" after the name of the node. Otherwise, on every node
> only *one* core would be used. Whether it is useful to use 16 mpi-parallel
> jobs per k-point (meaning that the matrices will be distributed over 16
> cores, with each core getting only 1/16 of the matrix elements) depends
> on your matrix sizes (which in turn depend on your rkmax). You should
> check that by grepping for :rkm in your case.scf file. If the matrix size
> there is small, using OMP_NUM_THREADS 16 might be much faster (since MPI
> adds overhead to your calculation).
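>
> For example (a sketch - adjust the case name; the :RKM line is whatever
> lapw1 writes to your scf file):
>
> grep -i :rkm case.scf | tail -1     # matrix size of the latest cycle
> setenv OMP_NUM_THREADS 16           # csh; in bash: export OMP_NUM_THREADS=16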
>
>
>
> Regarding /lapw0/dstart/:
>
> The way you set the calculation up could lead to (possibly severe)
> overloading of your nodes: WIEN2k will start 24 jobs on each node (so
> 1.5 times the number of cores) at the same time, each doing the
> calculation for 1 atom.
>
> As one possible alternative, you could specify only 8 cores per node
> (for example "lapw0: g008:8" and so on), i.e. 8 jobs per node, which
> would lead to step-by-step calculations of 3 atoms per core.
>
> Which option is faster is hard to tell and depends a lot on your hardware.
>
>
> So what you could do - in principle - is to test multiple configurations
> (you can modify your .machines file on the fly during an SCF run) in the
> first cycles, compare the times (in case.dayfile), and use the fastest
> one for the rest of the run.
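>
> For example (a sketch; the dayfile layout may differ slightly between
> WIEN2k versions):
>
> grep -e lapw0 -e lapw1 -e lapw2 case.dayfile   # per-step timestamps of each cycle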
>
>
>
> Regards,
> Thomas
>
>
> [1] Sidenote: This splitting is controlled by the first number - in this
> case 4 equal sublists will be set up - and you could also specify different
> "weights", for instance if your nodes are of different speeds. The
> .machines file could then read, for example:
>
>
> 3:g008:16
>
> 2:g021:16
>
> 2:g025:16
>
> 1:g028:16
>
>
> In this case, the first node would "get" 3/8 of the k-points (84), nodes
> g021 and g025 would get 2/8 each (56), and the last one (because it is
> very slow) would get only 28 k-points.
>
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of
> Christian Søndergaard Pedersen <chrsop at dtu.dk>
> *Sent:* Monday, 12 October 2020 10:24
> *To:* A Mailing list for WIEN2k users
> *Subject:* Re: [Wien] .machines for several nodes
>
> Thanks a lot for your answer. After re-reading the relevant pages in the
> User Guide, I am still left with some questions. Specifically, I am
> working with a system containing 96 atoms (as described in the
> case.struct file) and 224 inequivalent k-points; i.e. I requested 500
> k-points, which were distributed as a 7x8x8 grid (448 total) and reduced
> to 224 by symmetry. Running on 4 nodes each with 16 cores, I want each
> of the 4 nodes to calculate 56 k-points (224/4 = 56). Meanwhile, each
> node should handle 24 atoms (96/4 = 24).
>
>
> Part of my confusion stems from your suggestion that I repeat the line
> "1:g008:4 [...]" a number of times equal to the number of k-points I
> want to run in parallel, and that each repetition should refer to a
> different node. The reason is that the line in question already contains
> the names of all four nodes that were assigned to the job. However,
> combining your advice with the example on page 86, the lines should read:
>
>
> 1:g008
>
> 1:g021
>
> 1:g025
>
> 1:g028 # k-points distributed over 4 jobs, running on 1 node each
>
> extrafine:1
>
>
> As for the parallelization over atoms for dstart and lapw0, I
> understand that the numbers assigned to each individual node should sum
> up to the number of atoms in the system, like this:
>
>
> dstart:g008:24 g021:24 g025:24 g028:24
>
> lapw0:g008:24 g021:24 g025:24 g028:24
>
>
> so the final .machines file would be a combination of the above pieces.
> Have I understood this correctly, or am I missing the mark? Also, is
> there any difference between distributing the k-points across four jobs
> (1 for each node) and across 224 jobs (by repeating each of the 1:gxxx
> lines 56 times)?
>
>
> Best regards
>
> Christian
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Ruh,
> Thomas <thomas.ruh at tuwien.ac.at>
> *Sent:* 12 October 2020 09:29:37
> *To:* A Mailing list for WIEN2k users
> *Subject:* Re: [Wien] .machines for several nodes
>
> Hi,
>
>
> your .machines is wrong.
>
>
> The nodes for /lapw1/ are prefaced not with "lapw1:" but only with "1:".
> /lapw2/ needs no line, as it uses the same nodes as lapw1 before it.
>
>
> So an example for your use case would be:
>
>
> #
>
> dstart:g008:4 g021:4 g025:4 g028:4
>
> lapw0:g008:4 g021:4 g025:4 g028:4
>
> 1:g008:4 g021:4 g025:4 g028:4
>
> granularity:1
>
> extrafine:1
>
>
> The line starting with "1:" has to be repeated (with different nodes, of
> course) x times if you want to run x k-points in parallel (you can find
> more details about this in the User Guide, pages 84-91).
>
>
> Regards,
>
> Thomas
>
>
> PS: As a sidenote: Both /dstart/ and /lapw0/ parallelize over atoms, so
> 16 processes might not be the best choice for your example.
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of
> Christian Søndergaard Pedersen <chrsop at dtu.dk>
> *Sent:* Monday, 12 October 2020 09:06
> *To:* wien at zeus.theochem.tuwien.ac.at
> *Subject:* [Wien] .machines for several nodes
>
> Hello everybody
>
>
> I am new to WIEN2k, and am struggling with parallelizing calculations
> on our HPC cluster beyond what can be achieved using OMP. In particular,
> I want to execute run_lapw and/or runsp_lapw on four identical
> nodes (16 cores each), parallelizing over k-points (unless there's a
> more efficient scheme). To achieve this, I tried to mimic the example from
> the User Guide (without the extra Alpha node), but my .machines file
> does not work the way I intended. This is what I have:
>
>
> #
>
> dstart:g008:4 g021:4 g025:4 g028:4
>
> lapw0:g008:4 g021:4 g025:4 g028:4
>
> lapw1:g008:4 g021:4 g025:4 g028:4
>
> lapw2:g008:4 g021:4 g025:4 g028:4
>
> granularity:1
>
> extrafine:1
>
>
> The node names gxxx are read from SLURM_JOB_NODELIST in the submit
> script, and a couple of regular expressions generate the above lines.
> Afterwards, my job script does the following:
>
>
> srun hostname -s > slurm.hosts
> run_lapw -p
>
> which results in a job that idles for the entire walltime and finishes
> with a CPU efficiency of 0.00%. I would appreciate any help in figuring
> out where I've gone wrong.
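>
> For context, one way to build such a .machines file in a SLURM batch
> script (a hypothetical sketch, not my literal script; scontrol expands
> the compressed SLURM_JOB_NODELIST, and the :4 counts are assumptions):
>
> #!/bin/bash
> # sketch: write the .machines file shown above
> # (note: the "lapw1:"/"lapw2:" prefixes turned out to be wrong -
> # see the replies above for the correct "1:" form)
> nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
> line=$(for n in $nodes; do printf '%s:4 ' "$n"; done)
> {
>   echo "#"
>   echo "dstart:$line"
>   echo "lapw0:$line"
>   echo "lapw1:$line"
>   echo "lapw2:$line"
>   echo "granularity:1"
>   echo "extrafine:1"
> } > .machines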
>
>
> Best regards
> Christian
>
>
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------