[Wien] .machines for several nodes
Peter Blaha
pblaha at theochem.tuwien.ac.at
Mon Oct 12 11:58:22 CEST 2020
Yes, this is ok when you have nodes with 16 cores !!!
(Only the lapw0 line could use :16 instead of :8 if you have 96 atoms,
but most likely this is fairly negligible.)
Yes, the QTL calculation in lapw2 is also affected by the
parallelization, but it reads from a .processes file, which is created
by lapw1.
If you run x lapw2 -p -qtl in an extra job, you should add the
following line to create a "correct" .processes file:
x lapw1 -p -d >&/dev/null    # Create .processes (necessary for standalone lapw2)
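
For example, a standalone QTL job could look like this (a minimal
sketch; it assumes the SCF run has finished and that the same .machines
file is still present in the case directory):

x lapw1 -p -d >&/dev/null    # rebuild .processes without running lapw1
x lapw2 -p -qtl              # k-point-parallel QTL calculation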
On 10/12/20 11:45 AM, Christian Søndergaard Pedersen wrote:
> This went a long way towards clearing up my confusion, thanks again. I
> will try starting an MPI-parallel calculation on 4 nodes with 16 cores
> each, using the following .machines file:
>
> 1:g008:16
> 1:g021:16
> 1:g025:16
> 1:g028:16
> lapw0: g008:8 g021:8 g025:8 g028:8
>
> dstart: g008:8 g021:8 g025:8 g028:8
>
>
> ... and see how it performs. If the matrix sizes are small, I understand
> that I could also have each node work on 2 (or more) k-points at the
> same time, by specifying:
>
>
> 1:g008:8
> 1:g008:8
> 1:g021:8
> 1:g021:8
> 1:g025:8
> 1:g025:8
> 1:g028:8
> 1:g028:8
>
> so that for instance g008 will work on 2 k-points using 8 cores for each
> k-point, am I right? And a (hopefully) final question: since qtl
> according to the manual runs k-point parallel, is it also affected by
> the parallelization scheme specified for lapw1 and lapw2 (unless I
> deliberately change it)?
>
>
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Ruh,
> Thomas <thomas.ruh at tuwien.ac.at>
> *Sent:* 12 October 2020 10:59:09
> *To:* A Mailing list for WIEN2k users
> *Subject:* Re: [Wien] .machines for several nodes
>
> I am afraid there is still some confusion.
>
>
> First about /lapw1/:
>
> Sorry for my unclear statement - I meant that you need one line per
> k-parallel job, in the sense that #lines k-points are run simultaneously,
> i.e. if you specify this part of the .machines file like this:
>
>
> 1:g008:16
>
> 1:g021:16
>
> 1:g025:16
>
> 1:g028:16
>
>
> your k-point list will be split into 4 parts of 56 k-points each [1],
> which will be processed step by step. Node g008 will work on its first
> k-point, while node g021 will do the same for its first k-point, and so on.
>
> You need the ":16" after the name of the node. Otherwise, on every node
> only *one* core would be used. Whether it is useful to use 16 mpi-parallel
> jobs per k-point (meaning that the matrices will be distributed over 16
> cores, with each core getting only 1/16 of the matrix elements) depends
> on your matrix sizes (which in turn depend on your rkmax). You should
> check that by grepping for :rkm in your case.scf file. If the matrix size
> there is small, using OMP_NUM_THREADS 16 might be much faster (since MPI
> adds overhead to your calculation).
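>
> For example (a sketch - adjust the case name; the :RKM line is whatever
> lapw1 writes to your scf file):
>
> grep -i :rkm case.scf | tail -1     # matrix size of the latest cycle
> setenv OMP_NUM_THREADS 16           # csh; in bash: export OMP_NUM_THREADS=16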
>
>
>
> Regarding /lapw0/dstart/:
>
> The way you set the calculation up could lead to (possibly severe)
> overloading of your nodes: WIEN2k will start 24 jobs on each node (so
> 1.5 times the number of cores) at the same time, each doing the
> calculation for 1 atom.
>
> As one possible alternative, you could specify only 8 cores per node
> (for example "lapw0: g008:8" and so on), i.e. 8 jobs per node, which
> would lead to step-by-step calculations of 3 atoms per core.
>
> Which option is faster is hard to tell and depends a lot on your hardware.
>
>
> So what you could do - in principle - is to test multiple configurations
> (you can modify your .machines file on the fly during an SCF run) in the
> first cycles, compare the times (in case.dayfile), and use the fastest
> one for the rest of the run.
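>
> For example (a sketch; the dayfile layout may differ slightly between
> WIEN2k versions):
>
> grep -e lapw0 -e lapw1 -e lapw2 case.dayfile   # per-step timestamps of each cycle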
>
>
>
> Regards,
> Thomas
>
>
> [1] Sidenote: This splitting is controlled by the first number - in this
> case 4 equal sublists will be set up - and you could also specify different
> "weights", for instance if your nodes are of different speeds. The
> .machines file could then read, for example:
>
>
> 3:g008:16
>
> 2:g021:16
>
> 2:g025:16
>
> 1:g028:16
>
>
> In this case, the first node would "get" 3/8 of the k-points (84), nodes
> g021 and g025 would get 2/8 each (56), and the last one (because it is
> very slow) would get only 28 k-points.
>
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of
> Christian Søndergaard Pedersen <chrsop at dtu.dk>
> *Sent:* Monday, 12 October 2020 10:24
> *To:* A Mailing list for WIEN2k users
> *Subject:* Re: [Wien] .machines for several nodes
>
> Thanks a lot for your answer. After re-reading the relevant pages in the
> User Guide, I am still left with some questions. Specifically, I am
> working with a system containing 96 atoms (as described in the
> case.struct file) and 224 inequivalent k-points; i.e. I requested 500
> k-points, which were distributed as a 7x8x8 grid (448 total) and reduced
> to 224 by symmetry. Running on 4 nodes each with 16 cores, I want each
> of the 4 nodes to calculate 56 k-points (224/4 = 56). Meanwhile, each
> node should handle 24 atoms (96/4 = 24).
>
>
> Part of my confusion stems from your suggestion that I repeat the line
> "1:g008:4 [...]" a number of times equal to the number of k-points I
> want to run in parallel, and that each repetition should refer to a
> different node. The reason is that the line in question already contains
> the names of all four nodes that were assigned to the job. However,
> combining your advice with the example on page 86, the lines should read:
>
>
> 1:g008
>
> 1:g021
>
> 1:g025
>
> 1:g028 # k-points distributed over 4 jobs, running on 1 node each
>
> extrafine:1
>
>
> As for the parallelization over atoms for dstart and lapw0, I
> understand that the numbers assigned to each individual node should sum
> up to the number of atoms in the system, like this:
>
>
> dstart:g008:24 g021:24 g025:24 g028:24
>
> lapw0:g008:24 g021:24 g025:24 g028:24
>
>
> so the final .machines file would be a combination of the above pieces.
> Have I understood this correctly, or am I missing the mark? Also, is
> there any difference between distributing the k-points across four jobs
> (1 for each node) and across 224 jobs (by repeating each of the 1:gxxx
> lines 56 times)?
>
>
> Best regards
>
> Christian
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Ruh,
> Thomas <thomas.ruh at tuwien.ac.at>
> *Sent:* 12 October 2020 09:29:37
> *To:* A Mailing list for WIEN2k users
> *Subject:* Re: [Wien] .machines for several nodes
>
> Hi,
>
>
> your .machines is wrong.
>
>
> The nodes for /lapw1/ are prefaced not with "lapw1:" but only with "1:".
> /lapw2/ needs no line, as it uses the same nodes as lapw1 before it.
>
>
> So an example for your use case would be:
>
>
> #
>
> dstart:g008:4 g021:4 g025:4 g028:4
>
> lapw0:g008:4 g021:4 g025:4 g028:4
>
> 1:g008:4 g021:4 g025:4 g028:4
>
> granularity:1
>
> extrafine:1
>
>
> The line starting with "1:" has to be repeated (with different nodes, of
> course) x times if you want to run x k-points in parallel (you can find
> more details about this in the User Guide, pages 84-91).
>
>
> Regards,
>
> Thomas
>
>
> PS: As a sidenote: Both /dstart/ and /lapw0/ parallelize over atoms, so
> 16 processes might not be the best choice for your example.
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of
> Christian Søndergaard Pedersen <chrsop at dtu.dk>
> *Sent:* Monday, 12 October 2020 09:06
> *To:* wien at zeus.theochem.tuwien.ac.at
> *Subject:* [Wien] .machines for several nodes
>
> Hello everybody
>
>
> I am new to WIEN2k, and am struggling with parallelizing calculations
> on our HPC cluster beyond what can be achieved using OMP. In particular,
> I want to execute run_lapw and/or runsp_lapw on four identical
> nodes (16 cores each), parallelizing over k-points (unless there's a
> more efficient scheme). To achieve this, I tried to mimic the example from
> the User Guide (without the extra Alpha node), but my .machines file
> does not work the way I intended. This is what I have:
>
>
> #
>
> dstart:g008:4 g021:4 g025:4 g028:4
>
> lapw0:g008:4 g021:4 g025:4 g028:4
>
> lapw1:g008:4 g021:4 g025:4 g028:4
>
> lapw2:g008:4 g021:4 g025:4 g028:4
>
> granularity:1
>
> extrafine:1
>
>
> The node names gxxx are read from SLURM_JOB_NODELIST in the submit
> script, and a couple of regular expressions generate the above lines.
> Afterwards, my job script does the following:
>
>
> srun hostname -s > slurm.hosts
> run_lapw -p
>
> which results in a job that idles for the entire walltime and finishes
> with a CPU efficiency of 0.00%. I would appreciate any help in figuring
> out where I've gone wrong.
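>
> For context, one way to build such a .machines file in a SLURM batch
> script (a hypothetical sketch, not my literal script; scontrol expands
> the compressed SLURM_JOB_NODELIST, and the :4 counts are assumptions):
>
> #!/bin/bash
> # sketch: write the .machines file shown above
> # (note: the "lapw1:"/"lapw2:" prefixes turned out to be wrong -
> # see the replies above for the correct "1:" form)
> nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
> line=$(for n in $nodes; do printf '%s:4 ' "$n"; done)
> {
>   echo "#"
>   echo "dstart:$line"
>   echo "lapw0:$line"
>   echo "lapw1:$line"
>   echo "lapw2:$line"
>   echo "granularity:1"
>   echo "extrafine:1"
> } > .machines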
>
>
> Best regards
> Christian
>
>
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------