[Wien] .machines for several nodes

Christian Søndergaard Pedersen chrsop at dtu.dk
Mon Oct 12 10:24:11 CEST 2020


Thanks a lot for your answer. After re-reading the relevant pages in the User Guide, I am still left with some questions. Specifically, I am working with a system containing 96 atoms (as described in the case.struct file) and 224 inequivalent k points; that is, 500 k points distributed as a 7x8x8 grid (448 in total) and reduced to 224 k points. Running on 4 nodes with 16 cores each, I want each of the 4 nodes to calculate 56 k points (224/4 = 56). Meanwhile, each node should handle 24 atoms (96/4 = 24).
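
The arithmetic above can be checked with a short sketch. The function name `split_evenly` is hypothetical and not part of WIEN2k; it only illustrates how the 224 k points and 96 atoms divide over 4 nodes, and how a remainder would be spread if the counts did not divide evenly.

```python
def split_evenly(total, parts):
    """Split `total` items into `parts` shares that differ by at most one."""
    base, remainder = divmod(total, parts)
    # The first `remainder` shares receive one extra item.
    return [base + 1 if i < remainder else base for i in range(parts)]


grid_points = 7 * 8 * 8                  # full 7x8x8 mesh
kpoints_per_node = split_evenly(224, 4)  # inequivalent k points over 4 nodes
atoms_per_node = split_evenly(96, 4)     # atoms over 4 nodes
print(grid_points, kpoints_per_node, atoms_per_node)
```

This only describes the intended shares; how WIEN2k actually assigns k points and atoms to processes is determined by the .machines file and the parallel scripts.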


Part of my confusion stems from your suggestion that I repeat the line "1:g008:4 [...]" a number of times equal to the number of k points I want to run in parallel, and that each repetition should refer to a different node. The reason is that the line in question already contains the names of all four nodes that were assigned to the job. However, combining your advice with the example on page 86, the lines should read:


1:g008
1:g021
1:g025
1:g028 # k points distributed over 4 jobs, running on 1 node each
extrafine:1
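
As a rough sketch, lines of this form could be generated from a list of node names. The helper `kpoint_lines` and its `repeats` parameter are hypothetical names introduced for illustration; the `1:<host>` format is copied from the lines above, and whether repeating each line is useful is exactly the open question in this message.

```python
def kpoint_lines(nodes, repeats=1):
    """Return one '1:<host>' line per node, optionally repeated."""
    lines = []
    for _ in range(repeats):
        for node in nodes:
            lines.append(f"1:{node}")
    return lines


nodes = ["g008", "g021", "g025", "g028"]
# One k-point job per node, followed by the extrafine switch.
print("\n".join(kpoint_lines(nodes) + ["extrafine:1"]))
```

With `repeats=56` the same helper would emit 224 lines, one per inequivalent k point.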


As for the parallelization over atoms for dstart and lapw0, I understand that the numbers assigned to each individual node should sum up to the number of atoms in the system, like this:


dstart:g008:24 g021:24 g025:24 g028:24
lapw0:g008:24 g021:24 g025:24 g028:24


so the final .machines-file would be a combination of the above pieces. Have I understood this correctly, or am I missing the mark? Also, is there any difference between distributing the k points across four jobs (1 for each node), and across 224 jobs (by repeating each of the 1:gxxx lines 56 times)?


Best regards

Christian

________________________________
From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Ruh, Thomas <thomas.ruh at tuwien.ac.at>
Sent: 12 October 2020 09:29:37
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] .machines for several nodes


Hi,


your .machines is wrong.


The nodes for lapw1 are prefixed not with "lapw1:" but only with "1:". lapw2 needs no line of its own, as it uses the same nodes as lapw1.


So an example for your use case would be:


#
dstart:g008:4 g021:4 g025:4 g028:4
lapw0:g008:4 g021:4 g025:4 g028:4
1:g008:4 g021:4 g025:4 g028:4
granularity:1
extrafine:1


The line starting with "1:" has to be repeated (with different nodes, of course) x times if you want to run x k points in parallel (you can find more details about this in the User's Guide, pages 84-91).
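
One possible reading of this advice, sketched in Python: write one "1:" line per node, so that four k points run in parallel, each using the cores listed for its node. The function `build_machines` is a hypothetical helper, and the assumption that each "1:" line names a single node with its core count (`1:<host>:<cores>`) is an interpretation of the reply, not something confirmed in this thread.

```python
def build_machines(nodes, cores_per_node):
    """Assemble .machines text with one k-point line per node (assumed layout)."""
    host_list = " ".join(f"{node}:{cores_per_node}" for node in nodes)
    lines = ["#", f"dstart:{host_list}", f"lapw0:{host_list}"]
    # One "1:" line per node, i.e. len(nodes) k points in parallel.
    lines += [f"1:{node}:{cores_per_node}" for node in nodes]
    lines += ["granularity:1", "extrafine:1"]
    return "\n".join(lines) + "\n"


machines_text = build_machines(["g008", "g021", "g025", "g028"], 4)
print(machines_text, end="")
```

The text could then be written to the .machines file in the case directory before calling run_lapw -p.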


Regards,

Thomas


PS: As a side note, both dstart and lapw0 parallelize over atoms, so 16 nodes might not be the best choice for your example.

________________________________
From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Christian Søndergaard Pedersen <chrsop at dtu.dk>
Sent: Monday, 12 October 2020 09:06
To: wien at zeus.theochem.tuwien.ac.at
Subject: [Wien] .machines for several nodes


Hello everybody


I am new to WIEN2k, and am struggling with parallelizing calculations on our HPC cluster beyond what can be achieved using OMP. In particular, I want to execute run_lapw and/or runsp_lapw on four identical nodes (16 cores each), parallelizing over k points (unless there's a more efficient scheme). To achieve this, I try to mimic the example from the User Guide (without the extra Alpha node), but my .machines file does not work the way I intended. This is what I have:


#
dstart:g008:4 g021:4 g025:4 g028:4
lapw0:g008:4 g021:4 g025:4 g028:4
lapw1:g008:4 g021:4 g025:4 g028:4
lapw2:g008:4 g021:4 g025:4 g028:4
granularity:1
extrafine:1


The node names gxxx are read from SLURM_JOB_NODELIST in the submit script, and a couple of regular expressions generate the above lines. Afterwards, my job script does the following:


srun hostname -s > slurm.hosts
run_lapw -p

which results in a job that idles for the entire walltime and finishes with a CPU efficiency of 0.00%. I would appreciate any help in figuring out where I've gone wrong.
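
For illustration, the step of turning SLURM_JOB_NODELIST into individual node names could look like the sketch below. On a real cluster `scontrol show hostnames` is the usual tool for this; the function `expand_nodelist` is a hypothetical stand-in that only handles simple forms such as `g[008,021,025,028]`, `g[008-010]`, or plain comma-separated names, and the example value is assumed rather than taken from the actual job.

```python
import re


def expand_nodelist(nodelist):
    """Expand a simple Slurm-style node list into individual host names."""
    match = re.fullmatch(r"([^\[\],]+)\[([^\]]+)\]", nodelist)
    if match is None:
        # No bracket group: treat as a plain comma-separated list of names.
        return [name for name in nodelist.split(",") if name]
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # keep zero padding, e.g. 008
            for i in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{i:0{width}d}")
        else:
            hosts.append(prefix + part)
    return hosts


# Assumed example value; inside a job it would come from the environment.
print(expand_nodelist("g[008,021,025,028]"))
```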


Best regards
Christian