[Wien] .machines for several nodes

Peter Blaha pblaha at theochem.tuwien.ac.at
Thu Oct 15 10:35:36 CEST 2020


Well, 99% CPU efficiency does not mean that you are running efficiently; 
my estimate is that you are running at least 2 times slower than what is 
possible.

Anyway, please save the dayfile and compare the wall times of the 
different parts against those from a different setup.

At least now we know that you have 24 cores/node. So the lapw0/dstart 
lines are perfectly ok.

However, you run lapw1 on only 3 MPI cores per job. This is "maximally 
inefficient": it gives a division of your matrix into a 3x1 grid, but the 
decomposition should be as close to square as possible, so 4x4=16 or 
8x8=64 cores would be optimal. With your 24 cores and 96 atoms/cell I'd 
probably go for 12 cores in MPI and 2 k-parallel jobs per node:

1:x073:12
1:x082:12
1:x073:12
1:x082:12
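
For completeness, the full .machines file could then look like this 
(keeping the dstart/lapw0 lines from your current file; a sketch to 
adapt, not a prescription):

dstart:x073:24 x082:24
lapw0:x073:24 x082:24
1:x073:12
1:x082:12
1:x073:12
1:x082:12

With 12 MPI cores per job the matrix can be split into something like a 
4x3 grid, which is much closer to square than 3x1.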

Maybe one can even overload the nodes a bit by using 16 instead of 12 
cores, but this could be dangerous on some machines because your admins 
might have enforced cpu-binding, ... (You can even change the .machines 
file (12 --> 16) "by hand" while your job is running, and maybe change 
it back once you have seen whether the timing is better or worse.)
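
For that overload test, the only hand edit would be the core count in 
the lapw1 lines, e.g.:

1:x073:16
1:x082:16
1:x073:16
1:x082:16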

In any case, compare the timings in the dayfile in order to find the 
optimal setup.
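
A quick way to compare the wall times is to pull the per-program lines 
out of the dayfile; assuming the usual dayfile layout, something like:

grep -e lapw0 -e lapw1 -e lapw2 case.dayfile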

On 15.10.2020 at 07:29, Christian Søndergaard Pedersen wrote:
> Dear Professor Blaha
> 
> 
> Thanks a lot for your responses. I have performed some additional 
> testing, which has been delayed because I cannot run lapw0/1/2 from the 
> command-line due to memory issues. Hence, I have had to go through the 
> queue for each test. On top of that, I have been unable to get 
> information about our installation. However, I finally achieved ~99% CPU 
> efficiency with the following setup:
> 
> 
> CPUs: 2 nodes with 24 cores each (x073 and x082)
> 
> 
> .machines:
> 
> dstart:x073:24 x082:24
> lapw0:x073:24 x082:24
> 1:x073:3
> 1:x082:3
> 1:x073:3
> 1:x082:3
> 1:x073:3
> 1:x082:3
> 1:x073:3
> 1:x082:3  #  16 lines total; 8 for each node
> 1:x073:3
> 1:x082:3
> 1:x073:3
> 1:x082:3
> 1:x073:3
> 1:x082:3
> 1:x073:3
> 1:x082:3
> 
> 
> After creating the .machines file I call 'mpirun run_lapw -p'. The above 
> .machines file is basically a combination of the two examples found on 
> page 86 of the User's Guide (without using OMP, of course). From 
> checking the case.klist_1-16 files, I have verified that each individual 
> job works on a different subset of the k-points. Can anyone confirm 
> whether this setup is correct, i.e. whether it is a proper way to 
> parallelize the lapw1/lapw2 cycles, assuming the compilations of 
> lapw0/1/2_mpi proceeded without errors (which seems to be the case)?
> 
> 
> Best regards
> 
> Christian
> 
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Peter 
> Blaha <pblaha at theochem.tuwien.ac.at>
> *Sent:* 13 October 2020 07:43:16
> *To:* wien at zeus.theochem.tuwien.ac.at
> *Subject:* Re: [Wien] .machines for several nodes
> To run a single program for testing, do:
> 
> x lapw0 -p
> 
> (after creation of .machines.)
> 
> Then check all error files, but in particular also the slurm output 
> (whatever it is called on your machines). It probably gives messages 
> like "library xxxx not found" or similar, which are needed for further 
> debugging.
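> 
> A convenient way to inspect all of them at once (as a rule, non-empty 
> *.error files indicate a problem):
> 
> ls -l *.error
> cat *.error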
> 
> AND:
> 
> We still don't know how many cores your nodes have.
> 
> We still don't know your compiler options (WIEN2k_OPTIONS, 
> parallel_options), nor whether the compilation of e.g. lapw0_mpi worked 
> at all (compile.msg in SRC_lapw0).
> 
> On 12.10.2020 at 22:17, Christian Søndergaard Pedersen wrote:
>> Dear everybody
>> 
>> 
>> I am following up on this thread to report on two separate errors in my 
>> attempts to properly parallelize a calculation. In the first case, a 
>> calculation utilized 0.00% of the available CPU resources. My .machines 
>> file looks like this:
>> 
>> 
>> #
>> dstart:g004:8 g010:8 g011:8 g040:8
>> lapw0:g004:8 g010:8 g011:8 g040:8
>> 1:g004:16
>> 1:g010:16
>> 1:g011:16
>> 1:g040:16
>> 
>> With my submit script calling the following commands:
>> 
>> 
>> srun hostname -s > slurm.hosts
>> 
>> run_lapw -p
>> 
>> x qtl -p -telnes
>> 
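>> For reference, these commands sit inside a Slurm batch script roughly 
>> like the following (the #SBATCH resource lines are placeholders, not 
>> our actual settings):
>> 
>> #!/bin/bash
>> #SBATCH --nodes=4
>> #SBATCH --ntasks-per-node=16
>> 
>> srun hostname -s > slurm.hosts
>> run_lapw -p
>> x qtl -p -telnes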
>> 
>> Of course, the job didn't reach x qtl. The resultant case.dayfile is 
>> short, so I am dumping all of it here:
>> 
>> 
>> Calculating test-machines in /path/to/directory
>> on node.host.name.dtu.dk with PID XXXXX
>> using WIEN2k_19.1 (Release 25/6/2019) in 
>> /path/to/installation/directory/WIEN2k/19.1-intel-2019a
>> 
>> 
>>      start       (Mon Oct 12 19:04:06 CEST 2020) with lapw0 (40/99 to go)
>> 
>>      cycle 1     (Mon Oct 12 19:04:06 CEST 2020)         (40/99 to go)
>> 
>>>   lapw0   -p  (19:04:06) starting parallel lapw0 at Mon Oct 12 19:04:06 CEST 2020
>> -------- .machine0 : 32 processors
>> [1] 16095
>> 
>> 
>> The .machine0 file displays the lines
>> 
>> g004 [repeated for 8 lines]
>> g010 [repeated for 8 lines]
>> g011 [repeated for 8 lines]
>> g040 [repeated for 8 lines]
>> 
>> which tells me that the .machines file works as intended, and that the 
>> cause of the problem lies somewhere else. This brings me to the second 
>> error, which occurred when I tried calling mpirun explicitly, like so:
>> 
>> srun hostname -s > slurm.hosts
>> mpirun run_lapw -p
>> mpirun qtl -p -telnes
>> 
>> from within the job script. This crashed the job right away. The 
>> lapw0.error file prints out "Error in Parallel lapw0" and "check ERROR 
>> FILES!" a number of times. The case.clmsum file is present and looks 
>> correct, and the .machines file looks like the one from before (with 
>> different node numbers). However, the .machine0 file now looks like:
>> 
>> g094
>> g094
>> g094
>> g081
>> g081
>> g08g094
>> g094
>> g094
>> g094
>> g094
>> [...]
>> 
>> That is, there is an error on line 6, where a node is not properly named 
>> and a line break is missing. The dayfile repeatedly prints out "> stop 
>> error", sixteen times in total. I don't know whether the above .machine0 
>> file is the culprit, but that seems like the obvious conclusion. Any help 
>> in this matter will be much appreciated.
>> 
>> Best regards
>> Christian
>> 
> 

-- 
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/tc_blaha
--------------------------------------------------------------------------


