[Wien] Parallel execution on new Intel CPUs

Peter Blaha peter.blaha at tuwien.ac.at
Tue Feb 14 12:00:59 CET 2023


How many k-points do you have? (And how many cores in total?)

The number of lines (8 or 16) needs to be "compatible" with the number 
of k-points. I have no experience with the memory bus of this CPU or how 
"equally" the load is distributed. You need to check the dayfile and 
see whether e.g. all 16 parallel lapw1 jobs finished at about the same 
time, or whether 8 of them run much longer than the other set.
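
Just as an illustration (a guess for this machine, assuming e.g. 8 
k-points in the IBZ and the standard .machines keywords), something like:

# sketch only: one 1:localhost line per k-point, adjust to your k-mesh
granularity:1
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
omp_global:2

runs one lapw1 job per k-point with 2 OpenMP threads each; whether this 
or 16x1 is faster on the mixed P/E cores only the dayfile timings can 
tell.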

The mpi code can be quite efficient and, for medium-sized cases, of 
similar speed, but for this it is mandatory to install the ELPA library. 
For large cases you usually have only a few k-points, and clearly only 
with mpi can you use many cores/CPUs. For a 36-atom slab I would 
probably not run the regular scf cycle with more than 16 k-points in the 
IBZ (at least if it is insulating), and thus mpi gives a chance to speed 
things up.
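
Just as a sketch of the mpi case (this assumes the usual .machines 
syntax, where the count after the host name is the number of mpi 
processes per k-point job; please check the user's guide for your 
WIEN2k version):

# sketch only: 2 k-point jobs with 8 mpi processes each
1:localhost:8
1:localhost:8

With only a few k-points this is the way to still keep all 16 physical 
cores busy on one large case.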

Again, I do not know what 16 mpi jobs do if 8 cores are fast and 8 are 
slow.


On 14.02.2023 at 11:32, pluto via Wien wrote:
> Dear Profs. Blaha, Marks,
> 
> Thank you for the information!
> 
> Could you give an estimate of the possible speed-up when I use mpi 
> parallelization?
> 
> My tests on a 36-inequivalent-atom slab so far indicate that there is 
> nearly no difference between different k-parallel and OMP settings. So 
> far I have tried
> 
> 8x 1:localhost with OMP=2
> 16x 1:localhost with OMP=1
> 16x 1:localhost with OMP=2 (means slight overloading)
> 
> and the time per SCF cycle (runsp without so) is practically the same in 
> all of these. Later I will also try higher OMP with fewer 1:localhost 
> lines, but I doubt this can possibly be faster.
> 
> I have an i7-13700K with 64 GB of RAM and an NVMe SSD. During the 
> 36-atom-slab parallel calculation around 35 GB is used.
> 
> Best,
> Lukasz
> 
> PS: Now omp_lapwso also works for me in .machines. I think it was a SOC 
> issue with my test case (which was bulk Au). I am sorry for this confusion.
> 
> 
> 
> 
> On 2023-02-14 10:23, Peter Blaha wrote:
>> I have no experience for such a CPU with fast and slow cores.
>>
>> Simply test how you get the fastest turnaround for a fixed number of
>> k-points with different numbers of processes (which should be
>> compatible with your k-points) and OMP=1-2 (4).
>>
>> Previously, overloading (using more processes than there are physical
>> cores) was NOT a good idea, but I don't know how this "fused" CPU
>> behaves. Maybe some "small" overloading is ok. This all depends on the
>> number of k-points and the available cores.
>>
>> PS:
>>
>> I cannot verify your omp_lapwso:2 failure. My tests run fine and the
>> omp setting is applied properly.
>>
>>
>>
>>
>>> I am now using a machine with an i7-13700K. This CPU has 8 performance 
>>> cores (P-cores) and 8 efficient cores (E-cores). In addition, each 
>>> P-core has 2 threads, so there are 24 threads altogether. It is hard 
>>> to find reasonable info online, but probably a P-core is approx. 2x 
>>> faster than an E-core:
>>> https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/10
>>> This will of course depend on what is being calculated...
>>>
>>> Do you have suggestions on how to optimize the .machines file for the 
>>> parallel execution of an scf cycle?
>>>
>>> On my machine, a large OMP_NUM_THREADS leads to oscillations of the CPU 
>>> usage (for a large slab maybe 40% of the time is spent on a single 
>>> thread), suggesting that large OMP is not the optimal strategy.
>>>
>>> Some examples of strategies:
>>>
>>> One strategy would be to repeat the line
>>> 1:localhost
>>> 24 times, to have all the threads busy, and set OMP_NUM_THREADS=1.
>>>
>>> Another would be to set the line
>>> 1:localhost
>>> 8 times and set OMP_NUM_THREADS=2; this would mean using all 16 
>>> physical cores.
>>>
>>> Or perhaps one should rather "overload" the CPU, e.g. by repeating 
>>> 1:localhost 16 times with OMP=2?
>>>
>>> Over time I will try to benchmark some of the different options, but 
>>> perhaps there is some logic to how one should think about this.
>>>
>>> In addition, I have a comment on the .machines file. It seems that for 
>>> FM+SOC (runsp -so) calculations the
>>>
>>> omp_global
>>>
>>> setting in .machines is ignored. The
>>>
>>> omp_lapw1
>>> omp_lapw2
>>>
>>> settings seem to work fine. So I tried to set OMP for lapwso 
>>> separately by including a line like:
>>>
>>> omp_lapwso:2
>>>
>>> but this gives an error when executing parallel scf.
>>>
>>> Best,
>>> Lukasz

-- 
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300
Email: peter.blaha at tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------

