[Wien] Parallel execution on new Intel CPUs

Tue Feb 14 11:32:51 CET 2023

Dear Profs. Blaha, Marks,

Thank you for the information!

Could you give an estimate of the speed-up I might expect from 
MPI parallelization?

My tests on a slab with 36 inequivalent atoms so far indicate that there 
is nearly no difference between the different k-parallel and OMP 
settings. So far I have tried

8x 1:localhost with OMP=2
16x 1:localhost with OMP=1
16x 1:localhost with OMP=2 (means slight overloading)

and the time per SCF cycle (runsp without so) is practically the same in 
all of them. Later I will also try higher OMP with fewer 1:localhost 
lines, but I doubt this can be faster.
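
A rough Amdahl's-law estimate may illustrate why a higher OMP value per process is unlikely to help much. This is only a sketch under a loud assumption: it takes the "maybe 40% of time on a single thread" figure from the earlier message quoted below as the serial fraction of the one-thread run time, which has not been measured. The function name amdahl_speedup is made up for this illustration.

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Ideal speed-up of one process if only the non-serial part scales."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


# ASSUMPTION: 0.4 is the unverified "maybe 40%" single-thread share,
# treated here as the serial fraction of the one-thread run time.
SERIAL_FRACTION = 0.4

for omp in (1, 2, 4, 8):
    gain = amdahl_speedup(SERIAL_FRACTION, omp)
    print(f"OMP={omp}: at most {gain:.2f}x per process")
```

Under that assumption the gain per process can never exceed 1/0.4 = 2.5x, however many threads are used, whereas k-point parallelization does not share this particular limit as long as there are enough k-points.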

I have an i7-13700K with 64 GB of RAM and an NVMe SSD. During the 
parallel 36-atom-slab calculation around 35 GB of RAM is used.
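
For reference, the three layouts listed above can be written out and compared with the 16 physical cores and 24 hardware threads of this CPU. The following is a minimal sketch, not a tested recipe: the helper names machines_text and total_threads are hypothetical, and the omp_lapw1:N / omp_lapw2:N line form is assumed by analogy with the omp_lapwso:2 line in the earlier message. Other lines a real .machines file may need are left out.

```python
PHYSICAL_CORES = 16    # 8 P-cores + 8 E-cores (i7-13700K)
HARDWARE_THREADS = 24  # each P-core runs 2 threads


def machines_text(n_kparallel, omp_threads, host="localhost",
                  programs=("lapw1", "lapw2")):
    """Build the text of a k-parallel .machines file (assumed syntax)."""
    # ASSUMPTION: per-program OMP lines use the form omp_<program>:<N>.
    lines = [f"omp_{prog}:{omp_threads}" for prog in programs]
    # One "1:<host>" line per k-parallel job.
    lines += [f"1:{host}"] * n_kparallel
    return "\n".join(lines) + "\n"


def total_threads(n_kparallel, omp_threads):
    """Number of threads that are busy at the same time."""
    return n_kparallel * omp_threads


for n_k, omp in ((8, 2), (16, 1), (16, 2)):
    busy = total_threads(n_k, omp)
    if busy <= PHYSICAL_CORES:
        status = "within the 16 physical cores"
    elif busy <= HARDWARE_THREADS:
        status = "within the 24 hardware threads"
    else:
        status = "more than 24 hardware threads (overloading)"
    print(f"{n_k} x 1:localhost, OMP={omp}: {busy} threads, {status}")
```

The first two layouts both keep 16 threads busy, which may be part of the reason they take practically the same time; only the third asks for more threads (32) than the CPU provides.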

Best,
Lukasz

PS: Now omp_lapwso also works for me in .machines. I think it was a SOC 
issue with my test case (which was bulk Au). I am sorry for the 
confusion.

On 2023-02-14 10:23, Peter Blaha wrote:
> I have no experience with such a CPU with fast and slow cores.
> 
> Simply test how you get the fastest turnaround for a fixed
> number of k-points and different numbers of processes (should be
> compatible with your k-points) and OMP=1-2 (4).
> 
> Previously, overloading (using more cores than the physical cores) was
> NOT a good idea, but I don't know how this "fused" CPU behaves. Maybe
> some "small" overloading is ok. This all depends on #-kpoints and
> available cores.
> 
> PS:
> 
> I cannot verify your omp_lapwso:2 failure. My tests run fine and the
> omp-setting is taken over properly.
> 
> 
> 
> 
>> I am now using a machine with an i7-13700K. This CPU has 8 performance 
>> cores (P-cores) and 8 efficient cores (E-cores). In addition, each 
>> P-core has 2 threads, so there are 24 threads altogether. It is hard 
>> to find reliable information online, but a P-core is probably approx. 
>> 2x faster than an E-core:
>> https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/10 
>> This will of course depend on what is being calculated...
>> 
>> Do you have suggestions on how to optimize the .machines file for the 
>> parallel execution of an scf cycle?
>> 
>> On my machine, using OMP_NUM_THREADS leads to oscillations in CPU 
>> usage (for a large slab maybe 40% of the time is spent on a single 
>> thread), suggesting that a large OMP value is not the optimal strategy.
>> 
>> Some examples of strategies:
>> 
>> One strategy would be to repeat the line
>> 1:localhost
>> 24 times, to have all the threads busy, and set OMP_NUM_THREADS=1.
>> 
>> Another would be to repeat the line
>> 1:localhost
>> 8 times and set OMP_NUM_THREADS=2; this would mean using all 16 
>> physical cores.
>> 
>> Or perhaps one should rather "overload" the CPU, e.g. by repeating 
>> 1:localhost 16 times with OMP=2?
>> 
>> Over time I will try to benchmark some of the different options, but 
>> perhaps there is some logic to how one should think about this.
>> 
>> In addition, I have a comment on the .machines file. It seems that for 
>> FM+SOC (runsp -so) calculations the
>> 
>> omp_global
>> 
>> setting in .machines is ignored. The
>> 
>> omp_lapw1
>> omp_lapw2
>> 
>> settings seem to work fine. So I tried to set OMP for lapwso 
>> separately, by including a line like:
>> 
>> omp_lapwso:2
>> 
>> but this gives an error when executing the parallel scf.
>> 
>> Best,
>> Lukasz