[Wien] Parallel execution on new Intel CPUs
pluto
pluto at physics.ucdavis.edu
Tue Feb 14 11:32:51 CET 2023
Dear Profs. Blaha, Marks,
Thank you for the information!
Could you estimate what speed-up I might expect if I use MPI
parallelization?
My tests on a slab with 36 inequivalent atoms so far indicate that there
is nearly no difference between the different k-parallel and OMP
settings. So far I have tried
8x 1:localhost with OMP=2
16x 1:localhost with OMP=1
16x 1:localhost with OMP=2 (means slight overloading)
and the time per SCF cycle (runsp without so) is practically the same in
all of them. Later I will also try higher OMP with fewer 1:localhost
lines, but I doubt this can possibly be faster.
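
A rough way to see why higher OMP may not help is Amdahl's law. The sketch below assumes that the "maybe 40% of time on a single thread" estimate from my earlier message (quoted below) is the serial fraction; that number is only a guess from watching the CPU load, and the function name is made up for illustration.

```python
def amdahl_speedup(serial_fraction: float, n_threads: int) -> float:
    """Upper bound on the speed-up from n_threads according to Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


# Assumption: the rough "maybe 40%" single-thread estimate, not a measured value.
SERIAL_FRACTION = 0.4

for n in (1, 2, 4, 8):
    bound = amdahl_speedup(SERIAL_FRACTION, n)
    print(f"OMP={n}: at most {bound:.2f}x faster than OMP=1")
```

Under this assumption the bound never exceeds 1/0.4 = 2.5x no matter how many threads are used, so the cores are probably better spent on more k-parallel jobs.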
I have an i7-13700K with 64 GB of RAM and an NVMe SSD. During the
parallel calculation of the 36-atom slab, around 35 GB of RAM is used.
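
For reference, the three layouts I tried can be written out against the core and thread counts of this CPU. This is only a sketch: the helper names are hypothetical, and the `omp_global:N` line is assumed to take the same colon form as `omp_lapwso:2`.

```python
PHYSICAL_CORES = 16    # 8 P-cores + 8 E-cores on the i7-13700K
HARDWARE_THREADS = 24  # each of the 8 P-cores runs 2 threads


def machines_text(n_kparallel: int, omp_threads: int) -> str:
    """Build .machines content: one OMP line plus one '1:localhost' per k-parallel job."""
    lines = [f"omp_global:{omp_threads}"]  # assumed syntax, by analogy with omp_lapwso:2
    lines += ["1:localhost"] * n_kparallel
    return "\n".join(lines) + "\n"


def total_threads(n_kparallel: int, omp_threads: int) -> int:
    """Number of threads that are busy at the same time."""
    return n_kparallel * omp_threads


for n_k, omp in [(8, 2), (16, 1), (16, 2)]:
    busy = total_threads(n_k, omp)
    if busy > HARDWARE_THREADS:
        note = "overloaded"
    elif busy > PHYSICAL_CORES:
        note = "uses hyperthreads"
    else:
        note = "fits on the physical cores"
    print(f"{n_k} x 1:localhost, OMP={omp}: {busy} threads ({note})")
```

Writing `machines_text(8, 2)` to `.machines` would give the first of the three layouts.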
Best,
Lukasz
PS: omp_lapwso now also works for me in .machines. I think it was a SOC
issue with my test case (which was bulk Au). I am sorry for the
confusion.
On 2023-02-14 10:23, Peter Blaha wrote:
> I have no experience with such a CPU with fast and slow cores.
>
> Simply test how you get the fastest turnaround for a fixed
> number of k-points and different numbers of processes (should be
> compatible with your k-points) and OMP=1-2 (4).
>
> Previously, overloading (using more cores than the physical cores) was
> NOT a good idea, but I don't know how this "fused" CPU behaves. Maybe
> some "small" overloading is ok. This all depends on #-kpoints and
> available cores.
>
> PS:
>
> I cannot verify your omp_lapwso:2 failure. My tests run fine and the
> omp-setting is taken over properly.
>
>> I am now using a machine with an i7-13700K. This CPU has 8 performance
>> cores (P-cores) and 8 efficient cores (E-cores). In addition, each
>> P-core has 2 threads, so there are 24 threads altogether. It is hard
>> to find reasonable information online, but a P-core is probably
>> approx. 2x faster than an E-core:
>> https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/10
>> This will of course depend on what is being calculated...
>>
>> Do you have suggestions on how to optimize the .machines file for the
>> parallel execution of an scf cycle?
>>
>> On my machine using OMP_NUM_THREADS leads to oscillations of the CPU
>> use (for a large slab maybe 40% of time is spent on a single thread),
>> suggesting that large OMP is not the optimal strategy.
>>
>> Some examples of strategies:
>>
>> One strategy would be to repeat the line
>> 1:localhost
>> 24 times, to have all the threads busy, and set OMP_NUM_THREADS=1.
>>
>> Another would be to set the line
>> 1:localhost
>> 8 times and set OMP_NUM_THREADS=2, this would mean using all 16
>> physical cores.
>>
>> Or perhaps one should better "overload" the CPU e.g. by doing
>> 1:localhost 16 times and OMP=2 ?
>>
>> Over time I will try to benchmark some of the different options, but
>> perhaps there is some logic to how one should think about this.
>>
>> In addition, I have a comment on the .machines file. It seems that for
>> FM+SOC (runsp -so) calculations the
>>
>> omp_global
>>
>> setting in .machines is ignored. The
>>
>> omp_lapw1
>> omp_lapw2
>>
>> settings seem to work fine. So, I tried to set OMP for lapwso
>> separately, by including a line like:
>>
>> omp_lapwso:2
>>
>> but this gives an error when executing parallel scf.
>>
>> Best,
>> Lukasz