[Wien] benchmark test with i9-12900k

Pavel Ondračka pavel.ondracka at email.cz
Tue Jan 31 15:16:40 CET 2023


Hi Sandeep, 
> 
> I have a query regarding this.
> While performing serial or parallel calculations, increasing omp
> from 1 to 8 does not scale the CPU usage proportionally
> (omp=2: 170 to 180%, omp=4: 300 to 330%, omp=8: only 500 to 550%).
> Is something wrong in configuring or compiling the software, or is
> this due to some limitation in the hardware?
> Any suggestions?

There are several factors: one is the threading support in the
BLAS/LAPACK libraries, and another is the deficiencies of the Wien2k
OpenMP parallelization. Hardware also comes into play, mostly in the
general sense that the lower your memory bandwidth, the earlier you
will see the speedup flatten as you add threads.
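
As a quick illustration of the bandwidth point, here is a minimal
standalone C sketch (nothing from Wien2k, the array size is arbitrary):
a streaming triad-style loop like this one typically stops scaling
after a few threads because all cores share the same memory bus.

/* Bandwidth-bound toy kernel; compile e.g. gcc -O2 -fopenmp triad.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 20000000

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    for (int t = 1; t <= 8; t *= 2) {
        omp_set_num_threads(t);
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + 3.0 * c[i];   /* STREAM-triad-like access pattern */
        printf("threads=%d  time=%.3f s\n", t, omp_get_wtime() - t0);
    }
    printf("check: %f\n", a[N / 2]);
    free(a); free(b); free(c);
    return 0;
}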

If you look at the lapw1 output you can see how the total time is
mostly divided into 3 parts, for example:

  TIME HAMILT (CPU)  =     2.8, HNS =     2.9, HORB =     0.0, DIAG =    17.3, SYNC =     0.0
  TIME HAMILT (WALL) =     0.7, HNS =     0.8, HORB =     0.0, DIAG =     4.7, SYNC =     0.0

Scaling of the DIAG part is mostly determined by how well your
BLAS/LAPACK libraries scale (MKL does quite OK, but don't expect
miracles).
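
If you want to check what your library alone can do, you can time a
bare dense eigensolve outside of Wien2k and rerun it with different
OMP_NUM_THREADS (or MKL_NUM_THREADS) settings. A minimal sketch,
assuming a LAPACKE installation (the matrix size is made up):

/* Time a dense symmetric eigensolve, roughly what DIAG spends time on.
 * Compile e.g. gcc -O2 -fopenmp diag.c -llapacke -llapack -lblas
 * (link against MKL instead if that is what Wien2k uses). */
#include <lapacke.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 2000

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    double *w = malloc(N * sizeof *w);
    /* Fill a symmetric matrix with something non-trivial. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j <= i; ++j)
            a[i * N + j] = a[j * N + i] = 1.0 / (1.0 + i + j);

    double t0 = omp_get_wtime();
    lapack_int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', N, a, N, w);
    printf("dsyev on %dx%d: %.2f s (info=%d)\n",
           N, N, omp_get_wtime() - t0, (int)info);

    free(a);
    free(w);
    return 0;
}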

HAMILT scaling is based on the explicit Wien2k parallelization. That
one also doesn't scale too well past 4-6 cores. The reason is that I
was mostly learning OpenMP when I wrote it and I just went for the
simplest "omp parallel for" solution, probably at too high a level
(also because ifort's support for newer OpenMP versions with more
advanced constructs was not so good at the time). I think there could
still be some speedup if this were rewritten so that the
parallelization happens at a different level, maybe more similarly to
how it is parallelized with MPI, so the working set fits better in the
caches and could thus overcome the memory bandwidth limits better when
scaling to more cores.
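
To show what I mean by "too high level", here is a toy C sketch (not
the actual HAMILT code; N, BS and element() are made up): the first
variant is the simple row-level "omp parallel for", the second
distributes cache-sized tiles instead, which is roughly the direction
such a rewrite could take.

/* Toy comparison of high-level vs. cache-blocked OpenMP setup of a
 * triangular matrix. Compile e.g. gcc -O2 -fopenmp hamilt.c */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 3000
#define BS 64   /* tile size, chosen so a tile stays cache-resident */

/* Hypothetical stand-in for the per-element setup work. */
static double element(int i, int j) { return 1.0 / (1.0 + i + j); }

static void setup_high_level(double *h) {
    /* Simplest solution: one "omp parallel for" over the rows. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j <= i; ++j)
            h[(long)i * N + j] = element(i, j);
}

static void setup_blocked(double *h) {
    /* Lower-level alternative: distribute small tiles, so each thread
     * works on a compact block that fits in its cache. */
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS)
            for (int i = ib; i < ib + BS && i < N; ++i)
                for (int j = jb; j <= i && j < jb + BS; ++j)
                    h[(long)i * N + j] = element(i, j);
}

int main(void) {
    double *h = malloc((size_t)N * N * sizeof *h);
    double t0 = omp_get_wtime();
    setup_high_level(h);
    printf("high-level: %.3f s\n", omp_get_wtime() - t0);
    t0 = omp_get_wtime();
    setup_blocked(h);
    printf("blocked:    %.3f s\n", omp_get_wtime() - t0);
    free(h);
    return 0;
}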

HNS has no explicit threading at all, and IIRC the library-level
threading of the BLAS/LAPACK calls there didn't help much. This could
also be improved by rewriting it to be more parallelization friendly
(possibly again mirroring how the MPI version does it, which scales
fine IIRC), but I'm not a linear algebra expert so I haven't even
tried.
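
A hedged C sketch of why library-level threading gains little there,
assuming the typical pattern of many small independent BLAS calls (the
sizes below are made up, this is not the HNS code): each tiny dgemm has
too little work to split across threads, while threading the loop over
the calls instead scales much better.

/* Many small dgemms: serial loop relying on a threaded BLAS vs. an
 * OpenMP loop over the calls (each call then effectively runs
 * single-threaded). Compile e.g. gcc -O2 -fopenmp hns.c -lcblas -lblas */
#include <cblas.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define M 32        /* small matrix dimension, made up */
#define CALLS 10000

int main(void) {
    double a[M * M], b[M * M];
    double *c = malloc((size_t)CALLS * M * M * sizeof *c);
    for (int i = 0; i < M * M; ++i) { a[i] = 1.0; b[i] = 2.0; }

    /* Variant 1: serial loop; set OMP_NUM_THREADS for the library. */
    double t0 = omp_get_wtime();
    for (int k = 0; k < CALLS; ++k)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, M, M,
                    1.0, a, M, b, M, 0.0, c + (size_t)k * M * M, M);
    printf("serial loop + threaded BLAS: %.3f s\n", omp_get_wtime() - t0);

    /* Variant 2: explicit OpenMP over the independent calls. */
    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int k = 0; k < CALLS; ++k)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, M, M,
                    1.0, a, M, b, M, 0.0, c + (size_t)k * M * M, M);
    printf("OpenMP loop over calls:      %.3f s\n", omp_get_wtime() - t0);

    free(c);
    return 0;
}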

So yeah, there is no easy way to improve this, unless you know a bit
about OpenMP and want to try it yourself (BTW prof. Blaha was always
very welcoming to contributions, even though I'm not part of the
Wien2k team :-) ).


Best regards
Pavel

