[Wien] Installation with MPI and GNU compilers

Pavel Ondračka pavel.ondracka at email.cz
Thu Apr 5 12:18:39 CEST 2018


Laurence Marks wrote on Wed, 04 Apr 2018 at 16:01 +0000:
> I confess to being rather doubtful that gfortran+... is comparable to
> ifort+... for Intel cpu, it might be for AMD. While the mkl vector
> libraries are useful in a few codes such as aim, they are minor for
> the main lapw[0-2].

Well, here is some quick benchmark data then (serial benchmark, single core):
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (haswell)
Wien2k 17.1

-------------

gfortran 7.3.1 + OPENBLAS 0.2.20 + glibc 2.26 (with the custom patch to
use libmvec):

Time for al,bl    (hamilt, cpu/wall) :          0.2         0.2
Time for legendre (hamilt, cpu/wall) :          0.1         0.2
Time for phase    (hamilt, cpu/wall) :          1.2         1.2
Time for us       (hamilt, cpu/wall) :          1.2         1.2
Time for overlaps (hamilt, cpu/wall) :          2.6         2.8
Time for distrib  (hamilt, cpu/wall) :          0.1         0.1
Time sum iouter   (hamilt, cpu/wall) :          5.5         5.8
 number of local orbitals, nlo (hamilt)      304
       allocate YL           2.5 MB          dimensions    15  3481     3
       allocate phsc         0.1 MB          dimensions  3481
Time for los      (hamilt, cpu/wall) :          0.4         0.3
Time for alm         (hns) :          0.1
Time for vector      (hns) :          0.3
Time for vector2     (hns) :          0.3
Time for VxV         (hns) :          2.1
Wall Time for VxV    (hns) :          0.1
         245  Eigenvalues computed 
 Seclr4(Cholesky complete (CPU)) :               1.380     40754.14 Mflops
 Seclr4(Transform to eig.problem (CPU)) :        4.470     37745.44 Mflops
 Seclr4(Compute eigenvalues (CPU)) :            12.750     17643.13 Mflops
 Seclr4(Backtransform (CPU)) :                   0.290     10237.08 Mflops
       TIME HAMILT (CPU)  =     5.8, HNS =     2.5, HORB =     0.0, DIAG =    18.9
       TIME HAMILT (WALL) =     6.1, HNS =     2.5, HORB =     0.0, DIAG =    19.0

real	0m28.610s
user	0m27.817s
sys	0m0.394s

-----------

Ifort 17.0.0 + MKL 2017.0:

Time for al,bl    (hamilt, cpu/wall) :          0.2         0.2
Time for legendre (hamilt, cpu/wall) :          0.1         0.2
Time for phase    (hamilt, cpu/wall) :          1.2         1.3
Time for us       (hamilt, cpu/wall) :          1.0         1.0
Time for overlaps (hamilt, cpu/wall) :          2.6         2.8
Time for distrib  (hamilt, cpu/wall) :          0.1         0.1
Time sum iouter   (hamilt, cpu/wall) :          5.4         5.6
 number of local orbitals, nlo (hamilt)      304
       allocate YL           2.5 MB          dimensions    15  3481     3
       allocate phsc         0.1 MB          dimensions  3481
Time for los      (hamilt, cpu/wall) :          0.2         0.2
Time for alm         (hns) :          0.0
Time for vector      (hns) :          0.4
Time for vector2     (hns) :          0.4
Time for VxV         (hns) :          2.1
Wall Time for VxV    (hns) :          0.1
         245  Eigenvalues computed 
 Seclr4(Cholesky complete (CPU)) :               1.110     50667.31 Mflops
 Seclr4(Transform to eig.problem (CPU)) :        3.580     47129.09 Mflops
 Seclr4(Compute eigenvalues (CPU)) :            11.320     19873.04 Mflops
 Seclr4(Backtransform (CPU)) :                   0.250     11875.01 Mflops
       TIME HAMILT (CPU)  =     5.7, HNS =     2.6, HORB =     0.0, DIAG =    16.3
       TIME HAMILT (WALL) =     5.9, HNS =     2.6, HORB =     0.0, DIAG =    16.3

real	0m25.587s
user	0m24.857s
sys	0m0.321s
-------------

So I apologize for my statement in the last email, which was too
ambitious. Indeed, in this particular case the open-source stack is ~12%
slower (28.6 s vs 25.6 s). Most of the difference is in the DIAG part
(which I believe is where OpenBLAS comes into play). However, on some
other (older) Intel CPUs the DIAG part can be even faster with OpenBLAS;
see the already mentioned email by Prof. Blaha,
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15106.html
where he tested on an i7-3930K (Sandy Bridge). Hence for those older
CPUs I would expect the performance to be really comparable (with the
small patch to utilize libmvec in order to speed up the HAMILT part).
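
The ~12% figure, and the claim that most of the gap sits in DIAG, can be
checked directly from the two logs above. A minimal Python sketch using
only the numbers printed there (note that DIAG is CPU time while "real"
is wall time, so the DIAG share of the gap is only approximate):

```python
# Timings in seconds, copied from the two benchmark logs above.
# HAMILT/HNS/DIAG are the "(CPU)" values; "real" is the wall time from `time`.
gnu = {"HAMILT": 5.8, "HNS": 2.5, "DIAG": 18.9, "real": 28.610}
intel = {"HAMILT": 5.7, "HNS": 2.6, "DIAG": 16.3, "real": 25.587}


def slowdown_percent(slow, fast):
    """Relative slowdown of `slow` with respect to `fast`, in percent."""
    return 100.0 * (slow / fast - 1.0)


# Per-phase slowdown of the gfortran+OpenBLAS build relative to ifort+MKL.
slowdowns = {key: slowdown_percent(gnu[key], intel[key]) for key in gnu}

# Rough share of the total gap that is explained by DIAG alone
# (CPU seconds divided by wall seconds, hence approximate).
diag_share = 100.0 * (gnu["DIAG"] - intel["DIAG"]) / (gnu["real"] - intel["real"])

for key, value in slowdowns.items():
    print(f"{key:7s} {value:+6.1f} %")
print(f"DIAG share of the gap: {diag_share:.0f} %")
```

A negative value means the gfortran build was faster in that phase, as
is the case for HNS here.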

In general, open-source support is usually slow to materialize, hence
the open-source stack compares better on older CPUs. This holds
especially for OpenBLAS, where the optimizations for new CPUs and
instruction sets are not provided by Intel (unlike gcc, gfortran and
glibc, where Intel engineers contribute directly), while MKL and ifort
have good support from day 1.

I do agree that it is better to advise users to use MKL+ifort, since
when they have it properly installed, siteconfig is almost always able
to detect and build everything out of the box with the default config.
This is unfortunately not the case with the open-source libraries, where
the detection does not work most of the time due to distro differences
and the unfortunate fact that the majority of the needed libraries do
not provide any good means for autodetection (e.g. proper package config
files), hence the user must edit the compiler flags by hand. I just
believe that the "ifort is always much faster than gfortran" dogma is
no longer always true.
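
To illustrate the autodetection problem, here is a minimal sketch of how
one could ask pkg-config for linker flags and fall back to manual
editing when it has no answer. The package names in CANDIDATES are
assumptions (the .pc file names vary between distributions, and some
libraries ship none at all), and this is not how siteconfig itself
works.

```python
import subprocess

# Hypothetical candidate names; the real .pc file names differ between
# distributions and may not exist at all.
CANDIDATES = ["openblas", "fftw3", "scalapack"]


def probe(package, run=subprocess.run):
    """Return the linker flags pkg-config reports for `package`, or None.

    `run` is injectable so the function can be exercised without
    pkg-config being installed.
    """
    try:
        result = run(
            ["pkg-config", "--libs", package],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # pkg-config itself is not installed.
        return None
    if result.returncode != 0:
        # No .pc file found for this package.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    for name in CANDIDATES:
        flags = probe(name)
        status = flags if flags is not None else "not found: set flags by hand"
        print(f"{name:10s} {status}")
```

Every "not found" line corresponds to a library whose flags would have
to be entered manually.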

Best regards
Pavel

