[Wien] vectorized math functions with GCC (to speedup lapw1)

Pavel Ondračka pavel.ondracka at email.cz
Tue Jul 25 15:01:30 CEST 2017


Dear Wien2k mailing list,

just thought I would share this, maybe someone can also benefit.

It has been brought up a few times that one reason why gfortran +
OpenBLAS is slower than ifort + MKL is the inability of the GNU
compiler stack to vectorize the math functions, whereas the Intel
compiler provides the VML. Recently I found out that new versions of
glibc ship a vectorized math library, libmvec
(https://sourceware.org/glibc/wiki/libmvec), so I gave it a try to
see if I could vectorize the trigonometric functions (I only focused
on the ones in the vectf.f file). Unfortunately I was not able to get
it working with gfortran, even though I believe gfortran does call
the C math library, just not the vectorized one.

However, after porting vectf.f to C, the GNU C compiler vectorizes it
without any problems. So just replacing vectf.f with the C version,
compiling the C file by hand with "gcc -c vectf.c -O2
-ftree-loop-vectorize -ffast-math -march=native", and then running
the normal "make {complex,...}" in the lapw1 directory builds the
binary with the vectorized functions (-march=native might be
overkill; -mavx is probably all that is needed, plus of course a CPU
with AVX support).
No special library is needed (libmvec comes with glibc and is
therefore installed by default on any recent distro); it is enough to
compile the C version of vectf on glibc 2.23+ with GCC 6+ and the
"-O2 -ftree-loop-vectorize -ffast-math -mavx" switches. The rest of
the subroutines can be compiled and linked with the default settings.
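
To give an idea of what the C port looks like: the routine is
essentially just a plain loop over the arguments, which GCC can turn
into calls to the libmvec SIMD variants of cos/sin. The argument list
and loop body below are only my illustration of the idea, not
necessarily the exact contents of the attached vectf.c:

/* vectf.c -- sketch of a vectorizable replacement for vectf.f.
 * The trailing underscore and pass-by-reference arguments follow the
 * default gfortran calling convention, so the Fortran callers in
 * lapw1 can keep calling VECTF as before. */
#include <math.h>

void vectf_(const int *n, const double *arg, double *co, double *si)
{
    /* A simple countable loop: with -O2 -ftree-loop-vectorize
     * -ffast-math and an AVX target, GCC should replace cos/sin with
     * the SIMD versions provided by glibc's libmvec. */
    for (int i = 0; i < *n; ++i) {
        co[i] = cos(arg[i]);
        si[i] = sin(arg[i]);
    }
}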

The resulting timing differences for the serial benchmark on 1 core
(Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz):

libmvec:
Time for al,bl    (hamilt, cpu/wall) :          0.3         0.3
Time for legendre (hamilt, cpu/wall) :          0.2         0.2
Time for phase    (hamilt, cpu/wall) :          1.1         1.3
Time for us       (hamilt, cpu/wall) :          1.2         1.2
Time for overlaps (hamilt, cpu/wall) :          3.0         3.0
Time for distrib  (hamilt, cpu/wall) :          0.0         0.1
Time sum iouter   (hamilt, cpu/wall) :          6.0         6.1

default gfortran:
Time for al,bl    (hamilt, cpu/wall) :          0.3         0.3
Time for legendre (hamilt, cpu/wall) :          0.1         0.2
Time for phase    (hamilt, cpu/wall) :         20.3        20.4
Time for us       (hamilt, cpu/wall) :          4.8         4.8
Time for overlaps (hamilt, cpu/wall) :          2.8         2.8
Time for distrib  (hamilt, cpu/wall) :          0.1         0.1
Time sum iouter   (hamilt, cpu/wall) :         28.4        28.6

The total time for the lapw1 run of the serial benchmark goes from
~52s to ~30s (GCC 6.3.1, OpenBLAS 0.2.19).

One thing I was quite nervous about is precision, specifically the
use of the -ffast-math switch, since with it the math functions are
no longer IEEE compliant. And indeed there are some small differences
(looking at the first few energy eigenvalues). However, when compared
to the output from ifort + MKL, the results are actually closer to
the libmvec ones, so whatever -ffast-math is doing, it seems to be
similar to what the Intel compiler does. This seems quite OK to me,
but I'll be interested to hear what the developers think about it. I
have run a few simple test cases so far without problems, but nothing
exhaustive.

default gfortran:
           1 -0.76923963235447435     
           2 -0.70596761593017998     
           3 -0.67568777667231306     
           4 -0.67388768026644197     
           5 -0.65988467846896359     

with libmvec:
           1 -0.76923963235445836     
           2 -0.70596761593017998     
           3 -0.67568777667232904     
           4 -0.67388768026645796     
           5 -0.65988467846894761     

ifort + MKL + VML (this was done with Wien2k 16, but I suppose this
should not matter):
           1 -0.769239632354426
           2 -0.705967615930132
           3 -0.675687776672265
           4 -0.673887680266394
           5 -0.659884678468836

I do not have any direct comparison of libmvec vs. VML timings, since
my test machines do not have MKL and the machines with MKL run
enterprise distros with ancient gcc and glibc.
The vectf.c file is attached, and I'll be interested to hear whether
anyone can reproduce the speedup, or whether anyone can spot possible
problems with this approach. It is really only a naive attempt, and I
guess a proper C-to-Fortran wrapper would be needed, since I'm not
sure the current way is really OK (or even whether C code in lapw1
would be acceptable). Maybe it is even possible to use libmvec from
pure Fortran code, but I was not able to get that working.
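
For what it's worth, one way to avoid relying on the
compiler-specific trailing-underscore convention would be an explicit
interface via ISO_C_BINDING on the Fortran side; a minimal sketch
(the names and argument list are again only my illustration, not an
official WIEN2k interface):

/* Sketch of a more explicit C <-> Fortran interface.
 * A matching Fortran interface block could look like:
 *
 *   interface
 *     subroutine vectf(n, arg, co, si) bind(C, name="vectf")
 *       use iso_c_binding
 *       integer(c_int), value       :: n
 *       real(c_double), intent(in)  :: arg(*)
 *       real(c_double), intent(out) :: co(*), si(*)
 *     end subroutine
 *   end interface
 *
 * so the C side no longer has to guess the name mangling or take
 * scalars by reference. */
#include <math.h>

void vectf(int n, const double *arg, double *co, double *si)
{
    for (int i = 0; i < n; ++i) {
        co[i] = cos(arg[i]);
        si[i] = sin(arg[i]);
    }
}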

Best regards
Pavel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vectf.c
Type: text/x-csrc
Size: 637 bytes
Desc: not available
URL: <http://zeus.theochem.tuwien.ac.at/pipermail/wien/attachments/20170725/79cce4f0/attachment.c>

