[Wien] Installation with MPI and GNU compilers

Wed May 2 23:04:54 CEST 2018

---------- Original e-mail ----------
From: Fecher, Gerhard <fecher at uni-mainz.de>
To: Pavel Ondračka <pavel.ondracka at email.cz>, wien at zeus.theochem.tuwien.ac.at <wien at zeus.theochem.tuwien.ac.at>
Date: 2. 5. 2018 16:08:06
Subject: AW: [Wien] Installation with MPI and GNU compilers 
"Dear Pavel, 
maybe it's better to ask Laurence; it seems he wrote the VML parts. 

I haven't looked into the code in recent years; what I found on a quick 
look is: 

The only place where INTEL_VML is still used seems to be Hamilt.f of LAPW1. 
I found that it is commented out in all other places where it was once used. 

If you don't use INTEL_VML, Intel ifort will vectorize the loops in 
vectf.f of LAPW1 (see the code in Hamilt.f that calls it) 
(as I mentioned, maybe one has to link libsvml explicitly)."

BTW, is SVML part of MKL, or do you need ifort for that? 

" 
For example 
-O2 -xHost -qopt-report=1 -qopt-report-phase=vec 
will show you which loops were vectorized"

Indeed, if I add -O2 and -xHost to the default Wien2k flags (with ifort 
and MKL), there is no performance hit when I remove -DINTEL_VML.

"I could not see that SVML has reduced accuracy; however, you can set 
the performance/accuracy level in VML. 
What you can do is set a threshold for the loop size (similar to 
unrolling); this might need a short study of the manual. "

Interesting, I will try to run some tests for the speed and accuracy of some
basic trigonometric functions for ifort vs gfortran and standard glibc vs 
libmvec vs VML vs svml.

"
I could not see that a threshold for the loops (size of the arrays) was 
set in W2kinit.F; only the precision was set there for INTEL_VML. 
However, I guess that Laurence used it only where large arrays appear. 

NB: I particularly enjoy questions about how to increase the speed or 
improve the code. "

Well, I do believe the code is well optimized when you have ifort 
+ MKL; however, the other options are somewhat worse.

Since you can nowadays get MKL for free (but not ifort), there is the 
gfortran + MKL combination, which has no default config and is slow, as 
Rui reported at the beginning of the thread. I'm quite sure this 
combination can be made almost as fast as ifort + MKL (either by fixing 
the INTEL_VML define to work around the missing ifcore problem, or 
possibly by using the -mveclibabi=svml gfortran switch, or some other 
trick). I'm not sure how many people have this setup, though. 

The most problematic is the gfortran + OpenBLAS combination, where I was 
not able to force gfortran to use the vectorized (SIMD) math. It works 
with C code (which is why my approach to making lapw1 fast includes 
porting vectf.f to C) but not with Fortran. There may be some way to make 
this work, but I have had no luck so far. libmvec has a public interface, 
so it might be possible to call it directly, similarly to VML; however, 
that would introduce a lot of #ifdef LIBMVEC into the code, which I guess 
is not a good idea. I would like this to work better out of the box, so 
I'll keep looking for a solution that does not require extensive changes 
to the code or the siteconfig script. Dunno if the authors accept patches 
anyway...

Best regards

Pavel

" 
Ciao 
Gerhard 

DEEP THOUGHT in D. Adams, The Hitchhiker's Guide to the Galaxy: 
"I think the problem, to be quite honest with you, 
is that you have never actually known what the question is." 

==================================== 
Dr. Gerhard H. Fecher 
Institute of Inorganic and Analytical Chemistry 
Johannes Gutenberg - University 
55099 Mainz 
and 
Max Planck Institute for Chemical Physics of Solids 
01187 Dresden 
________________________________________ 
From: Pavel Ondračka [pavel.ondracka at email.cz] 
Sent: Wednesday, 2 May 2018 12:05 
To: Fecher, Gerhard 
Subject: Re: [Wien] Installation with MPI and GNU compilers 

I'm answering privately, since this might be getting too technical for 
the list and is in fact not interesting for the majority of users... 

Fecher, Gerhard wrote on Wed 02. 05. 2018 at 09:00 +0000: 
> I never checked that: does the -DINTEL_VML switch correspond to the 
> VML library routines of MKL 
> or to the 
> SVML library routines of the compiler? 

lapw1 calls the VML library directly, for example the vdcos and vdsin 
functions, but I have not checked the rest of Wien2k. 

> this makes a difference, the svml routines are automatically invoked 
> by the INTEL compiler if one uses -O2 optimization or higher. 
> (check also the usage of the switches -vec, -no-vec, -vec-report) 
> 
> The VML routines of MKL only make sense for appropriate sizes of 
> the vectors; otherwise, they may even slow down the program (how much 
> might also depend on threads etc.). 

The common usage of VML in Wien2k is to call the VML functions with 
a _large_ array as an argument. So, if I understand it correctly, the 
vectorization is done inside VML, and VML chooses the best 
intrinsics. Since the arrays are large, there is a speedup in all cases. 

BTW, are you sure the -O2 switch alone will give you the SVML 
intrinsics? IMO the SVML intrinsics have different accuracy (they might 
not be strictly IEEE compliant compared to the scalar variants), so I 
would expect that you need to state explicitly, with some additional 
flag, that you are OK with this (e.g., for GCC you need the -ffast-math 
switch to get the vectorized SSE/AVX trigonometric functions from libmvec). 

> A note (for the INTEL Fortran): 
> I vaguely remember that the -DINTEL_VML switch did not bring any 
> better performance; at that time one needed to give -lsvml (with the 
> path to the compiler libs) explicitly. 
> 
> Ciao 
> Gerhard 
> 
Best regards 
Pavel 
"