[Wien] some more benchmarks

Thu Nov 13 14:35:38 CET 2003

Hello,
We ran some more benchmarks, using the same test case as in previous messages on the mailing list (about a month ago?).
We tested :
* on a cluster of dual 2.4 GHz xeon machines (2 GB per node)
* on a dual 1.8 GHz opteron (64 bits) machine (4 GB)
The test was : running lapw1c for one k-point (like in the previous mails). 

ON THE XEON CLUSTER :
We compared the default wien libraries (lapack_lapw and blas_lapw), the mkl 6.0 libraries, and the libgoto_p4_512-r0.6.so blas library from http://www.cs.utexas.edu/users/kgoto/signup_first.html (in spite of what the link says, you don't have to sign up for anything at all).  The goto library contains only blas and part of lapack, so for this last option we also linked lapack_lapw to supply the remaining lapack routines.  We used ifc 7 for compilation.  We also checked the impact of optimization.

no -O3 flag :
lapw libraries: 875 s
     beo-18(1) 874.540u 3.480s 14:54.16 98.1%   0+0k 0+0io 232pf+0w
mkl : 343 s
     beo-18(1) 343.330u 1.020s 5:44.38 99.9%    0+0k 0+0io 353pf+0w
goto (we still use lapack_lapw for some lapack routines) : 312 s
     beo-18(1) 311.580u 1.040s 5:12.96 99.8%    0+0k 0+0io 449pf+0w

using -O3 flag :
lapw : 839 s
     beo-18(1) 838.660u 1.360s 14:00.18 99.9%   0+0k 0+0io 238pf+0w
mkl : 341 s
     beo-18(1) 341.360u 1.380s 5:43.07 99.9%    0+0k 0+0io 353pf+0w
goto : 304 s
     beo-18(1) 303.660u 0.980s 5:04.64 100.0%   0+0k 0+0io 453pf+0w
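
For anyone who wants to collect such timings automatically, here is a minimal sketch that pulls the user, system and elapsed times out of lines like the ones above. It assumes the csh/tcsh style `time` summary shown in the listings (`<user>u <system>s <elapsed> <cpu>%`); the function names are made up for this illustration and are not part of wien.

```python
import re

# Matches the "<user>u <system>s <elapsed> <cpu>%" part of a tcsh-style
# time summary, as it appears in the listings above (assumed format).
TIME_RE = re.compile(
    r"(?P<user>\d+(?:\.\d+)?)u\s+"
    r"(?P<system>\d+(?:\.\d+)?)s\s+"
    r"(?P<elapsed>(?:\d+:)?\d+:\d+(?:\.\d+)?)\s+"
    r"(?P<cpu>\d+(?:\.\d+)?)%"
)


def elapsed_to_seconds(text):
    """Convert 'm:ss.ss' or 'h:mm:ss' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_time_line(line):
    """Return user, system and elapsed seconds plus CPU percentage."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError("no time summary found in: " + line)
    return {
        "user": float(match["user"]),
        "system": float(match["system"]),
        "elapsed": elapsed_to_seconds(match["elapsed"]),
        "cpu_percent": float(match["cpu"]),
    }


if __name__ == "__main__":
    line = "beo-18(1) 874.540u 3.480s 14:54.16 98.1%   0+0k 0+0io 232pf+0w"
    print(parse_time_line(line))
```

The rounded figures quoted next to each library (875 s, 343 s, ...) correspond to the user time field.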

So the goto blas is clearly superior to the default libraries, and it is even roughly 10% faster than the mkl!  It exists, by the way, in lots of flavours for different kinds of machines, all to be found at the url mentioned earlier.
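
To put the comparison in numbers, here is a small sketch that turns the -O3 timings from the listing above into speedup factors. The times are copied from this message; the dictionary and function names are only illustrative.

```python
# User CPU times in seconds for lapw1c with -O3, copied from the listing above.
TIMES_O3 = {"lapw": 839.0, "mkl": 341.0, "goto": 304.0}


def speedups(times, baseline):
    """Return how many times faster each entry is than the baseline entry."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


if __name__ == "__main__":
    for name, factor in speedups(TIMES_O3, "lapw").items():
        print(f"{name:5s} {factor:.2f}x vs lapw")
    goto_vs_mkl = speedups(TIMES_O3, "mkl")["goto"]
    print(f"goto vs mkl: {goto_vs_mkl:.2f}x")
```

The same function can be applied to the timings without -O3 to check that the ranking does not depend on the optimization flag.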

ON OPTERON
We had many problems compiling on the opteron.  We could make the acml library work, but not the atlas library, and not the goto libraries specifically designed for opteron.  Probably they can be made to work, but it will require a little more effort.  We feel that the performance of the opteron could still be improved significantly (e.g. by getting the goto blas to work).  As it seems many people are now trying to work on an opteron and all are encountering the same problems, there's probably a good chance we'll get there soon :-)
Thanks to R. Fehrenbacher and M. Todorova, who gave very helpful advice on the opteron compilation.
using an executable compiled on xeon (ifc7.0 -O3 -xW) using goto library :
    401 s  
using ifc+mkl, compiled on xeon, no optimization (O3)
     696s
using the AMD 64 bit ACML libraries, compiled on the opteron with pgf90 64 bit and optimization : 445 s
     430 s

Another issue which was not addressed in previous benchmarks, but that we feel is worth mentioning, is the capability of the machine to handle more than one job at once (e.g. two jobs per two-processor node seems reasonable :-)).  So we launched our test job twice at the same time on one node.
Here the opteron definitely copes better than the xeon: the second job adds only 8 s on the opteron, but 68 s on the xeon.
OPTERON (acml)
time for one job : 430 s
time for two jobs : 438 s
XEON (goto)
time for one job : 307 s
time for two jobs : 375 s
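
The penalty for sharing a node can be expressed as a relative slowdown. The sketch below uses the four timings above and assumes that "time for two jobs" is the run time observed while both jobs run at once on the same node; the names are illustrative only.

```python
# Timings in seconds from this message: (one job alone, two jobs at once).
TIMINGS = {
    "opteron (acml)": (430.0, 438.0),
    "xeon (goto)": (307.0, 375.0),
}


def two_job_slowdown(single, double):
    """Relative increase in run time when a second job shares the node."""
    return double / single - 1.0


if __name__ == "__main__":
    for machine, (single, double) in TIMINGS.items():
        slowdown = two_job_slowdown(single, double)
        print(f"{machine}: +{slowdown:.1%} with two jobs")
```

With these numbers the slowdown is about 2% on the opteron and about 22% on the xeon.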

Conclusions :
* goto is the fastest blas I know (for xeon, at least)
* additional work is needed to optimize the performance of the opteron
* the opteron is definitely more efficient in handling two jobs at once

Kevin.