[Wien] wien2k, gotoblas and multi threads

Wed Aug 13 18:28:50 CEST 2008

I get much better timings for the serial benchmark using an ifort+mkl
version of wien2k on the same machine.  I'm not seeing any speedup
with k-point parallelization yet though.

- machine: dual Xeon quad-core E5430 @ 2.66GHz with 8GB 667MHz RAM

1) timings for wien2k-08.2-20080407 built with
- ifort 10.1.017
- mkl 10.0.3.020

1.1) wien2k serial benchmark
- x lapw1 -c
- varying OMP_NUM_THREADS from 1 to 8

OMP_NUM_THREADS=1: 116.292u 0.386s 1:56.69 99.9%        0+0k 0+33256io 0pf+0w
OMP_NUM_THREADS=2: 148.964u 0.963s 1:17.11 194.4%       0+0k 0+33240io 0pf+0w
OMP_NUM_THREADS=3: 182.932u 1.495s 1:11.11 259.3%       0+0k 0+33240io 0pf+0w
OMP_NUM_THREADS=4: 213.973u 1.356s 1:03.52 338.9%       0+0k 0+33240io 0pf+0w
OMP_NUM_THREADS=5: 251.813u 2.195s 1:03.51 399.9%       0+0k 0+33240io 0pf+0w
OMP_NUM_THREADS=6: 294.103u 2.429s 1:02.11 477.4%       0+0k 0+33240io 0pf+0w
OMP_NUM_THREADS=7: 329.413u 2.686s 1:01.91 536.4%       0+0k 0+33240io 0pf+0w
OMP_NUM_THREADS=8: 374.467u 2.488s 1:01.12 616.7%       0+0k 0+33240io 0pf+0w
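
To make the scaling in the table above easier to read, here is a minimal sketch that turns the elapsed-time column into speedup and parallel efficiency relative to the 1-thread run. The function names are hypothetical; only the timing lines come from the measurements above (the trailing I/O counters are dropped for brevity).

```python
# Sketch: speedup and parallel efficiency from the csh `time` lines above.
# Function names are illustrative; only the numbers come from the table.

TIMINGS = """\
OMP_NUM_THREADS=1: 116.292u 0.386s 1:56.69 99.9%
OMP_NUM_THREADS=2: 148.964u 0.963s 1:17.11 194.4%
OMP_NUM_THREADS=3: 182.932u 1.495s 1:11.11 259.3%
OMP_NUM_THREADS=4: 213.973u 1.356s 1:03.52 338.9%
OMP_NUM_THREADS=5: 251.813u 2.195s 1:03.51 399.9%
OMP_NUM_THREADS=6: 294.103u 2.429s 1:02.11 477.4%
OMP_NUM_THREADS=7: 329.413u 2.686s 1:01.91 536.4%
OMP_NUM_THREADS=8: 374.467u 2.488s 1:01.12 616.7%
"""


def elapsed_seconds(field):
    """Convert a csh elapsed-time field such as '1:56.69' to seconds."""
    seconds = 0.0
    for part in field.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_wallclock(text):
    """Map thread count -> wallclock seconds for each timing line."""
    wall = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        label, _user, _system, elapsed, _cpu = line.split()[:5]
        threads = int(label.rstrip(":").split("=")[1])
        wall[threads] = elapsed_seconds(elapsed)
    return wall


def speedup_table(wall):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = wall[1]
    return {n: (base / t, base / t / n) for n, t in sorted(wall.items())}


if __name__ == "__main__":
    for n, (s, e) in speedup_table(parse_wallclock(TIMINGS)).items():
        print(f"threads={n}  speedup={s:.2f}  efficiency={e:.2f}")
```

On these numbers the speedup rises from about 1.5 at 2 threads to about 1.9 at 8 threads, with little gain beyond 4 threads.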

1.2) wien2k serial benchmark run with k-point parallelism
- process started with command 'x lapw1 -p'
- OMP_NUM_THREADS=1, GOTO_NUM_THREADS=1
- varying .machines file with N lines, N from 1 to 8, where each line is:

1:localhost

k-point parallel N=1:    localhost       k=1     user=116.173    wallclock=116.59
k-point parallel N=2:    localhost       k=1     user=116.312    wallclock=116.79
k-point parallel N=3:    localhost       k=1     user=116.254    wallclock=116.66
k-point parallel N=4:    localhost       k=1     user=116.306    wallclock=116.76
k-point parallel N=5:    localhost       k=1     user=116.09     wallclock=116.52
k-point parallel N=6:    localhost       k=1     user=116.218    wallclock=116.66
k-point parallel N=7:    localhost       k=1     user=116.251    wallclock=116.68
k-point parallel N=8:    localhost       k=1     user=116.372    wallclock=116.79
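
For reference, the .machines files used in this test (N identical lines of 1:localhost) can be generated with a small helper like the sketch below. The function names are hypothetical, and the sketch writes only the lines shown above; any other .machines directives are deliberately left out.

```python
# Sketch: build a .machines file with N identical "1:localhost" lines,
# as used for the k-point parallel runs above. Helper names are illustrative.
from pathlib import Path


def machines_text(n_jobs, host="localhost"):
    """Return .machines content with one '1:host' line per parallel job."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    return "".join(f"1:{host}\n" for _ in range(n_jobs))


def write_machines(directory, n_jobs, host="localhost"):
    """Write .machines into the given case directory and return its path."""
    path = Path(directory) / ".machines"
    path.write_text(machines_text(n_jobs, host))
    return path


if __name__ == "__main__":
    # Print the N=4 variant to stdout instead of touching any case directory.
    print(machines_text(4), end="")
```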

2) timings for wien2k-08.2-20080407 built with
- GNU Fortran (GCC) 4.2.3 (4.2.3-6mnb1)
- GotoBLAS-1.26

2.1) wien2k serial benchmark
- x lapw1 -c
- varying OMP_NUM_THREADS from 1 to 8

OMP_NUM_THREADS=1: 195.463u 0.307s 3:15.80 99.9%        0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=2: 199.565u 0.569s 2:57.40 112.8%       0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=3: 204.145u 0.635s 2:51.02 119.7%       0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=4: 211.666u 0.736s 2:49.02 125.6%       0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=5: 222.604u 1.032s 2:48.41 132.7%       0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=6: 231.258u 0.927s 2:47.54 138.5%       0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=7: 243.170u 0.996s 2:46.55 146.5%       0+0k 0+33264io 0pf+0w
OMP_NUM_THREADS=8: 252.584u 0.916s 2:46.57 152.1%       0+0k 0+33264io 0pf+0w
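
To compare the two builds directly, the following sketch divides the gfortran+GotoBLAS elapsed time by the ifort+mkl elapsed time at each thread count. The elapsed times are copied from sections 1.1 and 2.1; the list and function names are hypothetical.

```python
# Sketch: wallclock ratio of the gfortran+GotoBLAS build to the ifort+mkl
# build at each OMP_NUM_THREADS value (1..8). Times copied from 1.1 and 2.1.

IFORT_MKL = ["1:56.69", "1:17.11", "1:11.11", "1:03.52",
             "1:03.51", "1:02.11", "1:01.91", "1:01.12"]
GFORTRAN_GOTO = ["3:15.80", "2:57.40", "2:51.02", "2:49.02",
                 "2:48.41", "2:47.54", "2:46.55", "2:46.57"]


def to_seconds(elapsed):
    """Convert an 'm:ss.ss' elapsed-time string to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


def build_ratios(slow, fast):
    """Return [(threads, slow_seconds / fast_seconds)], threads starting at 1."""
    return [(n, to_seconds(s) / to_seconds(f))
            for n, (s, f) in enumerate(zip(slow, fast), start=1)]


if __name__ == "__main__":
    for threads, ratio in build_ratios(GFORTRAN_GOTO, IFORT_MKL):
        print(f"threads={threads}  gfortran+GotoBLAS / ifort+mkl = {ratio:.2f}")
```

On these timings the gfortran+GotoBLAS build takes about 1.7 times as long with one thread and about 2.7 times as long with eight.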

--
Todd Pfaff <pfaff at mcmaster.ca>
Research & High-Performance Computing Support
McMaster University, Hamilton, Ontario, Canada
http://www.rhpcs.mcmaster.ca/~pfaff

On Tue, 12 Aug 2008, Peter Blaha wrote:

> Looking at these numbers tells me that you should probably invest in
> ifort + mkl. It does not make sense to buy expensive new hardware if,
> with bad software, it runs slower than a 6-year-old PC.
> Compare your timings with the benchmark page to see what is possible.
>
> k-point parallelization: Please read the UG !!! This is fairly simple.
>
> 1:localhost:4    utilizes the mpi-parallel version;
>
> you need to put N lines
>
> 1:localhost
> 1:localhost
> ...
>
> to specify running N lapw1 processes in parallel.
>
> Todd Pfaff wrote:
>> Peter, thanks for the response.
>>
>> I'm getting a small speedup from multithreading in libgoto.  Here are
>> timings from the wien2k serial benchmark:
>>
>> OMP_NUM_THREADS=1: 195.463u 0.307s 3:15.80 99.9%        0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=2: 199.565u 0.569s 2:57.40 112.8%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=3: 204.145u 0.635s 2:51.02 119.7%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=4: 211.666u 0.736s 2:49.02 125.6%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=5: 222.604u 1.032s 2:48.41 132.7%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=6: 231.258u 0.927s 2:47.54 138.5%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=7: 243.170u 0.996s 2:46.55 146.5%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=8: 252.584u 0.916s 2:46.57 152.1%       0+0k 0+33264io 0pf+0w
>>
>>
>> I would like to explore the k-point parallelization.  But when I run
>> 'x lapw1 -p' it aborts with an error message about being unable to run
>> lapw1c_mpi.  It appears to be trying to run the fine-grained
>> MPI parallel version.  I'm not building wien2k with mpi, so I don't have a
>> lapw1c_mpi.  I must be misunderstanding something.  What am I doing wrong
>> that's causing it to try to run this lapw1c_mpi executable?
>>
>> Which of these is the appropriate .machines file for k-point
>> parallelization across N cpu cores on a single machine?
>>
>> This?
>>
>>    1:localhost:N
>>
>> Or this?
>>
>>    N:localhost
>>
>> And do I need any of these lines?
>>
>>    extrafine
>>    granularity:1
>>    residue:localhost
>>
>> Or do I need something else either in .machines or in some other
>> file or on the command line?
>>
>> On Mon, 11 Aug 2008, Peter Blaha wrote:
>>
>>> The program lapw1 spends a large fraction of its time in BLAS routines,
>>> so it can benefit from the multithreading of GotoBLAS (or MKL).
>>> If you set the variables you mentioned to 2 (or 4), you should see a
>>> speedup. The improvement depends on many factors, but it will be at
>>> most about 50%.
>>>
>>> Another way to use the multiple cores is k-point
>>> parallelism.
>>> Generate a .machines file that lists your machine name 2, 4, or 8 times
>>> and test the performance with     x lapw1 -p.
>>> On some architectures (with a slow memory bus), 4 parallel jobs
>>> may give the best performance (because the slow memory bus cannot
>>> feed all 8 cpus properly); on others you can use 8 parallel jobs.
>>> Sometimes a mixture (4 k-point parallel + OMP_NUM_THREADS=2) is best.
>>>
>>> Todd Pfaff wrote:
>>>> We're using:
>>>>
>>>>    wien2k-08.2-20080407
>>>>
>>>> built with:
>>>>
>>>>    GNU Fortran (GCC) 4.2.3 (4.2.3-6mnb1)
>>>>    GotoBLAS-1.26
>>>>
>>>> and running on an 8 core (2 x quad core) Xeon machine.
>>>>
>>>> Can wien2k take advantage of multithreading inherent to GotoBLAS
>>>> when either GOTO_NUM_THREADS or OMP_NUM_THREADS is set?
>>>>
>>>> If so, can someone provide, or direct me to, a document detailing
>>>> the best way to build and run wien2k for such an environment?
>>>>
>>>> Thank you,
>>>> _______________________________________________
>>>> Wien mailing list
>>>> Wien at zeus.theochem.tuwien.ac.at
>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>