[Wien] wien2k, gotoblas and multi threads

Wed Aug 13 19:04:44 CEST 2008

These numbers make "sense".

k-point parallelization:

In "real" cases, one has to solve an eigenvalue problem for "many" 
k-points ("many" typically means 10-1000). In these cases, k-point 
parallelism is very efficient.

The benchmark case has only ONE k-point in its *.klist file, thus 
there's no k-point parallelism.
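
As a rough illustration of why the quoted numbers are plausible, the thread speedup can be worked out from the wall-clock times in the timings quoted below (1 k-point, x lapw1 -c). This is only a sketch: the dictionary and function names are made up for the example, and only the 1, 2, 4 and 8 thread entries are copied in.

```python
# Wall-clock seconds copied from the quoted "time" output (mm:ss.ss -> s).
# Keys are OMP_NUM_THREADS values; names are illustrative only.
ifort_mkl = {1: 116.69, 2: 77.11, 4: 63.52, 8: 61.12}
gfortran_goto = {1: 195.80, 2: 177.40, 4: 169.02, 8: 166.57}


def speedups(wall):
    """Speedup relative to the single-thread run, rounded to 2 digits."""
    base = wall[1]
    return {n: round(base / t, 2) for n, t in wall.items()}


for label, wall in (("ifort+mkl", ifort_mkl), ("gfortran+goto", gfortran_goto)):
    print(label, speedups(wall))
```

With 8 threads this gives a speedup of about 1.9 for ifort+mkl and about 1.2 for gfortran+GotoBLAS. The k-point parallel runs with N=1 to 8 all stay near 116.6 s because there is only one k-point to distribute.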

When you edit the *.klist file (e.g. repeat the 1st line 8 times), 
you will see that the sequential run takes almost exactly 8 times 
as long. However, with k-point parallelism you will probably get a 
speedup of 4-6 on your machine.
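
The edit described above could be sketched like this. The function name is hypothetical, the file is treated as opaque text lines, and the sample lines are placeholders rather than a real klist; whatever follows the first line is simply kept unchanged.

```python
def repeat_first_kpoint(lines, n):
    """Return a copy of the klist lines with the first line repeated n times.

    All lines after the first one are kept as they are.
    """
    if not lines:
        raise ValueError("empty klist")
    if n < 1:
        raise ValueError("n must be at least 1")
    return [lines[0]] * n + lines[1:]


# Placeholder content, NOT a real klist file.
original = ["<first k-point line>\n", "<rest of file>\n"]
edited = repeat_first_kpoint(original, 8)
print(len(edited))
```

To apply it to a real case one would read the lines of the *.klist file, pass them through the function and write the result back, keeping a backup of the original file.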

You can still find out which is more efficient on your specific machine: 
use (for 8 k-points) 8 lines in .machines and OMP_NUM_THREADS=1, or only 
4 lines and OMP_NUM_THREADS=2.
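
The two setups to compare can be written down as follows. This is a minimal sketch, assuming the "1:localhost" line format from the quoted message; the helper name is made up, and nothing is written to disk or executed.

```python
def machines_text(n_jobs, host="localhost"):
    """Contents of a .machines file with n_jobs lines of the form 1:host."""
    return "".join(f"1:{host}\n" for _ in range(n_jobs))


# (lines in .machines, OMP_NUM_THREADS) for the 8-core machine in question
configs = [(8, 1), (4, 2)]

for n_jobs, threads in configs:
    print(f"# set OMP_NUM_THREADS={threads}, then run: x lapw1 -p")
    print(machines_text(n_jobs), end="")
```

Both setups use 8 cores in total; timing x lapw1 -p for each one shows which is faster on a given machine.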

I'm sure the colleagues from physics/chemistry can explain the 
"k-points" to you.

Regards

Todd Pfaff wrote:
> I get much better timings for the serial benchmark using an ifort+mkl
> version of wien2k on the same machine.  I'm not seeing any speedup
> with k-point parallelization yet though.
> 
> - machine: dual Xeon quad-core E5430 @ 2.66GHz with 8GB 667MHz RAM
> 
> 1) timings for wien2k-08.2-20080407 built with
> - ifort 10.1.017
> - mkl 10.0.3.020
> 
> 1.1) wien2k serial benchmark
> - x lapw1 -c
> - varying OMP_NUM_THREADS from 1 to 8
> 
> OMP_NUM_THREADS=1: 116.292u 0.386s 1:56.69 99.9%        0+0k 0+33256io 0pf+0w
> OMP_NUM_THREADS=2: 148.964u 0.963s 1:17.11 194.4%       0+0k 0+33240io 0pf+0w
> OMP_NUM_THREADS=3: 182.932u 1.495s 1:11.11 259.3%       0+0k 0+33240io 0pf+0w
> OMP_NUM_THREADS=4: 213.973u 1.356s 1:03.52 338.9%       0+0k 0+33240io 0pf+0w
> OMP_NUM_THREADS=5: 251.813u 2.195s 1:03.51 399.9%       0+0k 0+33240io 0pf+0w
> OMP_NUM_THREADS=6: 294.103u 2.429s 1:02.11 477.4%       0+0k 0+33240io 0pf+0w
> OMP_NUM_THREADS=7: 329.413u 2.686s 1:01.91 536.4%       0+0k 0+33240io 0pf+0w
> OMP_NUM_THREADS=8: 374.467u 2.488s 1:01.12 616.7%       0+0k 0+33240io 0pf+0w
> 
> 1.2) wien2k serial benchmark run with k-point parallelism
> - process started with command 'x lapw1 -p'
> - OMP_NUM_THREADS=1, GOTO_NUM_THREADS=1
> - varying .machines file with N lines, N from 1 to 8, where each line is:
> 
> 1:localhost
> 
> k-point parallel N=1:    localhost       k=1     user=116.173    wallclock=116.59
> k-point parallel N=2:    localhost       k=1     user=116.312    wallclock=116.79
> k-point parallel N=3:    localhost       k=1     user=116.254    wallclock=116.66
> k-point parallel N=4:    localhost       k=1     user=116.306    wallclock=116.76
> k-point parallel N=5:    localhost       k=1     user=116.09     wallclock=116.52
> k-point parallel N=6:    localhost       k=1     user=116.218    wallclock=116.66
> k-point parallel N=7:    localhost       k=1     user=116.251    wallclock=116.68
> k-point parallel N=8:    localhost       k=1     user=116.372    wallclock=116.79
> 
> 
> 2) timings for wien2k-08.2-20080407 built with
> - GNU Fortran (GCC) 4.2.3 (4.2.3-6mnb1)
> - GotoBLAS-1.26
> 
> 2.1) wien2k serial benchmark
> - x lapw1 -c
> - varying OMP_NUM_THREADS from 1 to 8
> 
> OMP_NUM_THREADS=1: 195.463u 0.307s 3:15.80 99.9%        0+0k 0+33264io 0pf+0w
> OMP_NUM_THREADS=2: 199.565u 0.569s 2:57.40 112.8%       0+0k 0+33264io 0pf+0w
> OMP_NUM_THREADS=3: 204.145u 0.635s 2:51.02 119.7%       0+0k 0+33264io 0pf+0w
> OMP_NUM_THREADS=4: 211.666u 0.736s 2:49.02 125.6%       0+0k 0+33264io 0pf+0w
> OMP_NUM_THREADS=5: 222.604u 1.032s 2:48.41 132.7%       0+0k 0+33264io 0pf+0w
> OMP_NUM_THREADS=6: 231.258u 0.927s 2:47.54 138.5%       0+0k 0+33264io 0pf+0w
> OMP_NUM_THREADS=7: 243.170u 0.996s 2:46.55 146.5%       0+0k 0+33264io 0pf+0w
> OMP_NUM_THREADS=8: 252.584u 0.916s 2:46.57 152.1%       0+0k 0+33264io 0pf+0w
> 
> 
> --
> Todd Pfaff <pfaff at mcmaster.ca>
> Research & High-Performance Computing Support
> McMaster University, Hamilton, Ontario, Canada
> http://www.rhpcs.mcmaster.ca/~pfaff
> 
> 
> On Tue, 12 Aug 2008, Peter Blaha wrote:
> 
>> Looking at these numbers, I think you should probably invest in
>> ifort + mkl. It does not make sense to buy expensive new hardware and
>> then, with bad software, have it run slower than a 6-year-old PC.
>> Compare your timing with the benchmark page to see what is possible.
>>
>> k-point parallelization: Please read the UG !!! This is fairly simple.
>>
>> 1:localhost:4    utilizes the mpi-parallel version;
>>
>> you need to put N lines
>>
>> 1:localhost
>> 1:localhost
>> ...
>>
>> to specify running N lapw1 processes in parallel.
>>
>> Todd Pfaff wrote:
>>> Peter, thanks for the response.
>>>
>>> I'm getting small speedup from multithreading in libgoto.  Here are
>>> timings from the wien2k serial benchmark:
>>>
>>> OMP_NUM_THREADS=1: 195.463u 0.307s 3:15.80 99.9%        0+0k 0+33264io 0pf+0w
>>> OMP_NUM_THREADS=2: 199.565u 0.569s 2:57.40 112.8%       0+0k 0+33264io 0pf+0w
>>> OMP_NUM_THREADS=3: 204.145u 0.635s 2:51.02 119.7%       0+0k 0+33264io 0pf+0w
>>> OMP_NUM_THREADS=4: 211.666u 0.736s 2:49.02 125.6%       0+0k 0+33264io 0pf+0w
>>> OMP_NUM_THREADS=5: 222.604u 1.032s 2:48.41 132.7%       0+0k 0+33264io 0pf+0w
>>> OMP_NUM_THREADS=6: 231.258u 0.927s 2:47.54 138.5%       0+0k 0+33264io 0pf+0w
>>> OMP_NUM_THREADS=7: 243.170u 0.996s 2:46.55 146.5%       0+0k 0+33264io 0pf+0w
>>> OMP_NUM_THREADS=8: 252.584u 0.916s 2:46.57 152.1%       0+0k 0+33264io 0pf+0w
>>>
>>>
>>> I would like to explore the k-point parallelization.  But when I run
>>> 'x lapw1 -p' it aborts with an error message about being unable to run
>>> lapw1c_mpi.  This appears to me like it's trying to run the fine grained
>>> MPI parallel version.  I'm not building wien2k with mpi so I don't have a
>>> lapw1c_mpi.  I must be misunderstanding something.  What am I doing wrong
>>> that's causing it to try to run this lapw1c_mpi executable?
>>>
>>> Which of these are appropriate .machines files to do k-point
>>> parallelization across N cpu cores on a single machine?
>>>
>>> This?
>>>
>>>    1:localhost:N
>>>
>>> Or this?
>>>
>>>    N:localhost
>>>
>>> And do I need any of these lines?
>>>
>>>    extrafine
>>>    granularity:1
>>>    residue:localhost
>>>
>>> Or do I need something else either in .machines or in some other
>>> file or on the command line?
>>>
>>> --
>>> Todd Pfaff <pfaff at mcmaster.ca>
>>> Research & High-Performance Computing Support
>>> McMaster University, Hamilton, Ontario, Canada
>>> http://www.rhpcs.mcmaster.ca/~pfaff
>>>
>>> On Mon, 11 Aug 2008, Peter Blaha wrote:
>>>
>>>> The program lapw1 spends a large fraction of its time in BLAS routines,
>>>> thus it can benefit from multithreading of GOTOLIBS (or MKL).
>>>> Setting the variables you mentioned to 2 (or 4) you should see a
>>>> speedup. The improvement may depend on many factors but it will be at
>>>> most about 50%.
>>>>
>>>> Another possibility to utilize the multiple cores is to do k-point
>>>> parallelism.
>>>> Generate a .machines file that lists your machine name 2, 4 or 8 times
>>>> and test the performance with     x lapw1 -p.
>>>> On some architectures (with slow memory bus) it can be that only 4
>>>> parallel jobs give best performance (because the slow memory bus cannot
>>>> feed all 8 cpus properly), on others you can use 8 parallel jobs.
>>>> Sometimes a mixture (4 k-point parallel + OMP_NUM_THREADS=2) is best.
>>>>
>>>> Todd Pfaff wrote:
>>>>> We're using:
>>>>>
>>>>>    wien2k-08.2-20080407
>>>>>
>>>>> built with:
>>>>>
>>>>>    GNU Fortran (GCC) 4.2.3 (4.2.3-6mnb1)
>>>>>    GotoBLAS-1.26
>>>>>
>>>>> and running on an 8 core (2 x quad core) Xeon machine.
>>>>>
>>>>> Can wien2k take advantage of multithreading inherent to GotoBLAS
>>>>> when either GOTO_NUM_THREADS or OMP_NUM_THREADS is set?
>>>>>
>>>>> If so, can someone provide, or direct me to a document about details of
>>>>> the best way to build and run wien2k for such an environment?
>>>>>
>>>>> Thank you,
>>>>> --
>>>>> Todd Pfaff <pfaff at mcmaster.ca>
>>>>> Research & High-Performance Computing Support
>>>>> McMaster University, Hamilton, Ontario, Canada
>>>>> http://www.rhpcs.mcmaster.ca/~pfaff
>>>>> _______________________________________________
>>>>> Wien mailing list
>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>

-- 
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------