[Wien] wien2k, gotoblas and multi threads

Wed Aug 13 19:21:37 CEST 2008

And I'm sure if my colleagues from physics/chemistry tried to explain
the "k-points" to me I'd still be baffled.  :)

I'm trying to encourage them - the physics/chemistry researchers for whom
I am being asked to do this computer systems work - to subscribe to the
wien2k mailing list themselves since it's clearly they who are better 
equipped to understand the discussions there.

Thanks very much for your assistance.

--
Todd Pfaff <pfaff at mcmaster.ca>
Research & High-Performance Computing Support
McMaster University, Hamilton, Ontario, Canada
http://www.rhpcs.mcmaster.ca/~pfaff

On Wed, 13 Aug 2008, Peter Blaha wrote:

> These numbers make "sense".
>
> k-point parallelization:
>
> In "real" cases, one has to solve an eigenvalue problem for "many"
> k-points ("Many" means typically 10-1000). In these cases, k-point
> parallelism is very efficient.
>
> The benchmark case has only ONE k-point in its *.klist file, thus
> there's no k-point parallelism.
>
> When you edit the *.klist file (e.g. repeat the 1st line 8 times),
> you will see that the sequential run takes almost exactly 8 times
> as long. However, with k-point parallelism you will probably get a
> speedup of 4-6 on your machine.
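
As a rough worked example of the estimate above, the following sketch takes the single k-point wallclock time of 116.59 s (the N=1 run in section 1.2 of the quoted message further down) and applies the quoted speedup range of 4-6. The variable names are illustrative only, and the result is an estimate, not a measurement.

```python
single_kpoint_s = 116.59   # wallclock for one k-point (N=1 run, section 1.2)
n_kpoints = 8              # first *.klist line repeated 8 times

# A sequential run handles the k-points one after another.
sequential_s = n_kpoints * single_kpoint_s

# Apply the estimated k-point parallel speedup range of 4 to 6.
for speedup in (4, 6):
    parallel_s = sequential_s / speedup
    print(f"speedup {speedup}: about {parallel_s:.0f} s "
          f"instead of {sequential_s:.0f} s")
```

The sequential estimate comes to roughly 933 s, so a speedup of 4-6 would bring the run down to somewhere between about 155 s and 233 s.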
>
> Still, you can find out which is more efficient on your specific machine:
> use (for 8 k-points) 8 lines in .machines and OMP_NUM_THREADS=1, or only
> 4 lines and OMP_NUM_THREADS=2.
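
The combinations worth timing can be listed with a small sketch. It assumes 8 cores, as on the dual quad-core machine in this thread, and the helper name `core_splits` is hypothetical, not part of WIEN2k. It only enumerates the splits that exactly fill the cores; which one is fastest still has to be measured.

```python
def core_splits(cores):
    """Return (k_point_jobs, omp_threads) pairs whose product equals `cores`."""
    return [(jobs, cores // jobs)
            for jobs in range(1, cores + 1)
            if cores % jobs == 0]

# Each pair is one candidate setup: number of .machines lines and threads per job.
for jobs, threads in core_splits(8):
    print(f"{jobs} lines in .machines, OMP_NUM_THREADS={threads}")
```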
>
> I'm sure the colleagues from physics/chemistry can explain the
> "k-points" to you.
>
> Regards
>
> Todd Pfaff schrieb:
>> I get much better timings for the serial benchmark using an ifort+mkl
>> version of wien2k on the same machine.  I'm not seeing any speedup
>> with k-point parallelization yet though.
>>
>> - machine: dual Xeon quad-core E5430 @ 2.66GHz with 8GB 667MHz RAM
>>
>> 1) timings for wien2k-08.2-20080407 built with
>> - ifort 10.1.017
>> - mkl 10.0.3.020
>>
>> 1.1) wien2k serial benchmark
>> - x lapw1 -c
>> - varying OMP_NUM_THREADS from 1 to 8
>>
>> OMP_NUM_THREADS=1: 116.292u 0.386s 1:56.69 99.9%        0+0k 0+33256io 0pf+0w
>> OMP_NUM_THREADS=2: 148.964u 0.963s 1:17.11 194.4%       0+0k 0+33240io 0pf+0w
>> OMP_NUM_THREADS=3: 182.932u 1.495s 1:11.11 259.3%       0+0k 0+33240io 0pf+0w
>> OMP_NUM_THREADS=4: 213.973u 1.356s 1:03.52 338.9%       0+0k 0+33240io 0pf+0w
>> OMP_NUM_THREADS=5: 251.813u 2.195s 1:03.51 399.9%       0+0k 0+33240io 0pf+0w
>> OMP_NUM_THREADS=6: 294.103u 2.429s 1:02.11 477.4%       0+0k 0+33240io 0pf+0w
>> OMP_NUM_THREADS=7: 329.413u 2.686s 1:01.91 536.4%       0+0k 0+33240io 0pf+0w
>> OMP_NUM_THREADS=8: 374.467u 2.488s 1:01.12 616.7%       0+0k 0+33240io 0pf+0w
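
To make the scaling in the table above easier to read, this sketch converts the elapsed (wall-clock) column into a speedup relative to the single-thread run. The times are copied from table 1.1; the helper name `to_seconds` is an assumption for illustration.

```python
def to_seconds(elapsed):
    """Convert an elapsed time in 'M:SS.ss' form to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)

# Wall-clock times from table 1.1 (ifort + mkl), keyed by OMP_NUM_THREADS.
elapsed = {1: "1:56.69", 2: "1:17.11", 3: "1:11.11", 4: "1:03.52",
           5: "1:03.51", 6: "1:02.11", 7: "1:01.91", 8: "1:01.12"}

baseline = to_seconds(elapsed[1])
speedups = {n: baseline / to_seconds(t) for n, t in elapsed.items()}

for n, s in speedups.items():
    print(f"OMP_NUM_THREADS={n}: speedup {s:.2f}, efficiency {s / n:.0%}")
```

On these numbers the speedup levels off at roughly 1.8 to 1.9 from 4 threads onward, while user CPU time keeps growing.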
>>
>> 1.2) wien2k serial benchmark run with k-point parallelism
>> - process started with command 'x lapw1 -p'
>> - OMP_NUM_THREADS=1, GOTO_NUM_THREADS=1
>> - varying .machines file with N lines, N from 1 to 8, where each line is:
>>
>> 1:localhost
>>
>> k-point parallel N=1:    localhost       k=1     user=116.173    wallclock=116.59
>> k-point parallel N=2:    localhost       k=1     user=116.312    wallclock=116.79
>> k-point parallel N=3:    localhost       k=1     user=116.254    wallclock=116.66
>> k-point parallel N=4:    localhost       k=1     user=116.306    wallclock=116.76
>> k-point parallel N=5:    localhost       k=1     user=116.09     wallclock=116.52
>> k-point parallel N=6:    localhost       k=1     user=116.218    wallclock=116.66
>> k-point parallel N=7:    localhost       k=1     user=116.251    wallclock=116.68
>> k-point parallel N=8:    localhost       k=1     user=116.372    wallclock=116.79
>>
>>
>> 2) timings for wien2k-08.2-20080407 built with
>> - GNU Fortran (GCC) 4.2.3 (4.2.3-6mnb1)
>> - GotoBLAS-1.26
>>
>> 2.1) wien2k serial benchmark
>> - x lapw1 -c
>> - varying OMP_NUM_THREADS from 1 to 8
>>
>> OMP_NUM_THREADS=1: 195.463u 0.307s 3:15.80 99.9%        0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=2: 199.565u 0.569s 2:57.40 112.8%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=3: 204.145u 0.635s 2:51.02 119.7%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=4: 211.666u 0.736s 2:49.02 125.6%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=5: 222.604u 1.032s 2:48.41 132.7%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=6: 231.258u 0.927s 2:47.54 138.5%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=7: 243.170u 0.996s 2:46.55 146.5%       0+0k 0+33264io 0pf+0w
>> OMP_NUM_THREADS=8: 252.584u 0.916s 2:46.57 152.1%       0+0k 0+33264io 0pf+0w
>>
>>
>> --
>> Todd Pfaff <pfaff at mcmaster.ca>
>> Research & High-Performance Computing Support
>> McMaster University, Hamilton, Ontario, Canada
>> http://www.rhpcs.mcmaster.ca/~pfaff
>>
>>
>> On Tue, 12 Aug 2008, Peter Blaha wrote:
>>
>>> These numbers tell me that you should probably invest in
>>> ifort + mkl. It does not make sense to buy expensive new hardware if,
>>> with bad software, it runs slower than a 6-year-old PC.
>>> Compare your timings with the benchmark page to see what is possible.
>>>
>>> k-point parallelization: Please read the UG! This is fairly simple.
>>>
>>> 1:localhost:4    utilizes the mpi-parallel version;
>>>
>>> instead, you need to put N lines
>>>
>>> 1:localhost
>>> 1:localhost
>>> ...
>>>
>>> to specify running N lapw1 processes in parallel.
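
A minimal sketch of generating such a file follows. It assumes a single host and only the `1:localhost` lines described above; the function name `machines_lines` is hypothetical, and any further .machines options (such as the ones asked about in the quoted message below) are left out because the thread does not settle whether they are needed.

```python
def machines_lines(n_jobs, host="localhost"):
    """Build .machines lines for n_jobs k-point parallel lapw1 processes on one host."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    # One "1:<host>" line per parallel lapw1 process.
    return [f"1:{host}" for _ in range(n_jobs)]

text = "\n".join(machines_lines(4)) + "\n"
print(text, end="")
```

Writing `text` to a file named .machines in the case directory and then running 'x lapw1 -p' would start 4 lapw1 processes, following the description above.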
>>>
>>> Todd Pfaff schrieb:
>>>> Peter, thanks for the response.
>>>>
>>>> I'm getting small speedup from multithreading in libgoto.  Here are
>>>> timings from the wien2k serial benchmark:
>>>>
>>>> OMP_NUM_THREADS=1: 195.463u 0.307s 3:15.80 99.9%        0+0k 0+33264io 0pf+0w
>>>> OMP_NUM_THREADS=2: 199.565u 0.569s 2:57.40 112.8%       0+0k 0+33264io 0pf+0w
>>>> OMP_NUM_THREADS=3: 204.145u 0.635s 2:51.02 119.7%       0+0k 0+33264io 0pf+0w
>>>> OMP_NUM_THREADS=4: 211.666u 0.736s 2:49.02 125.6%       0+0k 0+33264io 0pf+0w
>>>> OMP_NUM_THREADS=5: 222.604u 1.032s 2:48.41 132.7%       0+0k 0+33264io 0pf+0w
>>>> OMP_NUM_THREADS=6: 231.258u 0.927s 2:47.54 138.5%       0+0k 0+33264io 0pf+0w
>>>> OMP_NUM_THREADS=7: 243.170u 0.996s 2:46.55 146.5%       0+0k 0+33264io 0pf+0w
>>>> OMP_NUM_THREADS=8: 252.584u 0.916s 2:46.57 152.1%       0+0k 0+33264io 0pf+0w
>>>>
>>>>
>>>> I would like to explore the k-point parallelization.  But when I run
>>>> 'x lapw1 -p' it aborts with an error message about being unable to run
>>>> lapw1c_mpi.  It looks to me as if it's trying to run the fine-grained
>>>> MPI parallel version.  I'm not building wien2k with mpi so I don't have a
>>>> lapw1c_mpi.  I must be misunderstanding something.  What am I doing wrong
>>>> that's causing it to try to run this lapw1c_mpi executable?
>>>>
>>>> Which of these are appropriate .machines files to do k-point
>>>> parallelization across N cpu cores on a single machine?
>>>>
>>>> This?
>>>>
>>>>    1:localhost:N
>>>>
>>>> Or this?
>>>>
>>>>    N:localhost
>>>>
>>>> And do I need any of these lines?
>>>>
>>>>    extrafine
>>>>    granularity:1
>>>>    residue:localhost
>>>>
>>>> Or do I need something else either in .machines or in some other
>>>> file or on the command line?
>>>>
>>>> --
>>>> Todd Pfaff <pfaff at mcmaster.ca>
>>>> Research & High-Performance Computing Support
>>>> McMaster University, Hamilton, Ontario, Canada
>>>> http://www.rhpcs.mcmaster.ca/~pfaff
>>>>
>>>> On Mon, 11 Aug 2008, Peter Blaha wrote:
>>>>
>>>>> The program lapw1 spends a large fraction of its time in BLAS routines,
>>>>> so it can benefit from the multithreading in GotoBLAS (or MKL).
>>>>> If you set the variables you mentioned to 2 (or 4), you should see a
>>>>> speedup. The improvement may depend on many factors, but it will be at
>>>>> most about 50%.
>>>>>
>>>>> Another way to use the multiple cores is k-point parallelism.
>>>>> Generate a .machines file that lists your machine name 2, 4 or 8 times
>>>>> and test the performance with 'x lapw1 -p'.
>>>>> On some architectures (with a slow memory bus) it can happen that only 4
>>>>> parallel jobs give the best performance (because the slow memory bus cannot
>>>>> feed all 8 cpus properly); on others you can use 8 parallel jobs.
>>>>> Sometimes a mixture (4 k-point parallel + OMP_NUM_THREADS=2) is best.
>>>>>
>>>>> Todd Pfaff schrieb:
>>>>>> We're using:
>>>>>>
>>>>>>    wien2k-08.2-20080407
>>>>>>
>>>>>> built with:
>>>>>>
>>>>>>    GNU Fortran (GCC) 4.2.3 (4.2.3-6mnb1)
>>>>>>    GotoBLAS-1.26
>>>>>>
>>>>>> and running on an 8 core (2 x quad core) Xeon machine.
>>>>>>
>>>>>> Can wien2k take advantage of multithreading inherent to GotoBLAS
>>>>>> when either GOTO_NUM_THREADS or OMP_NUM_THREADS is set?
>>>>>>
>>>>>> If so, can someone provide, or direct me to, a document detailing
>>>>>> the best way to build and run wien2k for such an environment?
>>>>>>
>>>>>> Thank you,
>>>>>> --
>>>>>> Todd Pfaff <pfaff at mcmaster.ca>
>>>>>> Research & High-Performance Computing Support
>>>>>> McMaster University, Hamilton, Ontario, Canada
>>>>>> http://www.rhpcs.mcmaster.ca/~pfaff
>>>>>> _______________________________________________
>>>>>> Wien mailing list
>>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien