[Wien] pondering wien2k MPI performance

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue Jun 7 07:48:11 CEST 2011


Just adding three more comments to what Robert said:

memory in lapw0: depends mainly on GMAX (case.in2) and/or the IFFT parameters and enhancement factor in case.in0
                  (memory becomes critical for large FFT grids / enhancement factors; parallelization solves the problem)

           lapw1: SCALAPACK diagonalization needs significantly more memory than sequential LAPACK:
                  instead of NMAT*(NMAT+1)/2 (packed storage) you need NMAT**2 elements each for H and S,
                  plus additional large auxiliary arrays. Iterative diagonalization needs another large
                  NMAT**2 array plus the stored vectors (NMAT*NUME).

           lapw2: in most cases the really memory-critical step !!!! There are many cases where lapw1 still
                  does fine in terms of memory, but lapw2 does NOT !!! Solve it by putting
                  lapw2_vector_split:2  (or even 4) into the .machines file (see the sketch below).
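For illustration, a minimal MPI .machines file along these lines might look as follows.
The node names and core counts are placeholders, and the exact keyword syntax should be
checked against the UG of your WIEN2k version:

   # hypothetical .machines: one MPI group over 2 nodes with 4 cores each
   lapw0: node1:4 node2:4
   1:node1:4 node2:4
   granularity:1
   lapw2_vector_split:2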

Check the parallel case.output* files to get an idea about memory allocation.
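To turn this into a number before submitting a job, a very rough back-of-envelope
estimate (my own sketch, not an official WIEN2k formula; the function name is made up)
based only on the H/S storage sizes quoted above could look like the following. It
assumes complex*16 matrices (no inversion symmetry; real matrices take half), and it
ignores the auxiliary arrays, the setup part and lapw2, so treat it as a lower bound:

BYTES_PER_ELEMENT = 16  # complex*16; use 8 for real matrices (inversion symmetry)

def lapw1_matrix_memory_gb(nmat, nume=0, nproc=1, scalapack=True, iterative=False):
    """Rough lower bound: memory for the H and S matrices of lapw1 only."""
    if scalapack:
        elements = 2 * nmat**2                  # full H and S, block-cyclically distributed
    else:
        elements = 2 * nmat * (nmat + 1) // 2   # packed H and S in the sequential code
    if iterative:
        elements += nmat**2 + nmat * nume       # extra full matrix + stored vectors
    total_gb = elements * BYTES_PER_ELEMENT / 1024**3
    return total_gb, total_gb / nproc           # (total, rough share per MPI rank)

# example: NMAT = 50000, NUME = 2000, 64 MPI ranks
total, per_rank = lapw1_matrix_memory_gb(50000, nume=2000, nproc=64)
print(f"matrices only: ~{total:.0f} GB total, ~{per_rank:.1f} GB per rank")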

On 07.06.2011 08:31, Robert Laskowski wrote:
> Hi,
> lapw0 is parallelized in a loop over atoms; there is little communication here,
> apart from fftw (for that, look at the fftw manual). For lapw1, the setup has no
> communication at all; the eigensolver is done with pblas and scalapack calls, where
> both latency and bandwidth are important, but these libraries should be well
> optimized, so I would point more to bandwidth. lapw2 uses two coexisting
> communicators, one for parallelization over atoms and the other for splitting the
> vector (lapw2_vector_split in .machines); for large systems you have to split the
> vector. I guess that here the major time is spent in pblas calls, which are
> matrix/matrix multiplications; however, on some old and less efficient systems
> we have noticed a huge amount of time spent on reading and distributing the vector file.
>
> regards
>
> Robert
>
>
> On Monday 06 June 2011 23:31:44 Kevin Jorissen wrote:
>> Dear wien2k community,
>>
>> I have a few basic questions regarding the MPI/SCALAPACK version of wien2k:
>>
>> * does anyone have a formula for calculating the memory requirements
>> of the code (lapw0/1/2) given, say, nmat and nume and the number of
>> cores used?  It's easy enough for the serial code, but I'm sometimes
>> baffled by the memory taken by each of the MPI threads when
>> distributing the job over N cores.  It's sometimes very different from
>> [serial size in GB] / N_cores.  It makes the queue manager unhappy,
>> and occasionally I unintentionally overload a node this way.
>>
>> * I was asked the following question about the MPI wien2k code:
>>>> So would it be correct to state that your apps are more bandwidth
>>>> sensitive than latency sensitive?
>>
>> and I don't know how to answer.  Thinking about LARGE calculations
>> (hundreds of atoms), I want to say that both will be important ...
>> Does anyone have a more sophisticated insight here?
>>
>>
>>
>> cheers,
>>
>>
>> Kevin Jorissen
>> University of Washington
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------

