[Wien] MPI vs multi-thread?
Peter Blaha
pblaha at theochem.tuwien.ac.at
Wed Jan 13 08:57:36 CET 2016
It is not so easy to give unique answers to this question, as the
performance depends on:
a) your case (size of the problem)
b) your specific hardware (in particular network speed)
c) your MPI and MKL software (versions).
In my experience (but see the above remarks), and this is what is
clearly written in the UG about parallelization:
I run small cases (usually below 50 atoms/cell) on a simple local PC
cluster with a Gigabit network, on about 10-50 cores (depending on size
and number of k-points). For these cases I use k-point parallelism and
OMP_NUM_THREADS=2. OMP_NUM_THREADS=4 gives me only a very small
additional speedup, so I do not use it (maybe with the latest mkl ...
??), but I have never experienced a "crash" after 2 cycles ???
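Such a k-point-parallel setup might be sketched as below (a sketch only:
`node1`/`node2` are hypothetical host names, and the exact .machines
syntax is described in the UG, so check it against your version):

```
# Hypothetical .machines file for k-point parallelism on a Gigabit cluster:
# one line per k-point job; the leading "1:" is the relative machine speed.
1:node1
1:node1
1:node2
1:node2
granularity:1
extrafine:1
```

Before starting the parallel SCF cycle you would then set
OMP_NUM_THREADS=2 in the environment, as described above.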
I run larger cases (where the matrix size is too big for a single
computer) on a big cluster with 16-core nodes, Infiniband and a queuing
system. The MINIMUM number of mpi-jobs is 16 (with fewer it is usually
useless), but for cases with a couple of hundred atoms/cell I have also
used up to 512 cores. Often I combine k-point parallelism (usually we
have only 1-8 k-points for such large cells) with mpi-parallelism.
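A combined k-point + mpi setup on such a cluster might look roughly like
this (again a sketch with hypothetical host names `n001`/`n002`; the
.machines format, including the lapw0 line, is documented in the UG):

```
# Hypothetical .machines file: 2 k-point jobs, each an mpi-job
# running on all 16 cores of one Infiniband node.
1:n001:16
1:n002:16
granularity:1
extrafine:1
lapw0:n001:16
```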
Final remarks:
On a Gigabit network mpi-parallel is "useless".
The mpi-parallel version is about a factor of 2 "slower" and takes 2x as
much memory as the sequential code. Thus you need a "sizable" number of
cores; mpi on a single "quadcore" CPU is therefore also not very useful.
And for large cases, ALWAYS use "iterative diagonalization" (and an
"adapted (optimized)" RKMAX and k-point mesh), otherwise calculations
will run "forever"!!
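For the iterative diagonalization, the -it switch of the run scripts is
the relevant one; a typical invocation of a parallel SCF cycle might
look like this (flags as I recall them from the UG, so verify against
your version):

```
# Parallel SCF cycle (-p uses .machines) with iterative diagonalization (-it)
export OMP_NUM_THREADS=2
run_lapw -p -it
```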
On 01/13/2016 08:25 AM, Hu, Wenhao wrote:
> Hi, all:
>
> I ran into some confusion when trying to compare the efficiency of MPI
> and multi-threaded calculations. In the lapw1 stage of the same case, I
> found that MPI takes about twice as long as the multi-threaded run.
> Moreover, it even takes longer than k-point parallelization without any
> multi-thread setup. Can anyone tell me in which cases MPI gives better
> performance? Another question is about the number of threads per job.
> When I increase OMP_NUM_THREADS from 2 to 4, my process usually crashes
> after two cycles, although it does speed up the cycles that finish. Is
> this normal? Is there an optimal number of threads?
>
> Best,
> Wenhao
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------