[Wien] MPI vs multi-thread?
Peter Blaha
pblaha at theochem.tuwien.ac.at
Wed Jan 13 08:57:36 CET 2016
It is not so easy to give unique answers to this question, as the
performance depends on:
a) your case (size of the problem)
b) your specific hardware (in particular network speed)
c) your MPI and MKL software (versions).
In my experience (but see the above remarks), and this is what is
clearly written in the UG about parallelization:
I run small cases (usually below 50 atoms/cell) on a simple local PC
cluster with a Gigabit network, on about 10-50 cores (depending on size
and number of k-points). For these cases I use k-point parallelism and
OMP_NUM_THREADS=2. OMP_NUM_THREADS=4 gives me only a very small
additional speedup, so I do not use it (maybe with the latest mkl ...
??), but I have never experienced a "crash" after 2 cycles ???
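Such a k-point-parallel setup might be sketched as below (a sketch only:
`node1`/`node2` are hypothetical host names, and the exact .machines
syntax is described in the UG, so check it against your version):

```
# Hypothetical .machines file for k-point parallelism on a Gigabit cluster:
# one line per k-point job; the leading "1:" is the relative machine speed.
1:node1
1:node1
1:node2
1:node2
granularity:1
extrafine:1
```

Before starting the parallel SCF cycle you would then set
OMP_NUM_THREADS=2 in the environment, as described above.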
I run larger cases (where the matrix size is too big for a single
computer) on a big cluster with 16-core nodes, Infiniband and a queuing
system. The MINIMUM number of mpi-jobs is 16 (with fewer it is usually
useless), but for cases with a couple of hundred atoms/cell I have also
used up to 512 cores. Often I combine k-point parallelism (usually we
have only 1-8 k-points for such large cells) with mpi-parallelism.
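A combined k-point + mpi setup on such a cluster might look roughly like
this (again a sketch with hypothetical host names `n001`/`n002`; the
.machines format, including the lapw0 line, is documented in the UG):

```
# Hypothetical .machines file: 2 k-point jobs, each an mpi-job
# running on all 16 cores of one Infiniband node.
1:n001:16
1:n002:16
granularity:1
extrafine:1
lapw0:n001:16
```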
Final remarks:
On a Gigabit network mpi-parallel is "useless".
The mpi-parallel version is about a factor of 2 "slower" and takes 2x as
much memory as the sequential code. Thus you need a "sizable" number of
cores; mpi on a single "quadcore" CPU is therefore also not very useful.
And for large cases, ALWAYS use "iterative diagonalization" (and an
"adapted (optimized)" RKMAX and k-point mesh), otherwise calculations
will run "forever"!!
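For the iterative diagonalization, the -it switch of the run scripts is
the relevant one; a typical invocation of a parallel SCF cycle might
look like this (flags as I recall them from the UG, so verify against
your version):

```
# Parallel SCF cycle (-p uses .machines) with iterative diagonalization (-it)
export OMP_NUM_THREADS=2
run_lapw -p -it
```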
On 01/13/2016 08:25 AM, Hu, Wenhao wrote:
> Hi, all:
>
> I ran into some confusion when trying to compare the efficiency of MPI
> and multi-threaded calculations. In the lapw1 stage of the same case, I
> found that MPI takes about twice as long as the multi-threaded run.
> Moreover, it even takes longer than k-point parallelization without any
> multi-thread setup. Can anyone tell me in which cases MPI gives better
> performance? Another question is about the number of threads per job.
> When I increase OMP_NUM_THREADS from 2 to 4, my process usually crashes
> after two cycles, although it does speed up the cycles that finish. Is
> this normal? Is there an optimal number of threads?
>
> Best,
> Wenhao
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------