[Wien] problem in parallel calculations

Fri Apr 20 09:13:10 CEST 2012

When you have very little experience, the first thing to do is:

Forget mpi-parallelization. The problem is probably with scalapack, since lapw0_mpi seems to run in your second example.
Or simply try setting     setenv MPI_REMOTE 0    in $WIENROOT/parallel_options

Let's focus on sequential runs only.

A system with just 8 atoms should run for only a couple of seconds (although with all your debugging switches on
it will take a bit longer).

 > lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 tachyon2673:8
lapw0 still runs in mpi-mode and needs 17 seconds.

 >  > lapw0 -p (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012
 > -------- .machine0 : 48 processors
 > 83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w

However: you have only 8 atoms and lapw0 is parallelized only over atoms. Thus use ONLY 8 cores for this run.
You will see that the time decreases, since the fft-part is probably very slow with so many cores.
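
As a rough check of this, the csh timing line quoted above can be turned into an "effective number of busy cores", (user + system) / elapsed. This is only a sketch: the helper names parse_csh_time and busy_cores are made up for illustration and are not part of WIEN2k.

```python
import re

def parse_csh_time(line):
    """Parse a csh-style timing line: '<user>u <sys>s <elapsed> <pct>% ...'."""
    m = re.match(r"\s*([\d.]+)u\s+([\d.]+)s\s+([\d:.]+)\s+([\d.]+)%", line)
    if m is None:
        raise ValueError("not a csh time line: %r" % line)
    user, system = float(m.group(1)), float(m.group(2))
    # Elapsed time is [h:]m:s; fold the colon-separated fields into seconds.
    wall = 0.0
    for part in m.group(3).split(":"):
        wall = wall * 60.0 + float(part)
    return user, system, wall

def busy_cores(line):
    """Average number of cores kept busy: (user + system) / elapsed."""
    user, system, wall = parse_csh_time(line)
    return (user + system) / wall

# lapw0 timing line from the quoted dayfile (run on 48 processors)
line = "83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w"
print("busy cores: %.1f of 48" % busy_cores(line))
```

The result is fewer than 6 busy cores out of 48, which fits the picture that most of the processes have nothing to do for an 8-atom case.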

--------
 > This shows LAPW1 cycle runs with 6 nodes and each node calculates with 8 k-point.
 > The time is almost 1:30 hour!
 > My system contains only 9 Bi atoms and the inversion symmetry is assumed so the number of symmetry operator is 4. So I expected much less time than those.

Yes, this is VERY strange. Part of it may come from "-mcmodel=medium -CB -g", which should probably not
be there in production runs.
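
To make this concrete, here is a small sketch that drops exactly those three switches from the compiler option string quoted in your mail. The function name strip_debug_flags is hypothetical; -traceback is left in because it is not one of the switches named above, and no optimization flags are added, since which ones are appropriate depends on the compiler version and machine.

```python
# Switches named above as probably not belonging in a production build.
DEBUG_FLAGS = {"-mcmodel=medium", "-CB", "-g"}

def strip_debug_flags(options):
    """Return the option string with the debugging switches removed."""
    return " ".join(tok for tok in options.split() if tok not in DEBUG_FLAGS)

# Compiler options as quoted in the original mail
fopt = ("-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML "
        "-mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include")

print(strip_debug_flags(fopt))
```

The cleaned string would then have to be entered in both the O and FP option lines and the code recompiled.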

Do a    grep HORB  case.output1_1

It should give you some cpu/wall time information on 3 different parts of the code.
Also,    grep :RKM case.scf1     will tell you what your matrix size is. For 8 atoms it should
not be larger than ~1000-2000, and the cpu time should be in the range of 10-30 seconds/k-point.
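
For comparison with that 10-30 seconds/k-point range, the "Summary of lapw1para" lines in your dayfile can be converted to cpu seconds per k-point. This is a minimal sketch; the function name seconds_per_kpoint and the constant EXPECTED_MAX are made up for illustration.

```python
def seconds_per_kpoint(summary_line):
    """User cpu seconds per k-point from a line like
    'tachyon1001 k=8 user=5167.78 wallclock=86'."""
    fields = dict(tok.split("=") for tok in summary_line.split() if "=" in tok)
    return float(fields["user"]) / int(fields["k"])

# Lines copied from the quoted "Summary of lapw1para" (without the "> " prefix)
summary = [
    "tachyon1001 k=8 user=5167.78 wallclock=86",
    "tachyon1469 k=8 user=5222.57 wallclock=87",
    "tachyon2585 k=8 user=5148.92 wallclock=85",
    "tachyon1214 k=8 user=5170.79 wallclock=86",
    "tachyon2943 k=8 user=5105.16 wallclock=85",
    "tachyon1154 k=8 user=5065.18 wallclock=84",
]

EXPECTED_MAX = 30.0  # upper end of the 10-30 s/k-point range, in seconds

for line in summary:
    host = line.split()[0]
    t = seconds_per_kpoint(line)
    print("%s: %.0f s/k-point, %.0fx the expected upper bound"
          % (host, t, t / EXPECTED_MAX))
```

Every node comes out at well over 600 seconds per k-point, i.e. more than 20 times the upper end of the expected range.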

> Compiler option
> O Compiler options: -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
> L Linker Flags: $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
> P Preprocessor flags '-DParallel'
> R R_LIB (LAPACK+BLAS): -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
>
> RP RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
> FP FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
> MP MPIRUN commando : mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 48 -mca oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_ -machinefile _HOSTS_ _EXEC_
>
>
>
> Within this environment, the compilation goes without any error messages.

> case2 : k-point parallelism
> *2. Only for the case of k-point parallelism, in this case, I just put the total number of cpu as 384/8=48. *
>
> The generated .machine file is
>
> lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 tachyon2673:8
> 1:tachyon1119
> 1:tachyon2665
> 1:tachyon3150
> 1:tachyon2896
> 1:tachyon1519
> 1:tachyon2673
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> And the generated .processes file is
> init:tachyon1119
> init:tachyon2665
> init:tachyon3150
> init:tachyon2896
> init:tachyon1519
> init:tachyon2673
> 1 : tachyon1119 : 8 : 1 : 1
> 2 : tachyon2665 : 8 : 1 : 2
> 3 : tachyon3150 : 8 : 1 : 3
> 4 : tachyon2896 : 8 : 1 : 4
> 5 : tachyon1519 : 8 : 1 : 5
> 6 : tachyon2673 : 8 : 1 : 6
>
> And the calculation is going smooth until it gets the time limitation. But the problem is time consuming.
> Below the .dayfile is presented.
>
> start (Wed Apr 18 18:02:36 KST 2012) with lapw0 (40/99 to go)
>
> cycle 1 (Wed Apr 18 18:02:36 KST 2012) (40/99 to go)
>
>  > lapw0 -p (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012
> -------- .machine0 : 48 processors
> 83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w
> :FORCE convergence: 0 1 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO
>  > lapw1 -p (18:02:55) starting parallel lapw1 at Wed Apr 18 18:02:55 KST 2012
> -> starting parallel LAPW1 jobs at Wed Apr 18 18:02:55 KST 2012
> running LAPW1 in parallel mode (using .machines)
> 6 number_of_parallel_jobs
> tachyon1001(8) 5167.781u 7.573s 1:26:16.75 99.9% 0+0k 0+0io 175pf+0w
> tachyon1469(8) 5222.568u 8.425s 1:27:12.71 99.9% 0+0k 0+0io 0pf+0w
> tachyon2585(8) 5148.924u 7.837s 1:25:58.19 99.9% 0+0k 0+0io 19pf+0w
> tachyon1214(8) 5170.790u 5.684s 1:26:17.83 99.9% 0+0k 0+0io 0pf+0w
> tachyon2943(8) 5105.165u 5.959s 1:25:12.38 99.9% 0+0k 0+0io 0pf+0w
> tachyon1154(8) 5065.181u 6.282s 1:24:32.88 99.9% 0+0k 0+0io 74pf+0w
> Summary of lapw1para:
> tachyon1001 k=8 user=5167.78 wallclock=86
> tachyon1469 k=8 user=5222.57 wallclock=87
> tachyon2585 k=8 user=5148.92 wallclock=85
> tachyon1214 k=8 user=5170.79 wallclock=86
> tachyon2943 k=8 user=5105.16 wallclock=85
> tachyon1154 k=8 user=5065.18 wallclock=84
> 30883.253u 49.709s 1:27:15.42 590.8% 0+0k 0+0io 276pf+0w
>
> This shows LAPW1 cycle runs with 6 nodes and each node calculates with 8 k-point.
> The time is almost 1:30 hour!
> My system contains only 9 Bi atoms and the inversion symmetry is assumed so the number of symmetry operator is 4. So I expected much less time than those.
> Can anybody help me?
>
> Sincerely,
>
> HJ Kim.

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------