[Wien] problem in parallel calculations

Fri Apr 20 15:19:00 CEST 2012

Let me expand slightly on my response, as well as Peter's longer (and
better) one.

In your particular case your problem probably relates to some
incorrect parameters in your MPIRUN command. Please go to the file
$WIENROOT/parallel_options and change it to:

setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"

and make the same change in $WIENROOT/OPTIONS.

Many of the additional parameters you have included should not be
needed and are probably creating problems. There also appear to be
some communication issues, although these might simply be a
consequence of the incorrect parameters.
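
To make the template above less abstract, here is a minimal Python
sketch of how the three placeholders end up as a concrete command
line. It only illustrates the substitution; it is not the code the
Wien2k scripts use. The values passed in (8 processes, the host file
.machine0, and the executable string "lapw0_mpi lapw0.def") are
assumptions chosen for the example.

```python
# Template as recommended above. _NP_, _HOSTS_ and _EXEC_ are the
# placeholders that get filled in when a parallel job is launched.
WIEN_MPIRUN = ("mpirun -x LD_LIBRARY_PATH -x PATH "
               "-np _NP_ -machinefile _HOSTS_ _EXEC_")


def expand_mpirun(template, nprocs, hosts_file, executable):
    """Fill in the three placeholders (hypothetical helper for illustration)."""
    return (template.replace("_NP_", str(nprocs))
                    .replace("_HOSTS_", hosts_file)
                    .replace("_EXEC_", executable))


# Example values are assumptions, not taken from a real run.
cmd = expand_mpirun(WIEN_MPIRUN, 8, ".machine0", "lapw0_mpi lapw0.def")
print(cmd)
```

Printing the expanded command like this is a quick way to check that
nothing unexpected is being passed to mpirun.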

Also, the -CB -g flags in your compiler options make things much
slower; they are meant for debugging, not for production runs.

MPI is useful for larger problems, not so much for small ones. Hence
you should first make sure that you understand how to run non-MPI
problems, and go through the examples in the user guide as well as
others in the examples directory. There are also some lecture notes
from the last Wien2k school at
http://www.wien2k.at/reg_user/textbooks/WIEN2k_lecture-notes_2011/

Using MPI is more complicated since most system administrators expect
a different structure for MPI jobs than the one Wien2k uses; for
instance, they expect the user to create just one MPI job. In addition,
Wien2k is much more demanding and requires a correctly installed and
bug-free MPI. (Some other codes do not.) I have had problems with MPI
installations quite a few times, particularly with openmpi, and as I
said in my previous email, 1.3.3 might be a broken version.

A general approach is to go to the Wien2k benchmark page
(http://www.wien2k.at/reg_user/benchmark/) and download both the
serial and parallel benchmarks. Try different combinations to get an
idea of how fast each one runs on your system.
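
As a rough aid for trying such combinations, the Python sketch below
writes out .machines text for a given number of MPI tasks per k-point
job, following the layout of the .machines files quoted further down
in this thread. The function name, its parameters and the host names
are hypothetical, and it assumes that every node has the same number
of cores and that one "1:host" line corresponds to one k-point job.
Check the result against the user guide before using it.

```python
def machines_file(hosts, cores_per_node, mpi_per_job):
    """Return .machines text for one layout (a sketch, not a Wien2k tool).

    mpi_per_job == 1 gives pure k-point parallelism, one line per core.
    mpi_per_job == cores_per_node gives one MPI job per node.
    """
    if cores_per_node % mpi_per_job != 0:
        raise ValueError("mpi_per_job must divide cores_per_node")
    jobs_per_node = cores_per_node // mpi_per_job

    # lapw0 runs once, spread over all cores of all nodes.
    lines = ["lapw0:" + " ".join(f"{h}:{cores_per_node}" for h in hosts)]

    # One line per k-point job; the ":n" suffix requests n MPI tasks.
    for h in hosts:
        entry = f"1:{h}" if mpi_per_job == 1 else f"1:{h}:{mpi_per_job}"
        lines.extend([entry] * jobs_per_node)

    lines += ["granularity:1", "extrafine:1", "lapw2_vector_split:1"]
    return "\n".join(lines) + "\n"


hosts = ["node01", "node02"]  # hypothetical host names
for mpi in (1, 2, 4, 8):
    print(f"--- {mpi} MPI task(s) per k-point job ---")
    print(machines_file(hosts, 8, mpi))
```

Timing the benchmark with each of these layouts should show whether
k-point parallelism, MPI, or a mixture suits a given problem size.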

You may also find the utilities in SRC_mpiutil at
http://www.wien2k.at/reg_user/unsupported useful.

2012/4/19 hyunjung kim <angpangmokjang at hanmail.net>:
> Dear all,
>
> I have been trying to get parallel calculations working for almost a
> month.
>
> I'm working on
> model : SUN Blade 6275 clusters
> Processor: Intel Xeon X5570
> CPU/node : 8cpu
> Memory : 24GB/node, 3GB/core
> Network: Infiniband 40G 8X QDR
> Operation: Redhat Enterprise Linux 5.3
> Job control : SGE 6.2u5
>
> Compiler : intel 11.1 (MKL therein)
> MPI : openMPI 1.3.3
> FFTW: 2.1.5 (FFTW was compiled with intel 11.1 and configured with
> --enable-mpi LDFLAGS=-L$MPIHOME/$LIBRARYPATH F77=ifort
> CC=icc --with-sgi-mp --with-openmp --enable-threads)
>
>
> Compiler option
>  O   Compiler options:        -FR -mp1 -w -prec_div -pc80 -pad -ip
> -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback
> -I$(MKLROOT)/include
>  L   Linker Flags:            $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH)
> -pthread
>  P   Preprocessor flags       '-DParallel'
>  R   R_LIB (LAPACK+BLAS):     -lmkl_lapack -lmkl_intel_lp64
> -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
>
>  RP  RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64
> -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
>  FP  FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip
> -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback
> -I$(MKLROOT)/include
>  MP  MPIRUN commando        : mpirun -mca btl ^tcp -mca
> plm_rsh_num_concurrent 48 -mca oob_tcp_listen_mode listen_thread -mca
> plm_rsh_tree_spawn 1 -np _NP_ -machinefile _HOSTS_ _EXEC_
>
>
>
> Within this environment, the compilation goes without any error messages.
>
> To make the .machines file, I use "proclist=(`cat $TMPDIR/machines`)".
> It gives me the list of nodes, one entry per CPU.
> If I set the total number of CPUs to 384 in the job script, it returns
> 384 entries.
> Since it lists the node name for each CPU, each node appears 8 times.
>
> case1 : k-point parallelism + 8 mpi tasks per k-point
> 1. Since I want to calculate with 48 k-points and 8 mpi tasks per
> node (per k-point), the .machines file was:
>
> lapw0:tachyon2066:8 tachyon1982:8 tachyon1207:8 tachyon1396:8 tachyon1152:8
> tachyon2440:8 tachyon2120:8 tachyon1555:8 tachyon2319:8 tachyon2470:8
> tachyon1612:8 tachyon2274:8 tachyon1402:8 tachyon2846:8 tachyon2091:8
> tachyon1622:8 tachyon1920:8 tachyon2213:8 tachyon1832:8 tachyon2672:8
> tachyon2370:8 tachyon2545:8 tachyon2359:8 tachyon1770:8 tachyon1018:8
> tachyon1456:8 tachyon1429:8 tachyon3074:8 tachyon1169:8 tachyon2400:8
> tachyon2688:8 tachyon1099:8 tachyon2906:8 tachyon1394:8 tachyon1830:8
> tachyon1383:8 tachyon2157:8 tachyon2818:8 tachyon2644:8 tachyon2283:8
> tachyon1213:8 tachyon1542:8 tachyon2726:8 tachyon2152:8 tachyon1135:8
> tachyon2144:8 tachyon3015:8 tachyon2077:8
> 1:tachyon2066:8
> 1:tachyon1982:8
> 1:tachyon1207:8
> 1:tachyon1396:8
> 1:tachyon1152:8
> 1:tachyon2440:8
> 1:tachyon2120:8
> 1:tachyon1555:8
> 1:tachyon2319:8
> 1:tachyon2470:8
> 1:tachyon1612:8
> 1:tachyon2274:8
> 1:tachyon1402:8
> 1:tachyon2846:8
> 1:tachyon2091:8
> 1:tachyon1622:8
> 1:tachyon1920:8
> 1:tachyon2213:8
> 1:tachyon1832:8
> 1:tachyon2672:8
> 1:tachyon2370:8
> 1:tachyon2545:8
> 1:tachyon2359:8
> 1:tachyon1770:8
> 1:tachyon1018:8
> 1:tachyon1456:8
> 1:tachyon1429:8
> 1:tachyon3074:8
> 1:tachyon1169:8
> 1:tachyon2400:8
> 1:tachyon2688:8
> 1:tachyon1099:8
> 1:tachyon2906:8
> 1:tachyon1394:8
> 1:tachyon1830:8
> 1:tachyon1383:8
> 1:tachyon2157:8
> 1:tachyon2818:8
> 1:tachyon2644:8
> 1:tachyon2283:8
> 1:tachyon1213:8
> 1:tachyon1542:8
> 1:tachyon2726:8
> 1:tachyon2152:8
> 1:tachyon1135:8
> 1:tachyon2144:8
> 1:tachyon3015:8
> 1:tachyon2077:8
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> In this case,
>
> case.dayfile shows
>
> on tachyon2066 with PID 13780
> using WIEN2k_11.1 (Release 5/4/2011) in /home01/x584cjh/code/WIEN2k_11
>
>
>     start   (Fri Apr 20 09:13:32 KST 2012) with lapw0 (40/99 to go)
>
>     cycle 1     (Fri Apr 20 09:13:32 KST 2012)  (40/99 to go)
>
>>   lapw0 -p    (09:13:32) starting parallel lapw0 at Fri Apr 20 09:13:32
>> KST 2012
> -------- .machine0 : 384 processors
> tachyon2066:14892:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14892:  open_hca: device mthca0 not found
> tachyon2066:14892:  open_hca: device mthca0 not found
> tachyon2066:14892:  open_hca: device ipath0 not found
> tachyon2066:14892:  open_hca: device ipath0 not found
> tachyon2066:14894:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14894:  open_hca: device mthca0 not found
> tachyon2066:14894:  open_hca: device mthca0 not found
> tachyon2066:14891:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14894:  open_hca: device ipath0 not found
> tachyon2066:14894:  open_hca: device ipath0 not found
> tachyon2319:23519:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14891:  open_hca: device mthca0 not found
> tachyon2066:14891:  open_hca: device mthca0 not found
> tachyon1982:11799:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2319:23519:  open_hca: device mthca0 not found
> tachyon2319:23519:  open_hca: device mthca0 not found
> tachyon1982:11799:  open_hca: device mthca0 not found
> tachyon1982:11799:  open_hca: device mthca0 not found
> tachyon2066:14890:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1982:11801:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1982:11801:  open_hca: device mthca0 not found
> tachyon1982:11801:  open_hca: device mthca0 not found
> tachyon2066:14890:  open_hca: device mthca0 not found
> tachyon2066:14890:  open_hca: device mthca0 not found
> tachyon1982:11805:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14893:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1982:11805:  open_hca: device mthca0 not found
> tachyon1982:11805:  open_hca: device mthca0 not found
> tachyon2066:14891:  open_hca: device ipath0 not found
> tachyon2066:14891:  open_hca: device ipath0 not found
> tachyon2066:14893:  open_hca: device mthca0 not found
> tachyon1982:11803:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1152:9532:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14893:  open_hca: device mthca0 not found
> tachyon1982:11803:  open_hca: device mthca0 not found
> tachyon1982:11803:  open_hca: device mthca0 not found
> tachyon1982:11799:  open_hca: device ipath0 not found
> tachyon1152:9532:  open_hca: device mthca0 not found
> tachyon1152:9532:  open_hca: device mthca0 not found
> tachyon1982:11799:  open_hca: device ipath0 not found
> tachyon2319:23519:  open_hca: device ipath0 not found
> tachyon1982:11801:  open_hca: device ipath0 not found
> tachyon2319:23519:  open_hca: device ipath0 not found
> tachyon1982:11801:  open_hca: device ipath0 not found
> tachyon2066:14888:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1982:11802:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14895:  open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> ~~~~~~~
>
> ~~~~~~~
> abbreviation
> ~~~~~~~
> tachyon2906:7201:  open_hca: device ipath0 not found
> tachyon1920:29303:  open_hca: device ipath0 not found
> tachyon1920:29303:  open_hca: device ipath0 not found
>  'Unknow' - SIGSEGV, contact developers
>  Child id           0 SIGSEGV, contact developers
> 1.198u 2.020s 2:15.54 2.3%  0+0k 0+0io 18pf+0w
> error: command   /home01/x584cjh/code/WIEN2k_11/lapw0para lapw0.def   failed
>
>>   stop error
>
>
> and the log file created is:
>
> --------------------------------------------------------------------------
> WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
> This may be a real error or it may be an invalid entry in the uDAPL
> Registry which is contained in the dat.conf file. Contact your local
> System Administrator to confirm the availability of the interfaces in
> the dat.conf file.
> --------------------------------------------------------------------------
> [tachyon2066:14787] 2243 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [tachyon2066:14787] 3 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] 1 more process has sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] 2 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] 3 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] 1 more process has sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> w2k_dispatch_signal(): received: Segmentation fault
> *** An error occurred in MPI_Comm_f2c
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [tachyon2066:14895] Abort before MPI_INIT completed successfully; not able
> to guarantee that all other processes were killed!
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 7 with PID 14895 on
> node tachyon2066 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> ~~~~~~~
>
> ~~~~~~~
> abbreviation
> ~~~~~~~
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> [tachyon2066:14787] 2 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
>
>
> Which point is missing?
> What should I do?
>
> case2 : k-point parallelism
> 2. For k-point parallelism only, I just set the total number of CPUs
> to 384/8=48.
>
> The generated .machines file is
>
> lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8
> tachyon2673:8
> 1:tachyon1119
> 1:tachyon2665
> 1:tachyon3150
> 1:tachyon2896
> 1:tachyon1519
> 1:tachyon2673
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> And the generated .processes file is
> init:tachyon1119
> init:tachyon2665
> init:tachyon3150
> init:tachyon2896
> init:tachyon1519
> init:tachyon2673
> 1 : tachyon1119 :  8 : 1 : 1
> 2 : tachyon2665 :  8 : 1 : 2
> 3 : tachyon3150 :  8 : 1 : 3
> 4 : tachyon2896 :  8 : 1 : 4
> 5 : tachyon1519 :  8 : 1 : 5
> 6 : tachyon2673 :  8 : 1 : 6
>
> The calculation runs smoothly until it reaches the time limit, but the
> problem is that it takes too long.
> The .dayfile is shown below.
>
>     start   (Wed Apr 18 18:02:36 KST 2012) with lapw0 (40/99 to go)
>
>     cycle 1     (Wed Apr 18 18:02:36 KST 2012)  (40/99 to go)
>
>>   lapw0 -p    (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36
>> KST 2012
> -------- .machine0 : 48 processors
> 83.154u 13.907s 0:17.04 569.5%  0+0k 0+0io 1899pf+0w
> :FORCE convergence: 0 1 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0
> YCO
>>   lapw1  -p   (18:02:55) starting parallel lapw1 at Wed Apr 18 18:02:55
>> KST 2012
> ->  starting parallel LAPW1 jobs at Wed Apr 18 18:02:55 KST 2012
> running LAPW1 in parallel mode (using .machines)
> 6 number_of_parallel_jobs
>      tachyon1001(8) 5167.781u 7.573s 1:26:16.75 99.9%   0+0k 0+0io 175pf+0w
>      tachyon1469(8) 5222.568u 8.425s 1:27:12.71 99.9%   0+0k 0+0io 0pf+0w
>      tachyon2585(8) 5148.924u 7.837s 1:25:58.19 99.9%   0+0k 0+0io 19pf+0w
>      tachyon1214(8) 5170.790u 5.684s 1:26:17.83 99.9%   0+0k 0+0io 0pf+0w
>      tachyon2943(8) 5105.165u 5.959s 1:25:12.38 99.9%   0+0k 0+0io 0pf+0w
>      tachyon1154(8) 5065.181u 6.282s 1:24:32.88 99.9%   0+0k 0+0io 74pf+0w
>    Summary of lapw1para:
>    tachyon1001   k=8     user=5167.78    wallclock=86
>    tachyon1469   k=8     user=5222.57    wallclock=87
>    tachyon2585   k=8     user=5148.92    wallclock=85
>    tachyon1214   k=8     user=5170.79    wallclock=86
>    tachyon2943   k=8     user=5105.16    wallclock=85
>    tachyon1154   k=8     user=5065.18    wallclock=84
> 30883.253u 49.709s 1:27:15.42 590.8%    0+0k 0+0io 276pf+0w
>
> This shows that the LAPW1 cycle runs on 6 nodes and each node handles
> 8 k-points.
> It takes almost 1.5 hours!
> My system contains only 9 Bi atoms, and inversion symmetry is assumed, so
> the number of symmetry operations is 4. So I expected it to take much
> less time.
>
> Can anybody help me?
>
> Sincerely,
>
> HJ Kim.
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>

-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgi