[Wien] problem in parallel calculations
Laurence Marks
L-marks at northwestern.edu
Fri Apr 20 03:33:10 CEST 2012
Several suggestions.
a) limit yourself to just using 8 cores on 1 cpu, and something very simple
such as TiC until that works.
b) Just use simple commands such as "x lapw0 -p" until it works.
c) You probably do not need all the additional parameters in your MPIRUN
line; they should already be set at the system level.
d) OpenMPI 1.3.3 is old, and may not work right.
e) Talk to your sysadmin.
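
As a rough illustration of (c), here is a minimal Python sketch that drops every "-mca <key> <value>" triple from the MPIRUN line quoted further down in this message. The function name strip_mca_options is hypothetical, and whether the remaining defaults are right for a given cluster is an assumption to check with the sysadmin.

```python
def strip_mca_options(template):
    """Remove every '-mca <key> <value>' triple from an mpirun template."""
    tokens = template.split()
    kept = []
    i = 0
    while i < len(tokens):
        if tokens[i] in ("-mca", "--mca"):
            i += 3  # skip the flag, its key and its value
        else:
            kept.append(tokens[i])
            i += 1
    return " ".join(kept)


# MPIRUN line as given in the quoted message.
original = (
    "mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 48 "
    "-mca oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 "
    "-np _NP_ -machinefile _HOSTS_ _EXEC_"
)

if __name__ == "__main__":
    print(strip_mca_options(original))
```

The _NP_, _HOSTS_ and _EXEC_ placeholders are left untouched, so the result can be pasted back into the MPIRUN setting for a first simple test.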
---------------------------
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgyi
On Apr 19, 2012 7:49 PM, "hyunjung kim" <angpangmokjang at hanmail.net> wrote:
> Dear all,
>
> It has been almost a month since I started trying to get parallel
> calculations to work.
>
> I'm working on
> model : SUN Blade 6275 clusters
> Processor: Intel Xeon X5570
> CPU/node : 8cpu
> Memory : 24GB/node, 3GB/core
> Network: Infiniband 40G 8X QDR
> Operation: Redhat Enterprise Linux 5.3
> Job control : SGE 6.2u5
>
> Compiler : intel 11.1 (MKL therein)
> MPI : openMPI 1.3.3
> FFTW: 2.1.5 (FFTW was compiled with intel 11.1 and configured with
> --enable-mpi LDFLAGS=-L$MPIHOME/$LIBRARYPATH F77=ifort
> CC=icc --with-sgi-mp --with-openmp --enable-threads)
>
>
> Compiler option
> O Compiler options: -FR -mp1 -w -prec_div -pc80 -pad -ip
> -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback
> -I$(MKLROOT)/include
> L Linker Flags: $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH)
> -pthread
> P Preprocessor flags '-DParallel'
> R R_LIB (LAPACK+BLAS): -lmkl_lapack -lmkl_intel_lp64
> -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
>
> RP RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64
> -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
> FP FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip
> -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback
> -I$(MKLROOT)/include
> MP MPIRUN commando : mpirun -mca btl ^tcp -mca
> plm_rsh_num_concurrent 48 -mca oob_tcp_listen_mode listen_thread -mca
> plm_rsh_tree_spawn 1 -np _NP_ -machinefile _HOSTS_ _EXEC_
>
>
>
> Within this environment, the compilation goes without any error messages.
>
> To make the .machines file, I type "proclist=(`cat $TMPDIR/machines`)".
> It gives me the list of nodes, with one entry per cpu.
> If I set the total number of cpus to 384 in the jobscript file, it returns
> 384 entries.
> Since it returns the node name for each cpu, each node name appears 8
> times.
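
The step described above, turning a list with one hostname per slot into a .machines file, can be sketched in Python roughly as follows. This is only an illustration: the function build_machines and the host names nodeA/nodeB are hypothetical, and the output layout simply mirrors the two .machines files shown in this message (with and without the ":8" MPI count on the k-point lines).

```python
def build_machines(slot_hosts, mpi=True):
    """Build .machines text from a list holding one hostname per slot.

    mpi=True  -> k-point lines like '1:host:8' (MPI tasks per k-point job)
    mpi=False -> k-point lines like '1:host'   (k-point parallelism only)
    """
    counts = {}  # dicts keep insertion order, so hosts stay in input order
    for host in slot_hosts:
        counts[host] = counts.get(host, 0) + 1

    lines = ["lapw0:" + " ".join(f"{h}:{n}" for h, n in counts.items())]
    for host, n in counts.items():
        lines.append(f"1:{host}:{n}" if mpi else f"1:{host}")
    lines += ["granularity:1", "extrafine:1", "lapw2_vector_split:1"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical two-node allocation with 8 slots per node.
    slots = ["nodeA"] * 8 + ["nodeB"] * 8
    print(build_machines(slots), end="")
```

In a real job script the slot list would come from the scheduler's host file (here $TMPDIR/machines), read one line per slot.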
>
> *case1 : k-point parallelism + 8 MPI tasks per k-point*
> *1. In my case, I want to calculate with 48 k-points and 8 MPI tasks per
> node (per k-point), so the .machines file was:*
>
> lapw0:tachyon2066:8 tachyon1982:8 tachyon1207:8 tachyon1396:8
> tachyon1152:8 tachyon2440:8 tachyon2120:8 tachyon1555:8 tachyon2319:8
> tachyon2470:8 tachyon1612:8 tachyon2274:8 tachyon1402:8 tachyon2846:8
> tachyon2091:8 tachyon1622:8 tachyon1920:8 tachyon2213:8 tachyon1832:8
> tachyon2672:8 tachyon2370:8 tachyon2545:8 tachyon2359:8 tachyon1770:8
> tachyon1018:8 tachyon1456:8 tachyon1429:8 tachyon3074:8 tachyon1169:8
> tachyon2400:8 tachyon2688:8 tachyon1099:8 tachyon2906:8 tachyon1394:8
> tachyon1830:8 tachyon1383:8 tachyon2157:8 tachyon2818:8 tachyon2644:8
> tachyon2283:8 tachyon1213:8 tachyon1542:8 tachyon2726:8 tachyon2152:8
> tachyon1135:8 tachyon2144:8 tachyon3015:8 tachyon2077:8
> 1:tachyon2066:8
> 1:tachyon1982:8
> 1:tachyon1207:8
> 1:tachyon1396:8
> 1:tachyon1152:8
> 1:tachyon2440:8
> 1:tachyon2120:8
> 1:tachyon1555:8
> 1:tachyon2319:8
> 1:tachyon2470:8
> 1:tachyon1612:8
> 1:tachyon2274:8
> 1:tachyon1402:8
> 1:tachyon2846:8
> 1:tachyon2091:8
> 1:tachyon1622:8
> 1:tachyon1920:8
> 1:tachyon2213:8
> 1:tachyon1832:8
> 1:tachyon2672:8
> 1:tachyon2370:8
> 1:tachyon2545:8
> 1:tachyon2359:8
> 1:tachyon1770:8
> 1:tachyon1018:8
> 1:tachyon1456:8
> 1:tachyon1429:8
> 1:tachyon3074:8
> 1:tachyon1169:8
> 1:tachyon2400:8
> 1:tachyon2688:8
> 1:tachyon1099:8
> 1:tachyon2906:8
> 1:tachyon1394:8
> 1:tachyon1830:8
> 1:tachyon1383:8
> 1:tachyon2157:8
> 1:tachyon2818:8
> 1:tachyon2644:8
> 1:tachyon2283:8
> 1:tachyon1213:8
> 1:tachyon1542:8
> 1:tachyon2726:8
> 1:tachyon2152:8
> 1:tachyon1135:8
> 1:tachyon2144:8
> 1:tachyon3015:8
> 1:tachyon2077:8
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> In this case,
>
> case.dayfile shows
>
> on tachyon2066 with PID 13780
> using WIEN2k_11.1 (Release 5/4/2011) in /home01/x584cjh/code/WIEN2k_11
>
>
> start (Fri Apr 20 09:13:32 KST 2012) with lapw0 (40/99 to go)
>
> cycle 1 (Fri Apr 20 09:13:32 KST 2012) (40/99 to go)
>
> > lapw0 -p (09:13:32) starting parallel lapw0 at Fri Apr 20 09:13:32
> KST 2012
> -------- .machine0 : 384 processors
> tachyon2066:14892: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14892: open_hca: device mthca0 not found
> tachyon2066:14892: open_hca: device mthca0 not found
> tachyon2066:14892: open_hca: device ipath0 not found
> tachyon2066:14892: open_hca: device ipath0 not found
> tachyon2066:14894: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14894: open_hca: device mthca0 not found
> tachyon2066:14894: open_hca: device mthca0 not found
> tachyon2066:14891: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14894: open_hca: device ipath0 not found
> tachyon2066:14894: open_hca: device ipath0 not found
> tachyon2319:23519: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14891: open_hca: device mthca0 not found
> tachyon2066:14891: open_hca: device mthca0 not found
> tachyon1982:11799: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2319:23519: open_hca: device mthca0 not found
> tachyon2319:23519: open_hca: device mthca0 not found
> tachyon1982:11799: open_hca: device mthca0 not found
> tachyon1982:11799: open_hca: device mthca0 not found
> tachyon2066:14890: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1982:11801: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1982:11801: open_hca: device mthca0 not found
> tachyon1982:11801: open_hca: device mthca0 not found
> tachyon2066:14890: open_hca: device mthca0 not found
> tachyon2066:14890: open_hca: device mthca0 not found
> tachyon1982:11805: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14893: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1982:11805: open_hca: device mthca0 not found
> tachyon1982:11805: open_hca: device mthca0 not found
> tachyon2066:14891: open_hca: device ipath0 not found
> tachyon2066:14891: open_hca: device ipath0 not found
> tachyon2066:14893: open_hca: device mthca0 not found
> tachyon1982:11803: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1152:9532: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14893: open_hca: device mthca0 not found
> tachyon1982:11803: open_hca: device mthca0 not found
> tachyon1982:11803: open_hca: device mthca0 not found
> tachyon1982:11799: open_hca: device ipath0 not found
> tachyon1152:9532: open_hca: device mthca0 not found
> tachyon1152:9532: open_hca: device mthca0 not found
> tachyon1982:11799: open_hca: device ipath0 not found
> tachyon2319:23519: open_hca: device ipath0 not found
> tachyon1982:11801: open_hca: device ipath0 not found
> tachyon2319:23519: open_hca: device ipath0 not found
> tachyon1982:11801: open_hca: device ipath0 not found
> tachyon2066:14888: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon1982:11802: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> tachyon2066:14895: open_hca: getaddr_netdev ERROR: Connection refused. Is
> ib1 configured?
> ~~~~~~~ (output abbreviated) ~~~~~~~
> tachyon2906:7201: open_hca: device ipath0 not found
> tachyon1920:29303: open_hca: device ipath0 not found
> tachyon1920:29303: open_hca: device ipath0 not found
> 'Unknow' - SIGSEGV, contact developers
> Child id 0 SIGSEGV, contact developers
> 1.198u 2.020s 2:15.54 2.3% 0+0k 0+0io 18pf+0w
> error: command /home01/x584cjh/code/WIEN2k_11/lapw0para lapw0.def
> failed
>
> > stop error
>
>
> and the logfile that was created is:
>
> --------------------------------------------------------------------------
> WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
> This may be a real error or it may be an invalid entry in the uDAPL
> Registry which is contained in the dat.conf file. Contact your local
> System Administrator to confirm the availability of the interfaces in
> the dat.conf file.
> --------------------------------------------------------------------------
> [tachyon2066:14787] 2243 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
> [tachyon2066:14787] 3 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] 1 more process has sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] 2 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] 3 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> [tachyon2066:14787] 1 more process has sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
> w2k_dispatch_signal(): received: Segmentation fault
> *** An error occurred in MPI_Comm_f2c
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [tachyon2066:14895] Abort before MPI_INIT completed successfully; not able
> to guarantee that all other processes were killed!
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 7 with PID 14895 on
> node tachyon2066 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> ~~~~~~~ (output abbreviated) ~~~~~~~
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> [tachyon2066:14787] 2 more processes have sent help message
> help-mpi-btl-udapl.txt / dat_ia_open fail
>
>
> What am I missing?
> What should I do?
>
> case2 : k-point parallelism
> *2. For k-point parallelism only, I just set the total number of cpus to
> 384/8=48.*
>
> The generated .machines file is
>
> lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8
> tachyon1519:8 tachyon2673:8
> 1:tachyon1119
> 1:tachyon2665
> 1:tachyon3150
> 1:tachyon2896
> 1:tachyon1519
> 1:tachyon2673
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> And the generated .processes file is
> init:tachyon1119
> init:tachyon2665
> init:tachyon3150
> init:tachyon2896
> init:tachyon1519
> init:tachyon2673
> 1 : tachyon1119 : 8 : 1 : 1
> 2 : tachyon2665 : 8 : 1 : 2
> 3 : tachyon3150 : 8 : 1 : 3
> 4 : tachyon2896 : 8 : 1 : 4
> 5 : tachyon1519 : 8 : 1 : 5
> 6 : tachyon2673 : 8 : 1 : 6
>
> The calculation runs smoothly until it hits the time limit. The problem is
> that it takes too long.
> The .dayfile is shown below.
>
> start (Wed Apr 18 18:02:36 KST 2012) with lapw0 (40/99 to go)
>
> cycle 1 (Wed Apr 18 18:02:36 KST 2012) (40/99 to go)
>
> > lapw0 -p (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36
> KST 2012
> -------- .machine0 : 48 processors
> 83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w
> :FORCE convergence: 0 1 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0
> YCO
> > lapw1 -p (18:02:55) starting parallel lapw1 at Wed Apr 18 18:02:55
> KST 2012
> -> starting parallel LAPW1 jobs at Wed Apr 18 18:02:55 KST 2012
> running LAPW1 in parallel mode (using .machines)
> 6 number_of_parallel_jobs
> tachyon1001(8) 5167.781u 7.573s 1:26:16.75 99.9% 0+0k 0+0io 175pf+0w
> tachyon1469(8) 5222.568u 8.425s 1:27:12.71 99.9% 0+0k 0+0io 0pf+0w
> tachyon2585(8) 5148.924u 7.837s 1:25:58.19 99.9% 0+0k 0+0io 19pf+0w
> tachyon1214(8) 5170.790u 5.684s 1:26:17.83 99.9% 0+0k 0+0io 0pf+0w
> tachyon2943(8) 5105.165u 5.959s 1:25:12.38 99.9% 0+0k 0+0io 0pf+0w
> tachyon1154(8) 5065.181u 6.282s 1:24:32.88 99.9% 0+0k 0+0io 74pf+0w
> Summary of lapw1para:
> tachyon1001 k=8 user=5167.78 wallclock=86
> tachyon1469 k=8 user=5222.57 wallclock=87
> tachyon2585 k=8 user=5148.92 wallclock=85
> tachyon1214 k=8 user=5170.79 wallclock=86
> tachyon2943 k=8 user=5105.16 wallclock=85
> tachyon1154 k=8 user=5065.18 wallclock=84
> 30883.253u 49.709s 1:27:15.42 590.8% 0+0k 0+0io 276pf+0w
>
> This shows that the LAPW1 step runs on 6 nodes and each node calculates 8
> k-points.
> It takes almost 1.5 hours!
> My system contains only 9 Bi atoms, and inversion symmetry is assumed, so
> the number of symmetry operations is 4. So I expected it to take much less
> time than that.
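
To put the numbers above in per-k-point terms, here is a small Python sketch that parses the "Summary of lapw1para" lines and divides the times by the number of k-points. The function name per_kpoint is hypothetical. It assumes "user" is CPU time in seconds and "wallclock" is in minutes, which is what the 1:26:16 elapsed time for the first host suggests.

```python
import re

# Summary lines copied from the dayfile quoted above.
SUMMARY = """\
tachyon1001 k=8 user=5167.78 wallclock=86
tachyon1469 k=8 user=5222.57 wallclock=87
tachyon2585 k=8 user=5148.92 wallclock=85
tachyon1214 k=8 user=5170.79 wallclock=86
tachyon2943 k=8 user=5105.16 wallclock=85
tachyon1154 k=8 user=5065.18 wallclock=84
"""

LINE = re.compile(r"(\S+)\s+k=(\d+)\s+user=([\d.]+)\s+wallclock=([\d.]+)")


def per_kpoint(summary):
    """Return (host, k, user_seconds_per_k, wall_minutes_per_k) per line."""
    rows = []
    for m in LINE.finditer(summary):
        host = m.group(1)
        k = int(m.group(2))
        user = float(m.group(3))  # assumed: CPU seconds
        wall = float(m.group(4))  # assumed: minutes
        rows.append((host, k, user / k, wall / k))
    return rows


if __name__ == "__main__":
    for host, k, user_k, wall_k in per_kpoint(SUMMARY):
        print(f"{host}: {k} k-points, {user_k:.1f} s CPU and "
              f"{wall_k:.2f} min wall per k-point")
```

For the first host this gives about 646 s of CPU time and 10.75 minutes of wall time per k-point, so the roughly 1.5 hours correspond to 8 k-points processed one after another in each job.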
>
> Can anybody help me?
>
> Sincerely,
>
> HJ Kim.
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien