[Wien] problem in parallel calculations

hyunjung kim angpangmokjang at hanmail.net
Fri Apr 20 02:40:08 CEST 2012


Dear all,

I have been trying to get parallel calculations to work for almost a month.

I am working on the following system:
Model: SUN Blade 6275 cluster
Processor: Intel Xeon X5570
CPUs/node: 8
Memory: 24 GB/node, 3 GB/core
Network: InfiniBand 40G 8X QDR
OS: Red Hat Enterprise Linux 5.3
Job control: SGE 6.2u5

Compiler : intel 11.1 (MKL therein)
	MPI : openMPI 1.3.3
	FFTW: 2.1.5 (FFTW was compiled with intel 11.1 and configured with --enable-mpi LDFLAGS=-L$MPIHOME/$LIBRARYPATH F77=ifort CC=icc --with-sgi-mp --with-openmp --enable-threads)


Compiler options:
 O   Compiler options:        -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
 L   Linker Flags:            $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
 P   Preprocessor flags       '-DParallel'
 R   R_LIB (LAPACK+BLAS):     -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide

 RP  RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
 FP  FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
 MP  MPIRUN commando        : mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 48 -mca oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_ -machinefile _HOSTS_ _EXEC_



In this environment, compilation completes without any error messages.

To build the .machines file, I use "proclist=(`cat $TMPDIR/machines`)" in the job script.
This gives me the list of nodes, with one entry per CPU.
If I set the total number of CPUs to 384 in the job script, it returns 384 entries.
Since each entry is a node name, every node appears 8 times.
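To make the step above concrete, here is a minimal sketch of how such a flat host list (one entry per CPU, each node repeated 8 times) could be collapsed into a .machines file of the form used in case 1 below. The function name build_machines is hypothetical, the sketch is written in Python rather than in the shell of my actual job script, and the trailing granularity/extrafine/lapw2_vector_split lines are simply copied from my file.

```python
from collections import Counter


def build_machines(host_entries):
    """Return .machines text from a flat host list (one entry per CPU).

    Hypothetical helper: one 'lapw0:' line listing every node with its
    slot count, then one '1:host:N' line per node (one MPI job per node).
    """
    counts = Counter(host_entries)              # node name -> number of slots
    hosts = list(dict.fromkeys(host_entries))   # unique nodes, original order
    lines = ["lapw0:" + " ".join(f"{h}:{counts[h]}" for h in hosts)]
    lines += [f"1:{h}:{counts[h]}" for h in hosts]
    # Trailing options copied unchanged from the file shown in this post.
    lines += ["granularity:1", "extrafine:1", "lapw2_vector_split:1"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two nodes with 8 slots each, as they would appear in $TMPDIR/machines.
    entries = ["tachyon2066"] * 8 + ["tachyon1982"] * 8
    print(build_machines(entries), end="")
```

In a real job the entries would come from reading the file at $TMPDIR/machines line by line instead of the hard-coded list.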

Case 1: k-point parallelism + 8 MPI tasks per k-point
1. Since I want to calculate 48 k-points with 8 MPI tasks per node (one node per k-point), the .machines file was:

lapw0:tachyon2066:8 tachyon1982:8 tachyon1207:8 tachyon1396:8 tachyon1152:8 tachyon2440:8 tachyon2120:8 tachyon1555:8 tachyon2319:8 tachyon2470:8 tachyon1612:8 tachyon2274:8 tachyon1402:8 tachyon2846:8 tachyon2091:8 tachyon1622:8 tachyon1920:8 tachyon2213:8 tachyon1832:8 tachyon2672:8 tachyon2370:8 tachyon2545:8 tachyon2359:8 tachyon1770:8 tachyon1018:8 tachyon1456:8 tachyon1429:8 tachyon3074:8 tachyon1169:8 tachyon2400:8 tachyon2688:8 tachyon1099:8 tachyon2906:8 tachyon1394:8 tachyon1830:8 tachyon1383:8 tachyon2157:8 tachyon2818:8 tachyon2644:8 tachyon2283:8 tachyon1213:8 tachyon1542:8 tachyon2726:8 tachyon2152:8 tachyon1135:8 tachyon2144:8 tachyon3015:8 tachyon2077:8
1:tachyon2066:8
1:tachyon1982:8
1:tachyon1207:8
1:tachyon1396:8
1:tachyon1152:8
1:tachyon2440:8
1:tachyon2120:8
1:tachyon1555:8
1:tachyon2319:8
1:tachyon2470:8
1:tachyon1612:8
1:tachyon2274:8
1:tachyon1402:8
1:tachyon2846:8
1:tachyon2091:8
1:tachyon1622:8
1:tachyon1920:8
1:tachyon2213:8
1:tachyon1832:8
1:tachyon2672:8
1:tachyon2370:8
1:tachyon2545:8
1:tachyon2359:8
1:tachyon1770:8
1:tachyon1018:8
1:tachyon1456:8
1:tachyon1429:8
1:tachyon3074:8
1:tachyon1169:8
1:tachyon2400:8
1:tachyon2688:8
1:tachyon1099:8
1:tachyon2906:8
1:tachyon1394:8
1:tachyon1830:8
1:tachyon1383:8
1:tachyon2157:8
1:tachyon2818:8
1:tachyon2644:8
1:tachyon2283:8
1:tachyon1213:8
1:tachyon1542:8
1:tachyon2726:8
1:tachyon2152:8
1:tachyon1135:8
1:tachyon2144:8
1:tachyon3015:8
1:tachyon2077:8
granularity:1
extrafine:1
lapw2_vector_split:1

In this case, case.dayfile shows:

on tachyon2066 with PID 13780
using WIEN2k_11.1 (Release 5/4/2011) in /home01/x584cjh/code/WIEN2k_11


    start   (Fri Apr 20 09:13:32 KST 2012) with lapw0 (40/99 to go)

    cycle 1     (Fri Apr 20 09:13:32 KST 2012)  (40/99 to go)

>   lapw0 -p    (09:13:32) starting parallel lapw0 at Fri Apr 20 09:13:32 KST 2012
-------- .machine0 : 384 processors
tachyon2066:14892:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14892:  open_hca: device mthca0 not found
tachyon2066:14892:  open_hca: device mthca0 not found
tachyon2066:14892:  open_hca: device ipath0 not found
tachyon2066:14892:  open_hca: device ipath0 not found
tachyon2066:14894:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14894:  open_hca: device mthca0 not found
tachyon2066:14894:  open_hca: device mthca0 not found
tachyon2066:14891:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14894:  open_hca: device ipath0 not found
tachyon2066:14894:  open_hca: device ipath0 not found
tachyon2319:23519:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14891:  open_hca: device mthca0 not found
tachyon2066:14891:  open_hca: device mthca0 not found
tachyon1982:11799:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2319:23519:  open_hca: device mthca0 not found
tachyon2319:23519:  open_hca: device mthca0 not found
tachyon1982:11799:  open_hca: device mthca0 not found
tachyon1982:11799:  open_hca: device mthca0 not found
tachyon2066:14890:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon1982:11801:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon1982:11801:  open_hca: device mthca0 not found
tachyon1982:11801:  open_hca: device mthca0 not found
tachyon2066:14890:  open_hca: device mthca0 not found
tachyon2066:14890:  open_hca: device mthca0 not found
tachyon1982:11805:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14893:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon1982:11805:  open_hca: device mthca0 not found
tachyon1982:11805:  open_hca: device mthca0 not found
tachyon2066:14891:  open_hca: device ipath0 not found
tachyon2066:14891:  open_hca: device ipath0 not found
tachyon2066:14893:  open_hca: device mthca0 not found
tachyon1982:11803:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon1152:9532:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14893:  open_hca: device mthca0 not found
tachyon1982:11803:  open_hca: device mthca0 not found
tachyon1982:11803:  open_hca: device mthca0 not found
tachyon1982:11799:  open_hca: device ipath0 not found
tachyon1152:9532:  open_hca: device mthca0 not found
tachyon1152:9532:  open_hca: device mthca0 not found
tachyon1982:11799:  open_hca: device ipath0 not found
tachyon2319:23519:  open_hca: device ipath0 not found
tachyon1982:11801:  open_hca: device ipath0 not found
tachyon2319:23519:  open_hca: device ipath0 not found
tachyon1982:11801:  open_hca: device ipath0 not found
tachyon2066:14888:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon1982:11802:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14895:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
[... output abbreviated ...]
tachyon2906:7201:  open_hca: device ipath0 not found
tachyon1920:29303:  open_hca: device ipath0 not found
tachyon1920:29303:  open_hca: device ipath0 not found
 'Unknow' - SIGSEGV, contact developers
 Child id           0 SIGSEGV, contact developers
1.198u 2.020s 2:15.54 2.3%  0+0k 0+0io 18pf+0w
error: command   /home01/x584cjh/code/WIEN2k_11/lapw0para lapw0.def   failed

>   stop error


and the log file that was created is:

--------------------------------------------------------------------------
WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
[tachyon2066:14787] 2243 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[tachyon2066:14787] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[tachyon2066:14787] 3 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[tachyon2066:14787] 1 more process has sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[tachyon2066:14787] 2 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[tachyon2066:14787] 3 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[tachyon2066:14787] 1 more process has sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
w2k_dispatch_signal(): received: Segmentation fault
*** An error occurred in MPI_Comm_f2c
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[tachyon2066:14895] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
--------------------------------------------------------------------------
mpirun has exited due to process rank 7 with PID 14895 on
node tachyon2066 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
[... output abbreviated ...]
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
w2k_dispatch_signal(): received: Terminated
[tachyon2066:14787] 2 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail


What am I missing?
What should I do?

Case 2: k-point parallelism only
2. For k-point parallelism only, I simply set the total number of CPUs to 384/8 = 48.

The generated .machines file is:

lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 tachyon2673:8
1:tachyon1119
1:tachyon2665
1:tachyon3150
1:tachyon2896
1:tachyon1519
1:tachyon2673
granularity:1
extrafine:1
lapw2_vector_split:1

And the generated .processes file is:
init:tachyon1119
init:tachyon2665
init:tachyon3150
init:tachyon2896
init:tachyon1519
init:tachyon2673
1 : tachyon1119 :  8 : 1 : 1
2 : tachyon2665 :  8 : 1 : 2
3 : tachyon3150 :  8 : 1 : 3
4 : tachyon2896 :  8 : 1 : 4
5 : tachyon1519 :  8 : 1 : 5
6 : tachyon2673 :  8 : 1 : 6

The calculation runs smoothly until it hits the wall-time limit, but it is far too slow.
The dayfile is shown below.

    start   (Wed Apr 18 18:02:36 KST 2012) with lapw0 (40/99 to go)

    cycle 1     (Wed Apr 18 18:02:36 KST 2012)  (40/99 to go)

>   lapw0 -p    (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012
-------- .machine0 : 48 processors
83.154u 13.907s 0:17.04 569.5%  0+0k 0+0io 1899pf+0w
:FORCE convergence: 0 1 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO
>   lapw1  -p   (18:02:55) starting parallel lapw1 at Wed Apr 18 18:02:55 KST 2012
->  starting parallel LAPW1 jobs at Wed Apr 18 18:02:55 KST 2012
running LAPW1 in parallel mode (using .machines)
6 number_of_parallel_jobs
     tachyon1001(8) 5167.781u 7.573s 1:26:16.75 99.9%   0+0k 0+0io 175pf+0w
     tachyon1469(8) 5222.568u 8.425s 1:27:12.71 99.9%   0+0k 0+0io 0pf+0w
     tachyon2585(8) 5148.924u 7.837s 1:25:58.19 99.9%   0+0k 0+0io 19pf+0w
     tachyon1214(8) 5170.790u 5.684s 1:26:17.83 99.9%   0+0k 0+0io 0pf+0w
     tachyon2943(8) 5105.165u 5.959s 1:25:12.38 99.9%   0+0k 0+0io 0pf+0w
     tachyon1154(8) 5065.181u 6.282s 1:24:32.88 99.9%   0+0k 0+0io 74pf+0w
   Summary of lapw1para:
   tachyon1001   k=8     user=5167.78    wallclock=86
   tachyon1469   k=8     user=5222.57    wallclock=87
   tachyon2585   k=8     user=5148.92    wallclock=85
   tachyon1214   k=8     user=5170.79    wallclock=86
   tachyon2943   k=8     user=5105.16    wallclock=85
   tachyon1154   k=8     user=5065.18    wallclock=84
30883.253u 49.709s 1:27:15.42 590.8%    0+0k 0+0io 276pf+0w

This shows that the LAPW1 step runs on 6 nodes and that each node handles 8 k-points.
It takes almost 1.5 hours!
My system contains only 9 Bi atoms, and inversion symmetry is assumed, so there are 4 symmetry operations. I therefore expected it to take much less time than this.
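For reference, here is a rough back-of-the-envelope reading of the lapw1para summary above, written as a small Python sketch. It assumes that each "1:host" line in .machines starts a single serial lapw1 job that works through its 8 k-points one after another (which would be consistent with the 99.9% CPU figures), so that only 6 of the 48 requested cores are busy. The variable names are illustrative only.

```python
# User times (seconds) and k-point counts taken from the lapw1para summary.
user_seconds = [5167.78, 5222.57, 5148.92, 5170.79, 5105.16, 5065.18]
kpoints_per_job = 8

jobs = len(user_seconds)                   # number of parallel lapw1 jobs
total_kpoints = jobs * kpoints_per_job     # k-points in the whole run
mean_user = sum(user_seconds) / jobs       # average CPU time per job
sec_per_kpoint = mean_user / kpoints_per_job

print(f"parallel jobs        : {jobs}")
print(f"total k-points       : {total_kpoints}")
print(f"CPU time per k-point : {sec_per_kpoint:.0f} s "
      f"({sec_per_kpoint / 60:.1f} min)")

# Assumption: if every k-point ran at the same time on its own core,
# the lapw1 wall time would be roughly the time for one k-point.
# This ignores memory-bandwidth contention between cores on one node.
print(f"idealized wall time with {total_kpoints} concurrent jobs: "
      f"about {sec_per_kpoint / 60:.0f} min")
```

If that assumption holds, most of the 1.5 hours comes from running 8 k-points in sequence on each node rather than from a single k-point being slow.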
 
Can anybody help me?

Sincerely,

HJ Kim.










