<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>Dear all,</div><div><br></div><div>I have been trying to run parallel calculations for almost a month now.</div><div><br></div><div>I am working on:</div><div>Model : SUN Blade 6275 cluster</div><div>Processor: Intel Xeon X5570</div><div>CPU/node : 8 cores</div><div>Memory : 24GB/node, 3GB/core</div><div>Network: Infiniband 40G 8X QDR</div><div>Operating system: Red Hat Enterprise Linux 5.3</div><div>Job control : SGE 6.2u5</div><div><br></div><div>Compiler : Intel 11.1 (MKL included)</div><div><span class="Apple-tab-span" style="white-space:pre">        </span>MPI : Open MPI 1.3.3</div><div><span class="Apple-tab-span" style="white-space:pre">        </span>FFTW: 2.1.5 (FFTW was compiled with Intel 11.1 and configured with --enable-mpi LDFLAGS=-L$MPIHOME/$LIBRARYPATH F77=ifort CC=icc --with-sgi-mp --with-openmp --enable-threads)</div><div><br></div><div><br></div><div>Compiler options</div><div><div> O Compiler options: -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include</div><div> L Linker Flags: $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread</div><div> P Preprocessor flags '-DParallel'</div><div> R R_LIB (LAPACK+BLAS): -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide</div></div><div><br></div><div><div> RP RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)</div><div> FP FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include</div><div> MP MPIRUN commando : mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 48 -mca oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_ -machinefile _HOSTS_ _EXEC_</div></div><div><br></div><div><br></div><div><br></div><div>Within this 
environment, the compilation finishes without any error messages.</div><div><br></div><div>To create the .machines file, I type "proclist=(`cat $TMPDIR/machines`)".</div><div>It gives me the list of nodes, with one entry per CPU.</div><div>If I set the total number of CPUs to 384 in the job script, it returns 384 entries.</div><div>Since it lists the node name once for each CPU, each node name appears 8 times.</div><div><br></div><div><b>case1 : k-point parallelism + 8 MPI tasks per k-point</b></div><div><b>1. In my case, I want to calculate with 48 k-points and 8 MPI tasks per node (that is, per k-point), so the .machines file was:</b></div><div><br></div><div>lapw0:tachyon2066:8 tachyon1982:8 tachyon1207:8 tachyon1396:8 tachyon1152:8 tachyon2440:8 tachyon2120:8 tachyon1555:8 tachyon2319:8 tachyon2470:8 tachyon1612:8 tachyon2274:8 tachyon1402:8 tachyon2846:8 tachyon2091:8 tachyon1622:8 tachyon1920:8 tachyon2213:8 tachyon1832:8 tachyon2672:8 tachyon2370:8 tachyon2545:8 tachyon2359:8 tachyon1770:8 tachyon1018:8 tachyon1456:8 tachyon1429:8 tachyon3074:8 tachyon1169:8 tachyon2400:8 tachyon2688:8 tachyon1099:8 tachyon2906:8 tachyon1394:8 tachyon1830:8 tachyon1383:8 tachyon2157:8 tachyon2818:8 tachyon2644:8 tachyon2283:8 tachyon1213:8 tachyon1542:8 tachyon2726:8 tachyon2152:8 tachyon1135:8 tachyon2144:8 tachyon3015:8 
tachyon2077:8</div><div>1:tachyon2066:8</div><div>1:tachyon1982:8</div><div>1:tachyon1207:8</div><div>1:tachyon1396:8</div><div>1:tachyon1152:8</div><div>1:tachyon2440:8</div><div>1:tachyon2120:8</div><div>1:tachyon1555:8</div><div>1:tachyon2319:8</div><div>1:tachyon2470:8</div><div>1:tachyon1612:8</div><div>1:tachyon2274:8</div><div>1:tachyon1402:8</div><div>1:tachyon2846:8</div><div>1:tachyon2091:8</div><div>1:tachyon1622:8</div><div>1:tachyon1920:8</div><div>1:tachyon2213:8</div><div>1:tachyon1832:8</div><div>1:tachyon2672:8</div><div>1:tachyon2370:8</div><div>1:tachyon2545:8</div><div>1:tachyon2359:8</div><div>1:tachyon1770:8</div><div>1:tachyon1018:8</div><div>1:tachyon1456:8</div><div>1:tachyon1429:8</div><div>1:tachyon3074:8</div><div>1:tachyon1169:8</div><div>1:tachyon2400:8</div><div>1:tachyon2688:8</div><div>1:tachyon1099:8</div><div>1:tachyon2906:8</div><div>1:tachyon1394:8</div><div>1:tachyon1830:8</div><div>1:tachyon1383:8</div><div>1:tachyon2157:8</div><div>1:tachyon2818:8</div><div>1:tachyon2644:8</div><div>1:tachyon2283:8</div><div>1:tachyon1213:8</div><div>1:tachyon1542:8</div><div>1:tachyon2726:8</div><div>1:tachyon2152:8</div><div>1:tachyon1135:8</div><div>1:tachyon2144:8</div><div>1:tachyon3015:8</div><div>1:tachyon2077:8</div><div>granularity:1</div><div>extrafine:1</div><div>lapw2_vector_split:1</div><div><br></div><div>In this case, </div><div><br></div><div>case.dayfile shows</div><div><br></div><div>on tachyon2066 with PID 13780</div><div>using WIEN2k_11.1 (Release 5/4/2011) in /home01/x584cjh/code/WIEN2k_11</div><div><br></div><div><br></div><div> start (Fri Apr 20 09:13:32 KST 2012) with lapw0 (40/99 to go)</div><div><br></div><div> cycle 1 (Fri Apr 20 09:13:32 KST 2012) (40/99 to go)</div><div><br></div><div>> lapw0 -p (09:13:32) starting parallel lapw0 at Fri Apr 20 09:13:32 KST 2012</div><div>-------- .machine0 : 384 processors</div><div>tachyon2066:14892: open_hca: getaddr_netdev ERROR: Connection refused. 
Is ib1 configured?</div><div>tachyon2066:14892: open_hca: device mthca0 not found</div><div>tachyon2066:14892: open_hca: device mthca0 not found</div><div>tachyon2066:14892: open_hca: device ipath0 not found</div><div>tachyon2066:14892: open_hca: device ipath0 not found</div><div>tachyon2066:14894: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon2066:14894: open_hca: device mthca0 not found</div><div>tachyon2066:14894: open_hca: device mthca0 not found</div><div>tachyon2066:14891: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon2066:14894: open_hca: device ipath0 not found</div><div>tachyon2066:14894: open_hca: device ipath0 not found</div><div>tachyon2319:23519: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon2066:14891: open_hca: device mthca0 not found</div><div>tachyon2066:14891: open_hca: device mthca0 not found</div><div>tachyon1982:11799: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon2319:23519: open_hca: device mthca0 not found</div><div>tachyon2319:23519: open_hca: device mthca0 not found</div><div>tachyon1982:11799: open_hca: device mthca0 not found</div><div>tachyon1982:11799: open_hca: device mthca0 not found</div><div>tachyon2066:14890: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon1982:11801: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon1982:11801: open_hca: device mthca0 not found</div><div>tachyon1982:11801: open_hca: device mthca0 not found</div><div>tachyon2066:14890: open_hca: device mthca0 not found</div><div>tachyon2066:14890: open_hca: device mthca0 not found</div><div>tachyon1982:11805: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon2066:14893: open_hca: getaddr_netdev ERROR: Connection refused. 
Is ib1 configured?</div><div>tachyon1982:11805: open_hca: device mthca0 not found</div><div>tachyon1982:11805: open_hca: device mthca0 not found</div><div>tachyon2066:14891: open_hca: device ipath0 not found</div><div>tachyon2066:14891: open_hca: device ipath0 not found</div><div>tachyon2066:14893: open_hca: device mthca0 not found</div><div>tachyon1982:11803: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon1152:9532: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon2066:14893: open_hca: device mthca0 not found</div><div>tachyon1982:11803: open_hca: device mthca0 not found</div><div>tachyon1982:11803: open_hca: device mthca0 not found</div><div>tachyon1982:11799: open_hca: device ipath0 not found</div><div>tachyon1152:9532: open_hca: device mthca0 not found</div><div>tachyon1152:9532: open_hca: device mthca0 not found</div><div>tachyon1982:11799: open_hca: device ipath0 not found</div><div>tachyon2319:23519: open_hca: device ipath0 not found</div><div>tachyon1982:11801: open_hca: device ipath0 not found</div><div>tachyon2319:23519: open_hca: device ipath0 not found</div><div>tachyon1982:11801: open_hca: device ipath0 not found</div><div>tachyon2066:14888: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon1982:11802: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?</div><div>tachyon2066:14895: open_hca: getaddr_netdev ERROR: Connection refused. 
Is ib1 configured?</div><div>~~~~~~~ </div><div><br></div><div>~~~~~~~</div><div>abbreviation</div><div>~~~~~~~ </div><div><div>tachyon2906:7201: open_hca: device ipath0 not found</div><div>tachyon1920:29303: open_hca: device ipath0 not found</div><div>tachyon1920:29303: open_hca: device ipath0 not found</div><div> 'Unknow' - SIGSEGV, contact developers</div><div> Child id 0 SIGSEGV, contact developers</div><div>1.198u 2.020s 2:15.54 2.3% 0+0k 0+0io 18pf+0w</div><div>error: command /home01/x584cjh/code/WIEN2k_11/lapw0para lapw0.def failed</div><div><br></div><div>> stop error</div></div><div><br></div><div><br></div><div>and the log file that was created is:</div><div><br></div><div><div>--------------------------------------------------------------------------</div><div>WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].</div><div>This may be a real error or it may be an invalid entry in the uDAPL</div><div>Registry which is contained in the dat.conf file. Contact your local</div><div>System Administrator to confirm the availability of the interfaces in</div><div>the dat.conf file.</div><div>--------------------------------------------------------------------------</div><div>[tachyon2066:14787] 2243 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail</div><div>[tachyon2066:14787] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages</div><div>[tachyon2066:14787] 3 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail</div><div>[tachyon2066:14787] 1 more process has sent help message help-mpi-btl-udapl.txt / dat_ia_open fail</div><div>[tachyon2066:14787] 2 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail</div><div>[tachyon2066:14787] 3 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail</div><div>[tachyon2066:14787] 1 more process has sent help message help-mpi-btl-udapl.txt / dat_ia_open fail</div><div>w2k_dispatch_signal(): 
received: Segmentation fault</div><div>*** An error occurred in MPI_Comm_f2c</div><div>*** before MPI was initialized</div><div>*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)</div><div>[tachyon2066:14895] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>--------------------------------------------------------------------------</div><div>mpirun has exited due to process rank 7 with PID 14895 on</div><div>node tachyon2066 exiting without calling "finalize". This may</div><div>have caused other processes in the application to be</div><div>terminated by signals sent by mpirun (as reported here).</div><div>--------------------------------------------------------------------------</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div><div>w2k_dispatch_signal(): received: Segmentation fault</div></div><div><div>~~~~~~~ </div><div><br></div><div>~~~~~~~</div><div>abbreviation</div><div>~~~~~~~ 
</div><div></div></div><div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): 
received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>w2k_dispatch_signal(): received: Terminated</div><div>[tachyon2066:14787] 2 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail</div></div><div><br></div><div><br></div><div>What am I missing?</div><div>What should I do? </div><div><br></div><div><b>case2 : k-point parallelism</b></div><div><b>2. For the case of k-point parallelism only, I just set the total number of CPUs to 384/8=48. </b></div><div><br></div><div>The generated .machines file is:</div><div><div><br></div><div>lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 tachyon2673:8</div><div>1:tachyon1119</div><div>1:tachyon2665</div><div>1:tachyon3150</div><div>1:tachyon2896</div><div>1:tachyon1519</div><div>1:tachyon2673</div><div>granularity:1</div><div>extrafine:1</div><div>lapw2_vector_split:1</div></div><div><br></div><div>And the generated .processes file is:</div><div>init:tachyon1119</div><div>init:tachyon2665</div><div>init:tachyon3150</div><div>init:tachyon2896</div><div>init:tachyon1519</div><div>init:tachyon2673</div><div>1 : tachyon1119 : 8 : 1 : 1</div><div>2 : tachyon2665 : 8 : 1 : 2</div><div>3 : tachyon3150 : 8 : 1 : 3</div><div>4 : tachyon2896 : 8 : 1 : 4</div><div>5 : tachyon1519 : 8 : 1 : 5</div><div>6 : tachyon2673 : 8 : 1 : 6</div><div><br></div><div>The calculation runs smoothly until it hits the time limit, but the problem is that it takes too long. 
</div><div>The .dayfile is shown below.</div><div><br></div><div><div> start (Wed Apr 18 18:02:36 KST 2012) with lapw0 (40/99 to go)</div><div><br></div><div> cycle 1 (Wed Apr 18 18:02:36 KST 2012) (40/99 to go)</div><div><br></div><div>> lapw0 -p (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012</div><div>-------- .machine0 : 48 processors</div><div>83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w</div><div>:FORCE convergence: 0 1 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO</div><div>> lapw1 -p (18:02:55) starting parallel lapw1 at Wed Apr 18 18:02:55 KST 2012</div><div>-> starting parallel LAPW1 jobs at Wed Apr 18 18:02:55 KST 2012</div><div>running LAPW1 in parallel mode (using .machines)</div><div>6 number_of_parallel_jobs</div><div> tachyon1001(8) 5167.781u 7.573s 1:26:16.75 99.9% 0+0k 0+0io 175pf+0w</div><div> tachyon1469(8) 5222.568u 8.425s 1:27:12.71 99.9% 0+0k 0+0io 0pf+0w</div><div> tachyon2585(8) 5148.924u 7.837s 1:25:58.19 99.9% 0+0k 0+0io 19pf+0w</div><div> tachyon1214(8) 5170.790u 5.684s 1:26:17.83 99.9% 0+0k 0+0io 0pf+0w</div><div> tachyon2943(8) 5105.165u 5.959s 1:25:12.38 99.9% 0+0k 0+0io 0pf+0w</div><div> tachyon1154(8) 5065.181u 6.282s 1:24:32.88 99.9% 0+0k 0+0io 74pf+0w</div><div> Summary of lapw1para:</div><div> tachyon1001 k=8 user=5167.78 wallclock=86</div><div> tachyon1469 k=8 user=5222.57 wallclock=87</div><div> tachyon2585 k=8 user=5148.92 wallclock=85</div><div> tachyon1214 k=8 user=5170.79 wallclock=86</div><div> tachyon2943 k=8 user=5105.16 wallclock=85</div><div> tachyon1154 k=8 user=5065.18 wallclock=84</div><div>30883.253u 49.709s 1:27:15.42 590.8% 0+0k 0+0io 276pf+0w</div></div><div><br></div><div>This shows that the LAPW1 step runs on 6 nodes and each node calculates 8 k-points.</div><div>It takes almost 1.5 hours!</div><div>My system contains only 9 Bi atoms, and inversion symmetry is assumed, so the number of symmetry operations is 4. So I expected it to take much less time than that. 
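<div><br></div><div>As a rough check of where the time goes, the lapw1para summary above can be reduced to a user time per k-point. The script below is only a sketch: the function name per_kpoint_seconds and the pasted SUMMARY text are illustrative and not part of WIEN2k. With the numbers from the dayfile it gives roughly 10 to 11 minutes of user time per k-point on every node.</div>
<pre>
```python
import re

# Summary lines copied from the lapw1para output in the dayfile above.
SUMMARY = """
tachyon1001 k=8 user=5167.78 wallclock=86
tachyon1469 k=8 user=5222.57 wallclock=87
tachyon2585 k=8 user=5148.92 wallclock=85
tachyon1214 k=8 user=5170.79 wallclock=86
tachyon2943 k=8 user=5105.16 wallclock=85
tachyon1154 k=8 user=5065.18 wallclock=84
"""

# One summary line: host name, number of k-points, user time, wallclock.
LINE = re.compile(r"(\S+)\s+k=(\d+)\s+user=([\d.]+)\s+wallclock=(\d+)")


def per_kpoint_seconds(summary):
    """Return {host: user CPU seconds per k-point} for each summary line."""
    result = {}
    for match in LINE.finditer(summary):
        host, nk, user, _wall = match.groups()
        result[host] = float(user) / int(nk)
    return result


costs = per_kpoint_seconds(SUMMARY)
for host, secs in costs.items():
    print(f"{host}: {secs / 60:.1f} min of user time per k-point")
```
</pre>
<div>The spread between nodes is small, so the load is balanced and the long run time comes from the cost of each k-point rather than from one slow node.</div>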
</div><div> </div><div>Can anybody help me?</div><div><br></div><div>Sincerely,</div><div><br></div><div>HJ Kim.</div><div><br></div><div apple-content-edited="true">
</div>
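<div>For reference, the case 1 .machines layout shown above could be generated from the SGE host list with a few lines of Python. This is a minimal sketch under assumptions: the function name build_machines and the node names nodeA and nodeB are made up for illustration, and it assumes that $TMPDIR/machines lists one host name per granted slot, as described above.</div>
<pre>
```python
from collections import Counter


def build_machines(slot_hosts):
    """Build .machines text from a list with one host name per granted slot."""
    slots = Counter(slot_hosts)  # keeps the first-seen order of the hosts
    # lapw0 runs as a single MPI job across all hosts.
    lines = ["lapw0:" + " ".join(f"{h}:{n}" for h, n in slots.items())]
    # One k-point job per host, each using all slots of that host for MPI.
    lines += [f"1:{h}:{n}" for h, n in slots.items()]
    lines += ["granularity:1", "extrafine:1", "lapw2_vector_split:1"]
    return "\n".join(lines) + "\n"


# Hypothetical two-node example; real input would be the lines of $TMPDIR/machines.
example = ["nodeA"] * 8 + ["nodeB"] * 8
print(build_machines(example), end="")
```
</pre>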
<br></body></html>