[Wien] Problem when running MPI-parallel version of LAPW0

Rémi Arras remi.arras at cemes.fr
Wed Oct 22 13:29:09 CEST 2014


Dear Prof. Blaha, dear Wien2k users,

We tried to install the latest version of Wien2k (14.1) on a supercomputer 
and we are running into trouble with the MPI-parallel version.

1) lapw0 runs correctly in sequential mode, but crashes systematically 
when the parallel option is activated (independently of the number of 
cores we use):

> lapw0 -p
(16:08:13) starting parallel lapw0 at Mon Sep 29 16:08:13 CEST 2014
-------- .machine0 : 4 processors
Child id 1 SIGSEGV
Child id 2 SIGSEGV
Child id 3 SIGSEGV
Child id 0 SIGSEGV
**lapw0 crashed!
0.029u 0.036s 0:50.91 0.0% 0+0k 5248+104io 17pf+0w
error: command /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c lapw0.def failed
>stop error

w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
Child with myid of 1 has an error
'Unknown' - SIGSEGV
Child id 1 SIGSEGV
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1
**lapw0 crashed!
cat: No match.
0.027u 0.034s 1:33.13 0.0% 0+0k 5200+96io 16pf+0w
error: command /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c lapw0.def failed
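Could the problem be related to the per-process stack limit on the compute 
nodes? We have not verified this; as a sketch (bash syntax assumed), one 
could check and raise it in the job script before mpirun is called:

```shell
# Sketch (bash syntax): check and, if needed, raise the per-process
# stack limit, which Intel-Fortran MPI binaries are known to be
# sensitive to; SLURM does not always propagate interactive limits.
ulimit -s              # show the current soft stack limit
ulimit -s unlimited    # raise it (requires a sufficient hard limit)
```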


2) lapw2 also crashes sometimes when MPI parallelization is used. 
Sequential and k-point-parallel runs are fine, and, contrary to lapw0, 
the error does not occur in all cases (we did not notice any problem 
when testing the MPI benchmark with lapw1):

w2k_dispatch_signal(): received: Segmentation fault
application called MPI_Abort(MPI_COMM_WORLD, 768) - process 0

Our system is a Bullx DLC cluster (Red Hat Linux + Intel Ivy Bridge), and 
we use the compiler (+ MKL) intel/14.0.2.144 and intelmpi/4.1.3.049.
The batch scheduler is SLURM.

Here are the settings and options we used for the installation:

OPTIONS:
current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML 
-Dmkl_scalapack -traceback -xAVX
current:FFTW_OPT:-DFFTW3 
-I/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/include
current:FFTW_LIBS:-lfftw3_mpi -lfftw3 
-L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread 
-lmkl_core -openmp -lpthread
current:RP_LIBS:-mkl=cluster -lfftw3_mpi -lfftw3 
-L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
current:MPIRUN:mpirun -np _NP_ _EXEC_
current:MKL_TARGET_ARCH:intel64

PARALLEL_OPTIONS:
setenv TASKSET "no"
setenv USE_REMOTE 1
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ _EXEC_"
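For reference, the _NP_ and _EXEC_ placeholders in WIEN_MPIRUN are 
substituted by the *para scripts at run time; the expansion can be 
sketched as below (the binary name and .def file are example values, 
not our actual call). Under SLURM, replacing mpirun with srun in this 
template might also be worth testing:

```shell
# Sketch: how the WIEN_MPIRUN template is expanded at run time.
# _NP_ becomes the process count, _EXEC_ the MPI binary plus its
# .def file (the values substituted below are hypothetical examples).
template='mpirun -np _NP_ _EXEC_'
echo "$template" | sed -e 's/_NP_/4/' -e 's/_EXEC_/lapw0_mpi lapw0.def/'
# prints: mpirun -np 4 lapw0_mpi lapw0.def
```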

Any suggestions that could help us solve this problem would be 
greatly appreciated.

Best regards,
Rémi Arras