[Wien] Problem when running MPI-parallel version of LAPW0
Michael Sluydts
Michael.Sluydts at UGent.be
Wed Oct 22 13:37:28 CEST 2014
One possibly important note: the Python script is written for a Torque/PBS
queuing system (it is based on $PBS_NODEFILE).
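
For illustration only, a minimal sketch of what such a helper could look like
(this is an assumption, not the actual script discussed in this thread): it
reads the node list that Torque/PBS provides in $PBS_NODEFILE and writes a
Wien2k .machines file from it.

#!/usr/bin/env python
# Minimal sketch, not the actual script from this thread: build a Wien2k
# .machines file from $PBS_NODEFILE (Torque/PBS writes one line per
# allocated core, so counting lines per host gives the core count).
import os
from collections import Counter

with open(os.environ["PBS_NODEFILE"]) as f:
    cores = Counter(line.strip() for line in f if line.strip())

with open(".machines", "w") as m:
    # one MPI job over all allocated cores for lapw0
    m.write("lapw0: " + " ".join("%s:%d" % (h, n) for h, n in cores.items()) + "\n")
    # one parallel job (weight 1) per node for lapw1/lapw2
    for host, n in cores.items():
        m.write("1: %s:%d\n" % (host, n))
    m.write("granularity:1\n")
    m.write("extrafine:1\n")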
Rémi Arras wrote on 22/10/2014 13:29:
> Dear Pr. Blaha, Dear Wien2k users,
>
> We tried to install the latest version of Wien2k (14.1) on a
> supercomputer and we are running into trouble with the MPI-parallel
> version.
>
> 1) lapw0 runs correctly in sequential mode, but crashes systematically
> when the parallel option is activated (regardless of the number of
> cores we use):
>
> > lapw0 -p    (16:08:13) starting parallel lapw0 at lun. sept. 29 16:08:13 CEST 2014
> -------- .machine0 : 4 processors
> Child id 1 SIGSEGV
> Child id 2 SIGSEGV
> Child id 3 SIGSEGV
> Child id 0 SIGSEGV
> **  lapw0 crashed!
> 0.029u 0.036s 0:50.91 0.0% 0+0k 5248+104io 17pf+0w
> error: command /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c lapw0.def   failed
> > stop error
>
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> Child with myid of 1 has an error
> 'Unknown' - SIGSEGV
> Child id 1 SIGSEGV
> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1
> **  lapw0 crashed!
> cat: No match.
> 0.027u 0.034s 1:33.13 0.0% 0+0k 5200+96io 16pf+0w
> error: command /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c lapw0.def   failed
>
>
> 2) lapw2 also crashes sometimes when MPI parallelization is used.
> Sequential or k-point-parallel runs are fine, and unlike lapw0, the
> error does not occur in every case (we did not notice any problem when
> testing the MPI benchmark with lapw1):
>
> w2k_dispatch_signal(): received: Segmentation fault
> application called MPI_Abort(MPI_COMM_WORLD, 768) - process 0
>
> Our system is a Bullx DLC cluster (Red Hat Linux + Intel Ivy Bridge), and
> we use the Intel compiler (+ MKL) intel/14.0.2.144 with intelmpi/4.1.3.049.
> The batch scheduler is SLURM.
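
(Not from the original message, just a hedged aside: under SLURM there is no
$PBS_NODEFILE, so a helper like the sketch above would have to get the
expanded host list from the scheduler instead, roughly as below, and then
write the .machines file the same way. The cores-per-node value here is an
assumption.)

# Rough sketch only (assumed environment): expand SLURM's compressed node
# list (e.g. "node[01-04]") into individual hostnames and print .machines
# lines; the core count per node is an assumed value.
import os
import subprocess

hosts = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
).decode().split()
cores_per_node = int(os.environ.get("SLURM_NTASKS_PER_NODE", "4"))  # assumption

print("lapw0: " + " ".join("%s:%d" % (h, cores_per_node) for h in hosts))
for h in hosts:
    print("1: %s:%d" % (h, cores_per_node))
print("granularity:1")
print("extrafine:1")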
>
> Here are the settings and options we used for the installation:
>
> OPTIONS:
> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -Dmkl_scalapack -traceback -xAVX
> current:FFTW_OPT:-DFFTW3 -I/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/include
> current:FFTW_LIBS:-lfftw3_mpi -lfftw3 -L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
> current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
> current:DPARALLEL:'-DParallel'
> current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread
> current:RP_LIBS:-mkl=cluster -lfftw3_mpi -lfftw3 -L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
> current:MPIRUN:mpirun -np _NP_ _EXEC_
> current:MKL_TARGET_ARCH:intel64
>
> PARALLEL_OPTIONS:
> setenv TASKSET "no"
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 1
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ _EXEC_"
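
(Illustration added for clarity, with assumed values: the _NP_ and _EXEC_
placeholders in MPIRUN / WIEN_MPIRUN are filled in by the Wien2k parallel
scripts with the process count and the program to launch, so the setting
above ends up running a command along these lines.)

# Assumed values, for illustration only
template = "mpirun -np _NP_ _EXEC_"
command = template.replace("_NP_", "4").replace("_EXEC_", "lapw0_mpi lapw0.def")
print(command)  # -> mpirun -np 4 lapw0_mpi lapw0.def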
>
> Any suggestions which could help us to solve this problem would be
> greatly appreciated.
>
> Best regards,
> Rémi Arras
>
>