[Wien] Problem when running MPI-parallel version of LAPW0

Laurence Marks L-marks at northwestern.edu
Wed Oct 22 13:48:20 CEST 2014


It is often hard to pin down exactly what is going wrong with MPI. Most
often the cause is an incorrect combination of ScaLAPACK/BLACS libraries in
the linking options.

The first thing to check is your linking options, using the Intel MKL Link
Line Advisor
(https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/).
What you have does not look exactly right to me, but I have not used your
release.
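
As a rough cross-check of what actually got linked, the sketch below
filters ldd output for the MPI-related libraries. It assumes the MPI binary
is dynamically linked and is named lapw0_mpi under $WIENROOT; the helper
name filter_mpi_libs is made up for illustration.

```shell
# Hypothetical helper: keep only the ldd lines that mention the
# libraries most often involved in MPI link problems.
filter_mpi_libs() {
    grep -i -E 'scalapack|blacs|mpi|fftw'
}

# Usage (assumes a dynamically linked lapw0_mpi in $WIENROOT):
#   ldd "$WIENROOT/lapw0_mpi" | filter_mpi_libs
```

A BLACS library built for a different MPI than the one you run with (for
example an OpenMPI BLACS together with Intel MPI) would be a likely culprit.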

If that does not work, look in case.dayfile, the log file.
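
A minimal sketch of that check, assuming you run it on the case directory
and that a non-empty *.error file marks a failed step; the function name
show_wien_errors is hypothetical, not part of WIEN2k.

```shell
# Hypothetical helper: summarize the logs of a WIEN2k case directory.
# Argument: path to the case directory (defaults to the current one).
show_wien_errors() {
    dir="${1:-.}"
    # List *.error files that are not empty (assumed to signal a failure).
    find "$dir" -maxdepth 1 -name '*.error' -size +0c -print
    # Show the last lines of each dayfile found in that directory.
    for f in "$dir"/*.dayfile; do
        if [ -f "$f" ]; then
            tail -n 20 "$f"
        fi
    done
    return 0
}

# Usage, from inside the case directory:
#   show_wien_errors .
```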

If there is still nothing, it is sometimes useful to comment out the line

      CALL W2kinit

in lapw0.F, recompile, and then just run "x lapw0 -p". You will sometimes
get more information this way, although it is less safe: without that call,
MPI tasks can in some cases hang forever.

On Wed, Oct 22, 2014 at 6:29 AM, Rémi Arras <remi.arras at cemes.fr> wrote:

>  Dear Pr. Blaha, Dear Wien2k users,
>
> We tried to install the latest version of Wien2k (14.1) on a supercomputer
> and we are running into trouble with the MPI parallel version.
>
> 1)  lapw0 is running correctly in sequential, but crashes systematically
> when the parallel option is activated (independently of the number of cores
> we use):
>
> >   lapw0 -p    (16:08:13) starting parallel lapw0 at lun. sept. 29 16:08:13
> CEST 2014
> -------- .machine0 : 4 processors
>  Child id           1 SIGSEGV
>  Child id           2 SIGSEGV
>  Child id           3 SIGSEGV
>  Child id           0 SIGSEGV
> **  lapw0 crashed!
> 0.029u 0.036s 0:50.91 0.0%      0+0k 5248+104io 17pf+0w
> error: command   /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c
> lapw0.def   failed
> >   stop error
>
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
>  Child with myid of            1  has an error
> 'Unknown' - SIGSEGV
>  Child id           1 SIGSEGV
> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1
> **  lapw0 crashed!
> cat: No match.0.027u 0.034s 1:33.13 0.0%      0+0k 5200+96io 16pf+0w
> error: command   /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up
> -c lapw0.def   failed
>
>
> 2) lapw2 also crashes sometimes when MPI parallelization is used.
> Sequential or k-parallel runs are ok, and contrary to lapw0, the error does
> not occur for all cases (we did not notice any problem when testing the
> mpi benchmark with lapw1):
>
> w2k_dispatch_signal(): received: Segmentation fault application called
> MPI_Abort(MPI_COMM_WORLD, 768) - process 0
>
> Our system is a Bullx DLC Cluster (Linux Red Hat + Intel Ivy Bridge) and we
> use the compiler(+mkl) intel/14.0.2.144 and intelmpi/4.1.3.049.
> The batch Scheduler is SLURM.
>
> Here are the settings and the options we used for the installation :
>
> OPTIONS:
> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
> -Dmkl_scalapack -traceback -xAVX
> current:FFTW_OPT:-DFFTW3
> -I/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/include
> current:FFTW_LIBS:-lfftw3_mpi -lfftw3
> -L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
> current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
> current:DPARALLEL:'-DParallel'
> current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread
> -lmkl_core -openmp -lpthread
> current:RP_LIBS:-mkl=cluster -lfftw3_mpi -lfftw3
> -L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
> current:MPIRUN:mpirun -np _NP_ _EXEC_
> current:MKL_TARGET_ARCH:intel64
>
> PARALLEL_OPTIONS:
> setenv TASKSET "no"
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 1
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ _EXEC_"
>
> Any suggestions which could help us to solve this problem would be greatly
> appreciated.
>
> Best regards,
> Rémi Arras
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
Corrosion in 4D: MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Györgyi