[Wien] Problems with mpi for Wien12.1

Laurence Marks L-marks at northwestern.edu
Fri Aug 24 15:35:57 CEST 2012


In my experience the SIGSEGV normally comes from mixing different flavors of
mpif90 and mpirun. OpenMPI, MPICH2, and Intel's MPI all need different
versions of BLACS. You can also have problems if you choose the wrong integer
model (lp64 vs. ilp64) on the MKL link advisor page. I would check with ldd
that lapw0_mpi is linked against the right versions, and that the default
commands are the ones you expect (e.g. which mpirun). Often you can minimize
problems by linking mpi statically.
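
For example, something along these lines (a minimal sketch of the checks,
assuming WIENROOT is set; which library names actually show up depends on
whether MKL/BLACS were linked statically or dynamically):

  which mpirun mpif90
  mpirun --version
  ldd $WIENROOT/lapw0_mpi | grep -i -E 'mpi|blacs|mkl'

All of these should point to the same MPI flavor that lapw0_mpi was linked
against (here, Intel MPI).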

N.B. The "contact developers" message is a relic from when some code was
added for fault handlers and to eliminate the issues with limits that used to
be pervasive. It should probably be removed.

---------------------------
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Györgyi
 On Aug 24, 2012 8:22 AM, "Paul Fons" <paul-fons at aist.go.jp> wrote:

>  Dear Prof. Blaha,
> Thank you for your earlier email.  Running the command manually gives the
> following output (for a GaAs structure that works fine in serial or k-point
> parallel form).  I am still not sure what to try next.  Any suggestions?
>
>
>  matstud at ursa:~/WienDisk/Fons/GaAs> mpirun -np 4 ${WIENROOT}/lapw0_mpi lapw0.def
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
>  Child id           0 SIGSEGV, contact developers
>  Child id           1 SIGSEGV, contact developers
>  Child id           3 SIGSEGV, contact developers
>  Child id           2 SIGSEGV, contact developers
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>
>
>  The MPI compilation options from siteconfig are as follows (the
> settings are from the Intel MKL link advisor plus the fftw3 library):
>
>   Current settings:
>      RP  RP_LIB(SCALAPACK+PBLAS): -L$(MKLROOT)/lib/intel64
> $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
> $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a -lmkl_scalapack_lp64
> -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core
> -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm -L/opt/local/fftw3/lib/
> -lfftw3_mpi -lfftw3 $(R_LIBS)
>      FP  FPOPT(par.comp.options): -I$(MKLROOT)/include/intel64/lp64
> -I$(MKLROOT)/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
> -DFFTW3 -traceback
>      MP  MPIRUN commando        : mpirun -np _NP_ -machinefile _HOSTS_
> _EXEC_
>
>  The file parallel_options now reads
>  setenv USE_REMOTE 1
> setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
>
>
>  I changed MPI_REMOTE to 0 as suggested (I was not sure whether this applied
> to the Intel MPI environment, as the siteconfig prompt only mentions mpich2).
>
>  As I mentioned, the mpirun command itself seems to work fine.  For example,
> the fftw3 benchmark program with 24 processes gives
>
>  mpirun -np 24 ./mpi-bench 1024x1024
> Problem: 1024x1024, setup: 126.32 ms, time: 15.98 ms, ``mflops'': 6562.2
>
>
>
>  On Aug 24, 2012, at 3:05 PM, Peter Blaha wrote:
>
>  Hard to say.
>
> What is in $WIENROOT/parallel_options ?
> MPI_REMOTE should be 0 !
>
> Otherwise run lapw0_mpi by "hand":
>
> mpirun -np 4 $WIENROOT/lapw0_mpi lapw0.def   (or including  -machinefile
> .machine0)
>
>
> On 24.08.2012 02:24, Paul Fons wrote:
>
> Greetings all,
>
>   I have compiled Wien2K 12.1 under OpenSuse 11.4 (and OpenSuse 12.1)
> and the latest Intel compilers, with identical mpi launch problems, and I
> am hoping for some suggestions as to where to look to fix things.  Note
> that the serial and k-point parallel versions of the code run fine (I
> have optimized GaAs a lot in my troubleshooting!).
>
>  Environment.
>
>  I am using the latest ifort, icc, and Intel MPI libraries for Linux.
>
>  matstud at pyxis:~/Wien2K> ifort --version
> ifort (IFORT) 12.1.5 20120612
> Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.
>
>  matstud at pyxis:~/Wien2K> mpirun --version
> Intel(R) MPI Library for Linux* OS, Version 4.0 Update 3 Build 20110824
> Copyright (C) 2003-2011, Intel Corporation. All rights reserved.
>
>  matstud at pyxis:~/Wien2K> icc --version
> icc (ICC) 12.1.5 20120612
> Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.
>
>  My OPTIONS files from /siteconfig_lapw:
>
>  current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
> current:FPOPT:-I$(MKLROOT)/include/intel64/lp64 -I$(MKLROOT)/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback
> current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
> current:DPARALLEL:'-DParallel'
> current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread
> current:RP_LIBS:-L$(MKLROOT)/lib/intel64 $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a -lmkl_scalapack_lp64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm -L/opt/local/fftw3/lib/ -lfftw3_mpi -lfftw3 $(R_LIBS)
> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>
>  The code compiles and links without error.  It runs fine in serial mode
> and in k-point parallel mode, e.g. with a .machines file containing
>
>  1:localhost
> 1:localhost
> 1:localhost
> granularity:1
> extrafine:1
>
>  This runs fine.  When I attempt to run an mpi job with 12 processes
> (on a 12-core machine), I crash and burn (see below) with a SIGSEGV error
> and instructions to contact the developers.
>
>  The linking options were derived from Intel's MKL link advisor (the
> version on the Intel site).  I should add that the mpi-bench in fftw3
> works fine using the Intel MPI, as do commands like hostname or even
> abinit, so it would appear that the Intel MPI environment itself is
> fine.  I have wasted a lot of time trying to figure out how to fix this
> before writing to the list, but at this point I feel like a monkey at a
> keyboard attempting to duplicate Shakespeare -- if you know what I mean.
>
>  Thanks in advance for any heads up that you can offer.
>
>  .machines
>
>  lapw0:localhost:12
> 1:localhost:12
> granularity:1
> extrafine:1
>
>   stop error
>
>  error: command   /home/matstud/Wien2K/lapw0para -c lapw0.def   failed
> 0.029u 0.046s 0:00.93 6.4% 0+0k 0+176io 0pf+0w
>  Child id           2 SIGSEGV, contact developers
>  Child id           8 SIGSEGV, contact developers
>  Child id           7 SIGSEGV, contact developers
>  Child id          11 SIGSEGV, contact developers
>  Child id          10 SIGSEGV, contact developers
>  Child id           9 SIGSEGV, contact developers
>  Child id           6 SIGSEGV, contact developers
>  Child id           5 SIGSEGV, contact developers
>  Child id           4 SIGSEGV, contact developers
>  Child id           3 SIGSEGV, contact developers
>  Child id           1 SIGSEGV, contact developers
>  Child id           0 SIGSEGV, contact developers
> -------- .machine0 : 12 processors
>   lapw0 -p (09:04:45) starting parallel lapw0 at Fri Aug 24 09:04:45 JST 2012
>
>      cycle 1 (Fri Aug 24 09:04:45 JST 2012) (40/99 to go)
>
>      start (Fri Aug 24 09:04:45 JST 2012) with lapw0 (40/99 to go)
>
>  using WIEN2k_12.1 (Release 22/7/2012) in /home/matstud/Wien2K
> on pyxis with PID 15375
> Calculating GaAs in /usr/local/share/Wien2K/Fons/GaAs
>
> --
> Peter Blaha
> Inst.Materials Chemistry
> TU Vienna
> Getreidemarkt 9
> A-1060 Vienna
> Austria
> +43-1-5880115671
>
>
>   Dr. Paul Fons
>  Senior Research Scientist
>  Functional Nano-phase-change Research Team
>  Nanoelectronics Research Institute
>  National Institute for Advanced Industrial Science & Technology
>  METI
>
>  AIST Central 4, Higashi 1-1-1
>  Tsukuba, Ibaraki JAPAN 305-8568
>
>  tel. +81-298-61-5636
>  fax. +81-298-61-2939
>
>  email: paul-fons at aist.go.jp
>
>  The following lines give the Japanese-language version of the above:
>
>  1-1-1 Tsukuba Central East, Tsukuba, Ibaraki 305-8562
>  National Institute of Advanced Industrial Science and Technology (AIST)
>  Nanoelectronics Research Institute
>  Phase-Change Novel Functional Devices Research Team
>  Senior Researcher
>  Paul Fons