[Wien] Problems with mpi for Wien12.1

Paul Fons paul-fons at aist.go.jp
Fri Aug 24 15:22:37 CEST 2012


Dear Prof. Blaha,
Thank you for your earlier email.  Running the command manually gives the following output (for a GaAs structure that works fine in serial or k-point parallel form).  I am still not sure what to try next.  Any suggestions?
 

matstud at ursa:~/WienDisk/Fons/GaAs> mpirun -np 4 ${WIENROOT}/lapw0_mpi lapw0.def
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
 Child id           0 SIGSEGV, contact developers
 Child id           1 SIGSEGV, contact developers
 Child id           3 SIGSEGV, contact developers
 Child id           2 SIGSEGV, contact developers
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)


The MPI compilation options from siteconfig are as follows (the settings come from the Intel MKL link advisor, plus the fftw3 library):

 Current settings:
     RP  RP_LIB(SCALAPACK+PBLAS): -L$(MKLROOT)/lib/intel64 $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a -lmkl_scalapack_lp64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm -L/opt/local/fftw3/lib/ -lfftw3_mpi -lfftw3 $(R_LIBS)
     FP  FPOPT(par.comp.options): -I$(MKLROOT)/include/intel64/lp64 -I$(MKLROOT)/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback
     MP  MPIRUN commando        : mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
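
Since the link line above combines static and shared MKL libraries with fftw3_mpi, it can help to check which MPI, MKL and FFTW shared libraries the executable actually picks up at run time. A minimal sketch, assuming lapw0_mpi is dynamically linked against them (statically linked .a archives will not show up in the output); the helper name link_summary is hypothetical and not part of WIEN2k:

```shell
# Hypothetical helper: read `ldd` output on stdin and keep only
# the lines that mention MPI, MKL or FFTW libraries.
link_summary() {
    grep -i -E 'mpi|mkl|fftw'
}

# Usage (not run here; assumes WIENROOT is set):
#   ldd "$WIENROOT/lapw0_mpi" | link_summary
```

If the MPI library listed there does not belong to the same installation as the mpirun found in the PATH, that mismatch would be one thing to rule out.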

The file parallel_options now reads
setenv USE_REMOTE 1
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
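
To make the WIEN_MPIRUN template concrete, here is a minimal sketch of how its _NP_, _HOSTS_ and _EXEC_ placeholders could be filled in. The function name expand_mpirun is hypothetical, and this illustrates the substitution only; it is not the actual code of lapw0para:

```shell
# Hypothetical helper: expand a WIEN_MPIRUN-style template.
# Arguments: template, number of processes, machine file, command to run.
expand_mpirun() {
    template=$1
    np=$2
    hosts=$3
    exec_cmd=$4
    # Replace each placeholder once; '|' is used as the sed delimiter
    # so that paths containing '/' need no escaping.
    printf '%s\n' "$template" |
        sed -e "s|_NP_|$np|" -e "s|_HOSTS_|$hosts|" -e "s|_EXEC_|$exec_cmd|"
}

# Print the command line for 4 processes and the machine file .machine0
expand_mpirun 'mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_' 4 .machine0 '$WIENROOT/lapw0_mpi lapw0.def'
```

With these arguments the sketch prints a command of the same form as the one run by hand above, with the -machinefile option added.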


I changed MPI_REMOTE to 0 as suggested (I was not sure this applied to the Intel MPI environment, as the siteconfig prompt only mentioned mpich2).

As I mentioned, the mpirun command itself seems to work fine.  For example, the fftw3 benchmark program gives the following with 24 processes:

mpirun -np 24 ./mpi-bench 1024x1024
Problem: 1024x1024, setup: 126.32 ms, time: 15.98 ms, ``mflops'': 6562.2



On Aug 24, 2012, at 3:05 PM, Peter Blaha wrote:

> Hard to say.
> 
> What is in $WIENROOT/parallel_options ?
> MPI_REMOTE should be 0 !
> 
> Otherwise run lapw0_mpi by "hand":
> 
> mpirun -np 4 $WIENROOT/lapw0_mpi lapw0.def   (or including  .machinefile .machine0)
> 
> 
> Am 24.08.2012 02:24, schrieb Paul Fons:
>> Greetings all,
>>   I have compiled Wien2K 12.1 under OpenSuse 11.4 (and OpenSuse 12.1)
>> and the latest Intel compilers with identical mpi launch problems and I
>> am hoping for some suggestions as to where to look to fix things.  Note
>> that the serial and k-point parallel versions of the code run fine (I
>> have optimized GaAs a lot in my troubleshooting!).
>> 
>> Environment.
>> 
>> I am using the latest Intel ifort, icc, and impi libraries for Linux.
>> 
>> matstud at pyxis:~/Wien2K> ifort --version
>> ifort (IFORT) 12.1.5 20120612
>> Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.
>> 
>> matstud at pyxis:~/Wien2K> mpirun --version
>> Intel(R) MPI Library for Linux* OS, Version 4.0 Update 3 Build 20110824
>> Copyright (C) 2003-2011, Intel Corporation. All rights reserved.
>> 
>> matstud at pyxis:~/Wien2K> icc --version
>> icc (ICC) 12.1.5 20120612
>> Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.
>> 
>> 
>> My OPTIONS files from /siteconfig_lapw
>> 
>> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>> current:FPOPT:-I$(MKLROOT)/include/intel64/lp64 -I$(MKLROOT)/include -FR
>> -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback
>> current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
>> current:DPARALLEL:'-DParallel'
>> current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread
>> -lmkl_core -openmp -lpthread
>> current:RP_LIBS:-L$(MKLROOT)/lib/intel64
>> $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
>> $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a -lmkl_scalapack_lp64
>> -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core
>> -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm -L/opt/local/fftw3/lib/
>> -lfftw3_mpi -lfftw3 $(R_LIBS)
>> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>> 
>> 
>> 
>> 
>> The code compiles and links without error.  It runs fine in serial mode
>> and in k-point parallel mode, e.g.
>> 
>> .machines with
>> 
>> 1:localhost
>> 1:localhost
>> 1:localhost
>> granularity:1
>> extrafine:1
>> 
>> This runs fine.  When I attempt to run an MPI job with 12 processes
>> (on a 12-core machine), I crash and burn (see below) with a SIGSEGV error
>> and instructions to contact the developers.
>> 
>> The linking options were derived from Intel's MKL link advisor (the
>> version on the Intel site).  I should add that the mpi-bench in fftw3
>> works fine using the Intel MPI, as do commands like hostname or even
>> abinit, so it would appear that the Intel MPI environment itself is
>> fine.  I have wasted a lot of time trying to figure out how to fix this
>> before writing to the list, but at this point, I feel like a monkey at a
>> keyboard attempting to duplicate Shakespeare -- if you know what I mean.
>>  Thanks in advance for any heads up that you can offer.
>> 
>> 
>> 
>> .machines
>> 
>> lapw0:localhost:12
>> 1:localhost:12
>> granularity:1
>> extrafine:1
>> 
>>>  stop error
>> 
>> error: command   /home/matstud/Wien2K/lapw0para -c lapw0.def   failed
>> 0.029u 0.046s 0:00.93 6.4%	0+0k 0+176io 0pf+0w
>>  Child id           2 SIGSEGV, contact developers
>>  Child id           8 SIGSEGV, contact developers
>>  Child id           7 SIGSEGV, contact developers
>>  Child id          11 SIGSEGV, contact developers
>>  Child id          10 SIGSEGV, contact developers
>>  Child id           9 SIGSEGV, contact developers
>>  Child id           6 SIGSEGV, contact developers
>>  Child id           5 SIGSEGV, contact developers
>>  Child id           4 SIGSEGV, contact developers
>>  Child id           3 SIGSEGV, contact developers
>>  Child id           1 SIGSEGV, contact developers
>>  Child id           0 SIGSEGV, contact developers
>> -------- .machine0 : 12 processors
>>>  lapw0 -p	(09:04:45) starting parallel lapw0 at Fri Aug 24 09:04:45 JST 2012
>> 
>>     cycle 1 	(Fri Aug 24 09:04:45 JST 2012) 	(40/99 to go)
>> 
>>     start 	(Fri Aug 24 09:04:45 JST 2012) with lapw0 (40/99 to go)
>> 
>> 
>> using WIEN2k_12.1 (Release 22/7/2012) in /home/matstud/Wien2K
>> on pyxis with PID 15375
>> Calculating GaAs in /usr/local/share/Wien2K/Fons/GaAs
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> 
> 
> -- 
> Peter Blaha
> Inst.Materials Chemistry
> TU Vienna
> Getreidemarkt 9
> A-1060 Vienna
> Austria
> +43-1-5880115671

Dr. Paul Fons
Senior Research Scientist
Functional Nano-phase-change Research Team
Nanoelectronics Research Institute
National Institute for Advanced Industrial Science & Technology
METI

AIST Central 4, Higashi 1-1-1
Tsukuba, Ibaraki JAPAN 305-8568

tel. +81-298-61-5636
fax. +81-298-61-2939

email: paul-fons at aist.go.jp

The following lines are in a Japanese font

〒305-8562 茨城県つくば市つくば中央東 1-1-1
産業技術総合研究所
ナノエレクトロニクス研究部門
相変化新規機能デバイス研究チーム
主任研究員
ポール・フォンス




