[Wien] Problems with mpi for Wien12.1
Paul Fons
paul-fons at aist.go.jp
Tue Aug 28 01:21:17 CEST 2012
Dear Prof. Blaha,
I was under the impression that I had already replied to your initial question; I apologize for the delay. I have been using the MPI compiler wrapper from the Intel MPI (4.0.3) suite, namely mpiifort. Here are the output of the which command and the version of the underlying Fortran compiler. Thank you for your help.
matstud at ursa:~/Wien2K> which mpiifort
/opt/intel/impi/4.0.3.008/intel64/bin/mpiifort
matstud at ursa:~/Wien2K> mpiifort --version
ifort (IFORT) 12.1.5 20120612
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
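As a quick sanity check that the compiler wrapper and the launcher belong to the same MPI installation, something like the following sketch could be used. It assumes both tools are installed as <prefix>/bin/<tool>, as in the mpiifort path above; the function names mpi_prefix and same_mpi_prefix are made up for this illustration.

```shell
#!/bin/sh
# Sketch: compare the install prefix of the MPI compiler wrapper and the launcher.
# Assumption: both tools live in <prefix>/bin/, as in the mpiifort path shown above.

# Print the install prefix for a tool path of the form <prefix>/bin/<tool>.
mpi_prefix() {
    dirname "$(dirname "$1")"
}

# Succeed (exit status 0) only if both tool paths share the same prefix.
same_mpi_prefix() {
    [ "$(mpi_prefix "$1")" = "$(mpi_prefix "$2")" ]
}

# Only run the live check when both tools are actually on the PATH.
if command -v mpiifort >/dev/null 2>&1 && command -v mpirun >/dev/null 2>&1; then
    if same_mpi_prefix "$(command -v mpiifort)" "$(command -v mpirun)"; then
        echo "mpiifort and mpirun share the prefix $(mpi_prefix "$(command -v mpiifort)")"
    else
        echo "WARNING: mpiifort and mpirun come from different prefixes"
    fi
fi
```

The Intel MPI wrappers also accept -show (mpiifort -show), which should print the underlying compiler command line without running it; that is another way to confirm which ifort the wrapper calls.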
Below is a short excerpt from a recompile of lapw1 using siteconfig. Note that mpiifort is being used.
touch .parallel
make PARALLEL='-DParallel' TYPE='REAL' TYPE_COMMENT='\!_REAL' \
./lapw1_mpi FORT=mpiifort FFLAGS=' -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include/intel64/lp64 -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback '-DParallel''
make[1]: Entering directory `/home/matstud/Wien2K_12_1/SRC_lapw1'
modules.F: REAL version extracted
mpiifort -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include/intel64/lp64 -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback -DParallel -c modules_tmp_.F
mv modules_tmp_.o modules.o
rm modules_tmp_.F
mpiifort -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include/intel64/lp64 -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback -DParallel -c abc.f
atpar.F: REAL version extracted
mpiifort -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include/intel64/lp64 -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback -DParallel -c atpar_tmp_.F
mv atpar_tmp_.o atpar.o
rm atpar_tmp_.F
calkpt.F: REAL version extracted
mpiifort -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include/intel64/lp64 -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback -DParallel -c calkpt_tmp_.F
mv calkpt_tmp_.o calkpt.o
rm calkpt_tmp_.F
mpiifort -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include/intel64/lp64 -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback -DParallel -c cbcomb.f
mpiifort -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include/intel64/lp64 -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback -DParallel -c coors.f
dscgst.F: REAL version extracted
mpiifort -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include/intel64/lp64 -I/opt/intel/composer_xe_2011_sp1.11.339/mkl/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback -DParallel -c dscgst_tmp_.F
mv dscgst_tmp_.o dscgst.o
rm dscgst_tmp_.F
And here is the final linking step:
mv W2kinit_tmp_.o W2kinit.o
rm W2kinit_tmp_.F
mpiifort -o ./lapw1c_mpi abc.o atpar.o bandv1.o calkpt.o cbcomb.o coors.o cputim.o dblr2k.o dgeqrl.o dgewy.o dgewyg.o dlbrfg.o dsbein1.o dscgst.o dstebz2.o dsyevx2.o dsyr2m.o dsyrb4.o dsyrb5l.o dsyrdt4.o dsywyv.o dsyxev4.o dvbes1.o eisps.o errclr.o errflg.o forfhs.o gaunt1.o gaunt2.o gbass.o gtfnam.o hamilt.o hns.o horb.o inikpt.o inilpw.o lapw1.o latgen.o lmsort.o locdef.o lohns.o lopw.o matmm.o modules.o nn.o outerr.o outwinb.o prtkpt.o prtres.o pzheevx16.o rdswar.o rint13.o rotate.o rotdef.o seclit.o seclr4.o seclr5.o select.o service.o setkpt.o setwar.o sphbes.o stern.o SymmRot.o tapewf.o ustphx.o vectf.o warpin.o wfpnt.o wfpnt1.o ylm.o zhcgst.o zheevx2.o zher2m.o jacdavblock.o make_albl.o global2local.o par_syrk.o my_dsygst.o refblas_dtrsm.o seclit_par.o pdsyevx17.o pdstebz17.o pdgetri_my.o pzgetri_my.o W2kutils.o W2kinit.o -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback -L/opt/intel/composer_xe_2011_sp1.11.339/mkl/lib/intel64 -pthread -L/opt/intel/composer_xe_2011_sp1.11.339/mkl/lib/intel64 /opt/intel/composer_xe_2011_sp1.11.339/mkl/lib/intel64/libmkl_blas95_lp64.a /opt/intel/composer_xe_2011_sp1.11.339/mkl/lib/intel64/libmkl_lapack95_lp64.a -lmkl_scalapack_lp64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm -L/opt/local/fftw3/lib/ -lfftw3_mpi -lfftw3 -lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread
make[1]: Leaving directory `/home/matstud/Wien2K_12_1/SRC_lapw1'
Copying programs
SRC_lapw1/lapw1
SRC_lapw1/lapw1c
SRC_lapw1/lapw1_mpi
SRC_lapw1/lapw1c_mpi
done.
Compile time errors (if any) were:
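Since the build itself lists no errors, the ldd check suggested by Laurence Marks (quoted below) seems the natural next step. The following is only a sketch: the helper names filter_mpi_libs and libmpi_outside_prefix are hypothetical, and the library name patterns (libmpi, blacs, fftw3_mpi) are my assumption about which libraries matter here.

```shell
#!/bin/sh
# Sketch: inspect which MPI-related shared libraries an executable resolves to.
# Assumption: the name patterns below cover the libraries of interest;
# adjust them for other MPI flavors.

# Read `ldd` output on stdin and keep only MPI-related lines.
filter_mpi_libs() {
    grep -E 'libmpi|blacs|fftw3_mpi' || true
}

# Read `ldd` output on stdin and print libmpi lines that do NOT resolve
# under the given prefix (this includes "not found" lines).
libmpi_outside_prefix() {
    grep 'libmpi' | grep -v -F "=> $1/" || true
}

# Live check, only when WIENROOT is set and the binary exists.
exe=${WIENROOT:-}/lapw0_mpi
if [ -n "${WIENROOT:-}" ] && [ -x "$exe" ] && command -v ldd >/dev/null 2>&1; then
    ldd "$exe" | filter_mpi_libs
fi
```

For example, ldd $WIENROOT/lapw0_mpi | libmpi_outside_prefix /opt/intel/impi/4.0.3.008 should print nothing if every libmpi library comes from the Intel MPI tree used for compiling. Statically linked libraries will not appear in ldd output at all.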
On Aug 24, 2012, at 11:59 PM, Peter Blaha wrote:
> To make this comment more clear:
>
> You did not tell us which command you are using for MPF (parallel compiler). It is not always mpif90 (as this could use some other compiler or mpi)
> it could be mpiifort or something else.
>
> Then check with "which mpif90" if it points to the proper directory/version of mpi,....
>
> On 24.08.2012 15:35, Laurence Marks wrote:
>> In my experience the SIGSEGV normally comes from mixing different flavors of mpif90 and mpirun. Openmpi, mpich2 and Intel's mpi all need different versions of blacs. You can also
>> have problems if you choose the wrong model for integers in the linking advisor page. I would check using ldd that lapw0_mpi is linked to the right version, and that the default
>> versions are correct (e.g. which mpirun). Often you can minimize problems by using static linking for mpi.
>>
>> N.B. The "contact developers" message is a relic of when some code was added for fault handlers and to eliminate issues with limits that used to be pervasive. It should probably
>> be removed.
>>
>> ---------------------------
>> Professor Laurence Marks
>> Department of Materials Science and Engineering
>> Northwestern University
>> www.numis.northwestern.edu <http://www.numis.northwestern.edu> 1-847-491-3996
>> "Research is to see what everybody else has seen, and to think what nobody else has thought"
>> Albert Szent-Gyorgi
>>
>> On Aug 24, 2012 8:22 AM, "Paul Fons" <paul-fons at aist.go.jp> wrote:
>>
>> Dear Prof. Blaha,
>> Thank you for your earlier email. Running the command manually gives the following output (for a GaAs structure that works fine in serial or k-point parallel form). I am
>> still not sure what to try next. Any suggestions?
>>
>> matstud at ursa:~/WienDisk/Fons/GaAs> mpirun -np 4 ${WIENROOT}/lapw0_mpi lapw0.def
>> w2k_dispatch_signal(): received: Segmentation fault
>> w2k_dispatch_signal(): received: Segmentation fault
>> w2k_dispatch_signal(): received: Segmentation fault
>> w2k_dispatch_signal(): received: Segmentation fault
>> Child id 0 SIGSEGV, contact developers
>> Child id 1 SIGSEGV, contact developers
>> Child id 3 SIGSEGV, contact developers
>> Child id 2 SIGSEGV, contact developers
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>
>>
>> The MPI compilation options from siteconfig are as follows: (the settings are from the Intel MKL link advisor plus the fftw3 library)
>>
>> Current settings:
>> RP RP_LIB(SCALAPACK+PBLAS): -L$(MKLROOT)/lib/intel64 $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a -lmkl_scalapack_lp64
>> -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm -L/opt/local/fftw3/lib/ -lfftw3_mpi -lfftw3 $(R_LIBS)
>> FP FPOPT(par.comp.options): -I$(MKLROOT)/include/intel64/lp64 -I$(MKLROOT)/include -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback
>> MP MPIRUN commando : mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>>
>> The file parallel_options now reads
>> setenv USE_REMOTE 1
>> setenv MPI_REMOTE 0
>> setenv WIEN_GRANULARITY 1
>> setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
>>
>>
>> I changed MPI_REMOTE to 0 as suggested (I was not sure this applied to the Intel MPI environment, as the siteconfig prompt only mentioned mpich2).
>>
>> As I mentioned, the mpirun command itself seems to work fine. For example, the fftw3 benchmark program gives the following with 24 processes:
>>
>> mpirun -np 24 ./mpi-bench 1024x1024
>> Problem: 1024x1024, setup: 126.32 ms, time: 15.98 ms, ``mflops'': 6562.2
>>
>>
>>
>> On Aug 24, 2012, at 3:05 PM, Peter Blaha wrote:
>>
>>> Hard to say.
>>>
>>> What is in $WIENROOT/parallel_options ?
>>> MPI_REMOTE should be 0 !
>>>
>>> Otherwise run lapw0_mpi by "hand":
>>>
>>> mpirun -np 4 $WIENROOT/lapw0_mpi lapw0.def (or including .machinefile .machine0)
>>>
>>>
>>> On 24.08.2012 02:24, Paul Fons wrote:
>>>> Greetings all,
>>>> I have compiled Wien2K 12.1 under OpenSuse 11.4 (and OpenSuse 12.1)
>>>> and the latest Intel compilers with identical mpi launch problems and I
>>>> am hoping for some suggestions as to where to look to fix things. Note
>>>> that the serial and k-point parallel versions of the code run fine (I
>>>> have optimized GaAs a lot in my troubleshooting!).
>>>>
>>>> Environment.
>>>>
>>>> I am using the latest Intel ifort, icc, and impi libraries for Linux.
>>>>
>>>> matstud at pyxis:~/Wien2K> ifort --version
>>>> ifort (IFORT) 12.1.5 20120612
>>>> Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
>>>>
>>>> matstud at pyxis:~/Wien2K> mpirun --version
>>>> Intel(R) MPI Library for Linux* OS, Version 4.0 Update 3 Build 20110824
>>>> Copyright (C) 2003-2011, Intel Corporation. All rights reserved.
>>>>
>>>> matstud at pyxis:~/Wien2K> icc --version
>>>> icc (ICC) 12.1.5 20120612
>>>> Copyright (C) 1985-2012 Intel Corporation. All rights reserved.
>>>>
>>>>
>>>> My OPTIONS files from /siteconfig_lapw
>>>>
>>>> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>>>> current:FPOPT:-I$(MKLROOT)/include/intel64/lp64 -I$(MKLROOT)/include -FR
>>>> -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -DFFTW3 -traceback
>>>> current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
>>>> current:DPARALLEL:'-DParallel'
>>>> current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread
>>>> -lmkl_core -openmp -lpthread
>>>> current:RP_LIBS:-L$(MKLROOT)/lib/intel64
>>>> $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
>>>> $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a -lmkl_scalapack_lp64
>>>> -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core
>>>> -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm -L/opt/local/fftw3/lib/
>>>> -lfftw3_mpi -lfftw3 $(R_LIBS)
>>>> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>>>>
>>>>
>>>>
>>>>
>>>> The code compiles and links without error. It runs fine in serial mode
>>>> and in k-point parallel mode, e.g.
>>>>
>>>> .machines with
>>>>
>>>> 1:localhost
>>>> 1:localhost
>>>> 1:localhost
>>>> granularity:1
>>>> extrafine:1
>>>>
>>>> This runs fine. When I attempt to run a mpi process with 12 processes
>>>> (on a 12 core machine), I crash and burn (see below) with a SIGSEGV error
>>>> with instructions to contact the developers.
>>>>
>>>> The linking options were derived from Intel's MKL link advisor (the
>>>> version on the Intel site). I should add that the mpi-bench in fftw3
>>>> works fine using the Intel MPI, as do commands like hostname or even
>>>> abinit, so it would appear that the Intel MPI environment itself is
>>>> fine. I have wasted a lot of time trying to figure out how to fix this
>>>> before writing to the list, but at this point, I feel like a monkey at a
>>>> keyboard attempting to duplicate Shakespeare -- if you know what I mean.
>>>> Thanks in advance for any heads up that you can offer.
>>>>
>>>>
>>>>
>>>> .machines
>>>>
>>>> lapw0:localhost:12
>>>> 1:localhost:12
>>>> granularity:1
>>>> extrafine:1
>>>>
>>>>> stop error
>>>>
>>>> error: command /home/matstud/Wien2K/lapw0para -c lapw0.def failed
>>>> 0.029u 0.046s 0:00.93 6.4%0+0k 0+176io 0pf+0w
>>>> Child id 2 SIGSEGV, contact developers
>>>> Child id 8 SIGSEGV, contact developers
>>>> Child id 7 SIGSEGV, contact developers
>>>> Child id 11 SIGSEGV, contact developers
>>>> Child id 10 SIGSEGV, contact developers
>>>> Child id 9 SIGSEGV, contact developers
>>>> Child id 6 SIGSEGV, contact developers
>>>> Child id 5 SIGSEGV, contact developers
>>>> Child id 4 SIGSEGV, contact developers
>>>> Child id 3 SIGSEGV, contact developers
>>>> Child id 1 SIGSEGV, contact developers
>>>> Child id 0 SIGSEGV, contact developers
>>>> -------- .machine0 : 12 processors
>>>>> lapw0 -p(09:04:45) starting parallel lapw0 at Fri Aug 24 09:04:45 JST 2012
>>>>
>>>> cycle 1 (Fri Aug 24 09:04:45 JST 2012) (40/99 to go)
>>>>
>>>> start (Fri Aug 24 09:04:45 JST 2012) with lapw0 (40/99 to go)
>>>>
>>>>
>>>> using WIEN2k_12.1 (Release 22/7/2012) in /home/matstud/Wien2K
>>>> on pyxis with PID 15375
>>>> Calculating GaAs in /usr/local/share/Wien2K/Fons/GaAs
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Wien mailing list
>>>> Wien at zeus.theochem.tuwien.ac.at <mailto:Wien at zeus.theochem.tuwien.ac.at>
>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>
>>>
>>> --
>>> Peter Blaha
>>> Inst.Materials Chemistry
>>> TU Vienna
>>> Getreidemarkt 9
>>> A-1060 Vienna
>>> Austria
>>> +43-1-5880115671
>>
>> Dr. Paul Fons
>> Senior Research Scientist
>> Functional Nano-phase-change Research Team
>> Nanoelectronics Research Institute
>> National Institute for Advanced Industrial Science & Technology
>> METI
>>
>> AIST Central 4, Higashi 1-1-1
>> Tsukuba, Ibaraki JAPAN 305-8568
>>
>> tel. +81-298-61-5636
>> fax. +81-298-61-2939
>>
>> email: paul-fons at aist.go.jp
>>
>> The following lines are in a Japanese font
>>
>> 〒305-8562 茨城県つくば市つくば中央東 1-1-1
>> 産業技術総合研究所
>> ナノエレクトロニクス研究部門
>> 相変化新規機能デバイス研究チーム
>> 主任研究員
>> ポール・フォンス
>>
>>
>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------
>
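For completeness, the WIEN_MPIRUN template quoted above can be expanded by hand to see the command line that the parallel scripts would presumably run. This is a minimal sketch, not the code WIEN2k itself uses; expand_mpirun is a hypothetical helper, and it assumes the substituted values contain no |, & or backslash characters (they would confuse sed).

```shell
#!/bin/sh
# Sketch: expand a WIEN_MPIRUN-style template by hand.
# Placeholders (from parallel_options): _NP_, _HOSTS_, _EXEC_.
# Assumption: the values contain no '|', '&' or backslash characters.

# usage: expand_mpirun TEMPLATE NP HOSTSFILE "EXECUTABLE ARGS"
expand_mpirun() {
    template=$1
    np=$2
    hosts=$3
    exec_cmd=$4
    printf '%s\n' "$template" |
        sed -e "s|_NP_|$np|" -e "s|_HOSTS_|$hosts|" -e "s|_EXEC_|$exec_cmd|"
}

# Example with the template from parallel_options and the .machine0 file:
expand_mpirun 'mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_' 4 .machine0 'lapw0_mpi lapw0.def'
# prints: mpirun -np 4 -machinefile .machine0 lapw0_mpi lapw0.def
```

Running the printed command directly in the case directory is a way to test the MPI binary outside the lapw0para wrapper, as was done by hand earlier in this thread.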