[Wien] MPI run problem

Laurence Marks L-marks at northwestern.edu
Sun May 12 01:00:20 CEST 2013


Hmmm, more information but not useful. I don't see anything obviously
wrong with what you are doing. Please regress to something simple
(e.g. TiC) -- I know it is not useful to run this with mpi, but as a
test it is useful to verify things.
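
A minimal sketch of such a test setup, assuming the usual .machines
layout (a lapw0 line, one "1:host:n" line for a single mpi group,
granularity, extrafine). The helper name make_machines and its
arguments are illustrative and not part of Wien2k. Keeping all ranks on
one node should also help separate a link problem from an
environment-propagation problem.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal .machines file that runs lapw0 and
# one k-point job as a single MPI group of NPROC processes on HOST.
# Assumption: the standard Wien2k .machines syntax "weight:host:nproc".
make_machines() {
    host=$1
    nproc=$2
    printf 'lapw0:%s:%s\n' "$host" "$nproc"
    printf '1:%s:%s\n' "$host" "$nproc"
    printf 'granularity:1\n'
    printf 'extrafine:1\n'
}

# Usage (inside an initialized TiC case directory; host name is an example):
#   make_machines hansen-b004 4 > .machines
#   run_lapw -p
```

If this small case also dies in lapw1 with all ranks on one node, the
problem is more likely in the build than in the cluster setup.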

Also, check your link line using the Intel MKL Link Line Advisor:
http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
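
One thing that can be checked by hand is which BLACS variant the link
line names, since an Open MPI build needs the openmpi BLACS. A rough
sketch follows; list_blacs and check_openmpi_blacs are hypothetical
helper names, and the OPTIONS file path in the usage comment is an
assumption about where your settings live.

```shell
#!/bin/sh
# Hypothetical helper: list every MKL BLACS library name mentioned in a file
# (for example an OPTIONS file or a Makefile), one per line, without duplicates.
list_blacs() {
    grep -o 'mkl_blacs[A-Za-z0-9_]*' "$1" | LC_ALL=C sort -u
}

# Hypothetical helper: return 0 only if every BLACS entry is the Open MPI
# variant, 1 if another variant is present, 2 if no BLACS entry is found.
check_openmpi_blacs() {
    found=$(list_blacs "$1")
    if [ -z "$found" ]; then
        echo "no MKL BLACS library named in $1"
        return 2
    fi
    bad=$(printf '%s\n' "$found" | grep -v 'blacs_openmpi' || true)
    if [ -n "$bad" ]; then
        echo "non-Open MPI BLACS in $1:"
        echo "$bad"
        return 1
    fi
    echo "all BLACS entries in $1 are the Open MPI variant"
}

# Usage (path is an assumption):
#   check_openmpi_blacs "$WIENROOT/OPTIONS"
```

This only inspects names in the file; it does not replace the advisor,
which also covers library order and threading layer.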

For reference, the last time I compiled against openmpi my options
were (for a mainly static compilation):
FOPT =  -FR -I$(MKLINC) -r8 -pc80 -mp1 -prec_div -fpconstant
-traceback -pad -align -O2 -ipo1 -DINTEL_VML -i-static -fminshared
-xHost -thread -assu buff -no-complex-limited-range
FPOPT = $(FOPT) -I$(FFTW)/include
DParallel = '-DParallel'
FGEN = $(PARALLEL)
LDFLAGS = $(FOPT) -L$(MKLPATH) -pthread -i-static
R_LIBS = $(FOPT) -Wl,--start-group $(MKLPATH)/libmkl_intel_lp64.a
$(MKLPATH)/libmkl_intel_thread.a $(MKLPATH)/libmkl_core.a
-Wl,--end-group -lpthread -static -liomp5
C_LIBS = $(R_LIBS)
RP_LIBS =  $(FOPT) -L$(FFTW)/lib -lfftw_mpi -lfftw
$(MKLPATH)/libmkl_scalapack_lp64.a -Wl,--start-group
$(MKLPATH)/libmkl_intel_lp64.a $(MKLPATH)/libmkl_sequential.a
$(MKLPATH)/libmkl_core.a $(MKLPATH)/libmkl_blacs_openmpi_lp64.a
-Wl,--end-group -lpthread -liomp5
CP_LIBS = $(RP_LIBS)
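
Once something is linked, it may be worth confirming what the mpi
executable resolves at run time on the compute node. A sketch using
ldd, which only reports shared libraries, so any statically linked MKL
parts will not appear. The helper names and the $WIENROOT path in the
usage comment are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print the names of
# shared libraries that the loader could not resolve.
unresolved_libs() {
    awk '/not found/ { print $1 }'
}

# Hypothetical helper: read `ldd` output on stdin and print the path that
# each libmpi* entry resolves to, to confirm it is the intended Open MPI.
# Unresolved entries are skipped here; unresolved_libs reports those.
mpi_lib_paths() {
    awk '$1 ~ /^libmpi/ && !/not found/ { print $3 }'
}

# Usage (run on the node where the job starts; path is an assumption):
#   ldd "$WIENROOT/lapw1c_mpi" | unresolved_libs
#   ldd "$WIENROOT/lapw1c_mpi" | mpi_lib_paths
```

If the libmpi path differs from the Open MPI used at compile time, or
anything is listed as unresolved, that would be the first thing to fix.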


On Sat, May 11, 2013 at 4:18 PM, alonofrio at comcast.net
<alonofrio at comcast.net> wrote:
> Hello again,
>
> I commented out the line call W2kinit, and now I have a more descriptive
> message, but I am still lost. I am not sure whether it is not finding some
> libraries or whether the environment variables are not being propagated
> to all nodes.
>
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image              PC                Routine            Line        Source
> libmpi.so.1        00002B746414FF7A  Unknown               Unknown  Unknown
> lapw1c_mpi         00000000004E9192  Unknown               Unknown  Unknown
> libmkl_scalapack_  00002B746330B231  Unknown               Unknown  Unknown
>
> Any ideas? I'm so sorry for all the questions
>
> David Guzman
>
> On May 11, 2013, at 4:46 PM, Laurence Marks <L-marks at northwestern.edu>
> wrote:
>
> The addition of the signal trapping in Wien2k (W2kinit in lapw[0-2].F
> and others) has a plus and a minus. The pluses are that the weekly
> emails on the list about ulimit-associated crashes have stopped, and also
> (perhaps not so obvious) that mpi tasks die more gracefully. Unfortunately
> it can also make it less than clear what is wrong with an mpi job.
>
> I suggest (and others should do the same as needed) that you comment
> out the "call W2kinit" in lapw1, recompile just lapw1 then try again
> -- hopefully you will get a more human understandable message.
> IMPORTANT: check hansen-b004 to ensure that you do not have any zombie
> processes still running; depending upon what version of ssh you are
> running you may have them hanging around.
>
> On Sat, May 11, 2013 at 3:31 PM, alonofrio at comcast.net
> <alonofrio at comcast.net> wrote:
>
> Thanks Professor Marks,
> I corrected my compiler options; I had a mistake with the openmpi blacs
> library. However, I still get errors when trying to run. It always stops
> when it starts lapw1.
> Now it is giving me this error:
> w2k_dispatch_signal(): received: Segmentation fault
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 0.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> and when I look at the case.dayfile I see this:
>
> hansen-b004 hansen-b004 hansen-b004 hansen-b004(5)
> Child id           0 SIGSEGV, contact developers
> Child id           1 SIGSEGV, contact developers
> Child id           2 SIGSEGV, contact developers
> Child id           3 SIGSEGV, contact developers
> 0.085u 0.365s 0:01.49 29.5%     0+0k 0+0io 57pf+0w
>
> Thanks for your help. Any comments are well appreciated.
>
> Alex Onofrio
> Departamento de Fisica
> Universidad de Los Andes
> Bogota, Colombia
>
>
>
> On May 11, 2013, at 1:43 PM, Laurence Marks <L-marks at northwestern.edu>
> wrote:
>
> You need to use the openmpi blacs. Please check the intel compilation
> assistance webpage (previously posted, so check the list).
>
> On May 11, 2013 12:10 PM, "alonofrio at comcast.net" <alonofrio at comcast.net>
> wrote:
>
>
> Wien2k User
>
> I am trying to get the MPI capabilities of Wien running, but I ran into
> some complications.
> The whole compilation process goes fine with no errors, but when I try to
> run the code through run_lapw it stops at the beginning of the lapw1 program
> with the following error:
>
> w2k_dispatch_signal(): received: Segmentation fault
> MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
> with errorcode 28607820.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 7 with PID 60685 on
> node carter-a355.rcac.purdue.edu exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
>
> This repeats the same number of times as the number of processors
> submitted as mpi jobs.
>
> Here are my compilation options as shown in the OPTIONS file:
>
> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
> current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/em64t -pthread
> current:DPARALLEL:'-DParallel'
> current:R_LIBS:-lmkl_blas95_lp64 -lmkl_lapack95_lp64
> $(MKLROOT)/lib/em64t/libmkl_scalapack_lp64.a -Wl,--start-group
> $(MKLROOT)/lib/em64t/libmkl_cdft_core.a
> $(MKLROOT)/lib/em64t/libmkl_intel_lp64.a
> $(MKLROOT)/lib/em64t/libmkl_intel_thread.a
> $(MKLROOT)/lib/em64t/libmkl_core.a
> $(MKLROOT)/lib/em64t/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group
> -openmp -lpthread
> current:RP_LIBS:-lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64
> -L/apps/rhel6/fftw-3.3.1/openmpi-1.4.4_intel-12.0.084/lib -lfftw3_mpi
> -lfftw3 $(R_LIBS)
> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>
> and these are the options in the parallel_options file:
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_
> -hostfile _HOSTS_ _EXEC_"
>
> I compiled the code with intel 12.0.084, openmpi 1.4.4 (compiled with
> intel 12.0.084) and fftw 3.3.1 (compiled with intel 12.0.084 and openmpi
> 1.4.4).
> I am trying to run the code on the university cluster, which has InfiniBand
> and Intel Xeon E5 processors.
>
> I hope this information is enough for any of you to point me to the
> problem.
>
> Thanks so much for your time
>
> Alex Onofrio
> Departamento de Fisica
> Universidad de Los Andes
> Bogota, Colombia
>
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
>
>
>
>
> --
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu 1-847-491-3996
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Gyorgi



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgi

