[Wien] Question about MPI parallelization

Kyoo Kim kyoo at physics.rutgers.edu
Fri Sep 25 05:23:04 CEST 2009


Dear All, 

We are trying to calculate a large system (more than 200 atoms) with MPI parallelization, starting with RKmax = 3.5-4.

The GGA potential is used, and it is a magnetic system without inversion symmetry.

We ran into MPI errors and have been trying to resolve them for a couple of weeks, so far without success.

 

We have the following setup.

---------------------------------------------------------------------------------------------

1) Hardware:

i)   Xeon 5400 series nodes with a Gigabit Ethernet connection and 1-3 GB of memory per core

ii)  dual-core Opteron 8214 nodes connected via InfiniBand

 

2) Software: the latest WIEN2k, downloaded from the WIEN2k website.

    OS: SuSE 10.2

    Compilers: Intel 9.1 & Intel 10.1 

    Libraries: MKL 9.0 & 10.2 

                   BLACS & SCALAPACK: compiled separately for MKL 9.0; taken from the Intel MKL package for MKL 10.1

                   FFTW 2 & 3

                   MPICH 1.2.6 (MPICH2 not tested yet); the Opterons have MVAPICH 0.9.9

-----------------------------------------------------------------------------------------------

 

 

The test results are:

>>> First we compiled MKL 9.0 + Intel Fortran 9.0 + BLACS + SCALAPACK + FFTW2&3 + MPICH1 on the Xeon machines,

and MKL 9.0 + Intel Fortran 9.0 + BLACS + SCALAPACK + FFTW2&3 + MVAPICH on the Opteron machines.

Both gave us similar results:

 

LAPW0 was OK every time.

 

LAPW1 gives us the following problems:

-----------------------------------------------------------------------------

8 - <NO ERROR MESSAGE> : Could not convert index 101833776 into a pointer 

The index may be an incorrect argument.

Possible sources of this problem are a missing "include 'mpif.h'", 

a misspelled MPI object (e.g., MPI_COM_WORLD instead of MPI_COMM_WORLD) 

or a misspelled user variable for an MPI object (e.g., com instead of comm).

[8] [] Aborting Program!

-----------------------------------------------------------------------------

 

 

The output from the dayfile is:

-------------------------------------------------------------------------

Calculating mag in /mnt/parallel/30109udo/mag on n109 with PID 27390

 

    start       (Wed Sep 23 23:34:30 EDT 2009) with lapw0 (40/99 to go)

----

>   lapw1  -c -up -p    (23:39:26) starting parallel lapw1 at Wed Sep 23 23:39:26 EDT 2009

->  starting parallel LAPW1 jobs at Wed Sep 23 23:39:26 EDT 2009

running LAPW1 in parallel mode (using .machines)

1 number_of_parallel_jobs

     n107 n107 n107 n107 n107 n107 n107 n107 n109 n109 n109 n109 n109 n109 n109 n109 n110 n110 n110 n110 n110 n110 n110 n110(5)

command: n107 cd /parallel/30109udo/mag;time /opt/mvapich/intel_ud/bin/mpirun -np 24 -machinefile .machine1 /home/wien2k_09_ib//lapw1c_mpi uplapw1_1.def;rm -f .lock_n1071

rh: n107

cmd: cd /parallel/30109udo/mag;time /opt/mvapich/intel_ud/bin/mpirun -np 24 -machinefile .machine1 /home/wien2k_09_ib//lapw1c_mpi uplapw1_1.def;rm -f .lock_n1071

/opt/SGE/bin/lx24-amd64/qrsh -V -inherit n107 cd /parallel/30109udo/mag;time /opt/mvapich/intel_ud/bin/mpirun -np 24 -machinefile .machine1 /home/wien2k_09_ib//lapw1c_mpi uplapw1_1.def;rm -f .lock_n1071

**  LAPW1 crashed!

0.084u 0.152s 0:07.48 3.0%      0+0k 0+0io 12pf+0w

error: command   /home/wien2k_09_ib/lapw1cpara -up -c uplapw1.def   failed

 

>   stop error

 

 

 

>>> If we change the following line in the BLACS configuration:

      TRANSCOMM = -DUseMpich -DPOINTER_64_BITS=1  

      to

      TRANSCOMM = -DUseMpich 

      

it compiles and more or less works, but it usually crashes after the first iteration, writing "NaN" into the vector files and the output2 files, which then causes LAPW2 to fail.

 

For example, for the InfiniBand version the output looks like this:

 

 LAPW0 END

forrtl: severe (174): SIGSEGV, segmentation fault occurred

Image              PC                Routine            Line        Source

 

lapw1c_mpi         000000000097ADB8  Unknown               Unknown  Unknown

lapw1c_mpi         00000000009793FD  Unknown               Unknown  Unknown

lapw1c_mpi         00000000004A8EE3  Unknown               Unknown  Unknown

lapw1c_mpi         00000000004E811C  Unknown               Unknown  Unknown

lapw1c_mpi         00000000004C31B2  Unknown               Unknown  Unknown

lapw1c_mpi         0000000000475538  Unknown               Unknown  Unknown

lapw1c_mpi         000000000048F18D  Unknown               Unknown  Unknown

lapw1c_mpi         00000000004549CC  seclr4_                   280  seclr4_tmp_.F

lapw1c_mpi         0000000000416BD0  calkpt_                   200  calkpt_tmp_.F

lapw1c_mpi         000000000043C77D  MAIN__                     60  lapw1_tmp_.F

lapw1c_mpi         000000000040F0CA  Unknown               Unknown  Unknown

libc.so.6          00002AFA1178FAE4  Unknown               Unknown  Unknown

lapw1c_mpi         000000000040F015  Unknown               Unknown  Unknown

 

2) We got a similar story with the Intel 10.1 compiler + MKL 10.2 + the MKL 10.2 BLACS and SCALAPACK + MPICH1.

 

We should mention that for smaller cases (a few atoms) with inversion symmetry (REAL version), lapw1_mpi worked.

 

Yesterday we found the link below, where a similar problem is mentioned and claimed to be an MPICH1 bug that can be patched to a certain extent, but ...

http://www.pgroup.com/resources/mpich/mpich126_pgi60.htm#ISSUES

(see the "Known issues" item).
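As far as we understand it, both the "Could not convert index ... into a pointer" message and the TRANSCOMM setting concern the same thing: how BLACS translates the Fortran integer communicator handle it receives from lapw1c_mpi into a C MPI_Comm. Just for reference (this is not WIEN2k or BLACS code, only a little stand-alone test of our own, and it assumes the MPI library provides the MPI-2 handle-conversion routines MPI_Comm_c2f / MPI_Comm_f2c, which our MPICH 1.2.6 / MVAPICH 0.9.9 may or may not), the portable MPI-2 way to do this conversion, independent of whether the C handle is an int or a 64-bit pointer, would be something like:

#include <stdio.h>
#include <mpi.h>

/* Stand-alone test: convert MPI_COMM_WORLD to a Fortran integer handle
 * and back, the way an MPI-2 library exposes it, then use the result. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Fint fcomm = MPI_Comm_c2f(MPI_COMM_WORLD);  /* C -> Fortran handle */
    MPI_Comm ccomm = MPI_Comm_f2c(fcomm);           /* Fortran handle -> C */

    int rank;
    MPI_Comm_rank(ccomm, &rank);
    printf("rank %d: Fortran communicator handle = %d\n", rank, (int) fcomm);

    MPI_Finalize();
    return 0;
}

If such a test compiled with mpicc runs fine under the same mpirun, but the BLACS-linked lapw1c_mpi still aborts with the pointer-conversion error, then the problem is presumably in how BLACS was told to do the translation (the TRANSCOMM setting) rather than in the MPI library itself.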

 

============================================================================

S U M M A R Y :

 

What kind of compilation options should we use to make such a big system work?

Will it be possible to use Gigabit Ethernet for such a computation, or is InfiniBand mandatory?

What are the hardware requirements for a system with roughly 200-500 atoms? (A rough memory estimate of our own is given below.)
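For orientation, a rough estimate we made ourselves (please correct us if it is naive): without inversion symmetry the Hamiltonian and overlap matrices are complex*16, so each of them takes 16*NMAT^2 bytes. For a matrix size of, say, NMAT = 40000 that is already about 25 GB per matrix, which ScaLAPACK has to distribute over all MPI processes of one k-point job, on top of workspace for the diagonalization.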

 

We summarize the options for our WIEN2k-related compilations:

 

1) Gigabit compilation:

COMPILER :/opt/intel/fce/10.1/bin/ifort

COMPILERC: cc

COMPILERP: /opt/mpich-1.2.6-intel-10.1/bin/mpif90

 

FOPT:-FR -mp1 -w -prec_div -pc80 -pad -align -DINTEL_VML -traceback

FPOPT:$(FOPT)

LDFLAGS:$(FOPT) -L/opt/intel/mkl/10.0/lib/em64t -lpthread -i-static 

DPARALLEL:'-DParallel'

R_LIBS:-L/opt/intel/mkl/10.0/lib/em64t/ -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_em64t -lguide -lmkl_core -lpthread 

RP_LIBS:-L/opt/intel/mkl/10.0/lib/em64t/ -lmkl_intel_lp64 -lmkl_scalapack_lp64 -lmkl_blacs_lp64 -lmkl_sequential $(R_LIBS) -L /home/lib64/fftw/lib/ -lfftw_mpi -lfftw 

MPIRUN:/opt/mpich-1.2.6-intel-10.1/bin/mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

 

 

2) InfiniBand compilation

 

COMPILER   :/opt/intel/fce/9.1/bin/ifort

COMPILERC:cc

COMPILERP : /opt/mvapich/intel_ud/bin/mpif90

 

FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback

FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback

LDFLAGS:$(FOPT) -L/opt/intel/mkl/9.0/lib/em64t -lpthread

DPARALLEL:'-DParallel'

R_LIBS:-L/opt/intel/mkl/9.0/lib/em64t -lmkl_lapack64 -lmkl_em64t -lguide -lvml -pthread 

RP_LIBS:-L/home/lib64/ib/SCALAPACK -lscalapack /home/lib64/ib/BLACS/LIB/blacsCinit_MPI-LINUX-0.a /home/lib64/ib/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /home/lib64/ib/BLACS/LIB/blacs_MPI-LINUX-0.a $(R_LIBS) -L/home/lib64/ib/fftw/lib -lfftw3 -lfftw_mpi -lfftw -lm -i-static

MPIRUN:/opt/mvapich/intel_ud/bin/mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

 

 

Relevant BLACS and SCALAPACK OPTIONS:

BLACS:

 

home=/home/lib64/ib

HOME=/home/lib64/ib

   COMMLIB = MPI

   PLAT = LINUX

 

   MPIdir =   /opt/mvapich/intel_ud

   MPILIBdir = $(MPIdir)/lib/

   MPIINCdir = $(MPIdir)/include

   MPILIB = $(MPILIBdir)/libmpich.a

 

   INTFACE = -DAdd_

   TRANSCOMM = -DUseMpich -DPOINTER_64_BITS=1 

 

   F77            = ifort

   F77NO_OPTFLAGS = -O0

   F77FLAGS       = $(F77NO_OPTFLAGS) -O

   F77LOADER      = $(F77)

   F77LOADFLAGS   = 

   CC            = gcc

   CCFLAGS        = -O4

   CCLOADER       = $(CC)

   CCLOADFLAGS    = 

 

 

SCALAPACK:

 

SHELL         = /bin/sh

home          = /home/lib64/ib/SCALAPACK

 

PLAT          = LINUX

BLACSDBGLVL   = 0

BLACSdir      = /home/lib64/ib/BLACS/LIB

 

USEMPI        = -DUsingMpiBlacs

SMPLIB        = /opt/mvapich/intel_ud/lib/libmpich.a

BLACSFINIT    = $(BLACSdir)/blacsF77init_MPI-LINUX-0.a

BLACSCINIT    = $(BLACSdir)/blacsCinit_MPI-LINUX-0.a

BLACSLIB      = $(BLACSdir)/blacs_MPI-LINUX-0.a

TESTINGdir    = $(home)/TESTING

 

F77           = /opt/mvapich/intel_ud/bin/mpif90

CC            = /opt/mvapich/intel_ud/bin/mpicc

F77FLAGS      = -O3 $(NOOPT)

CCFLAGS       = -O3

SRCFLAG       =

F77LOADER     = $(F77)

CCLOADER      = $(CC)

F77LOADFLAGS  =

CCLOADFLAGS   =

CDEFS         = -DAdd_ -DNO_IEEE $(USEMPI)

ARCH          = ar

ARCHFLAGS     = cr

RANLIB        = ranlib

SCALAPACKLIB  = $(home)/libscalapack.a

LAPACKLIB     = -L/opt/intel/mkl/9.0/lib/em64t/ -lmkl_lapack64 -lmkl_em64t -lguide -lvml -pthread  

 

Please help us!

Thank you very much for any suggestion, tip, or anything else that can help us resolve this issue.

 

Kyoo and Viktor.

Department of Physics.

Rutgers.

 

 

 

 

