[Wien] Question about MPI parallelization
    Kyoo Kim 
    kyoo at physics.rutgers.edu
       
    Fri Sep 25 05:23:04 CEST 2009
    
    
  
Dear All, 
We are trying to calculate a large system with more than 200 atoms using MPI parallelization, starting with RKmax = 3.5-4.
The GGA potential is used, and the system is magnetic and has no inversion symmetry.
We keep getting MPI errors and have been trying to resolve this issue for a couple of weeks already, without success.
 
We have the following setup.
---------------------------------------------------------------------------------------------
1)  Hardware:
i)   Xeon 5400-series nodes with Gigabit Ethernet and 1-3 GB of memory per core
ii)  dual-core Opteron 8214 nodes connected via InfiniBand
 
2)  Software: latest WIEN2k, downloaded from the WIEN2k website
    OS: SuSE 10.2
    Compilers: Intel 9.1 & Intel 10.1
    Libraries: MKL 9.0 & 10.2
               BLACS & SCALAPACK compiled separately for MKL 9.0, and taken from the Intel MKL package for MKL 10.1
               FFTW 2 & 3
               MPICH 1.2.6 (MPICH2 not tested yet); the Opterons have MVAPICH 0.9.9
-----------------------------------------------------------------------------------------------
 
 
Our test results are as follows:

>>> First we compiled MKL 9.0 + Intel Fortran 9.0 + BLACS + SCALAPACK + FFTW2&3 + MPICH1
on the Xeon machines, and MKL 9.0 + Intel Fortran 9.0 + BLACS + SCALAPACK + FFTW2&3 + MVAPICH
on the Opteron machines.
Both gave us similar results:
 
LAPW0 was OK every time.
 
LAPW1 gives us the following error:
-----------------------------------------------------------------------------
8 - <NO ERROR MESSAGE> : Could not convert index 101833776 into a pointer 
The index may be an incorrect argument.
Possible sources of this problem are a missing "include 'mpif.h'", 
a misspelled MPI object (e.g., MPI_COM_WORLD instead of MPI_COMM_WORLD) 
or a misspelled user variable for an MPI object (e.g., com instead of comm).
[8] [] Aborting Program!
-----------------------------------------------------------------------------
 
 
The output from the dayfile is:
-------------------------------------------------------------------------
Calculating mag in /mnt/parallel/30109udo/mag on n109 with PID 27390
 
    start       (Wed Sep 23 23:34:30 EDT 2009) with lapw0 (40/99 to go)
----
>   lapw1  -c -up -p    (23:39:26) starting parallel lapw1 at Wed Sep 23 23:39:26 EDT 2009
->  starting parallel LAPW1 jobs at Wed Sep 23 23:39:26 EDT 2009
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
     n107 n107 n107 n107 n107 n107 n107 n107 n109 n109 n109 n109 n109 n109 n109 n109 n110 n110 n110 n110 n110 n110 n110 n110(5)
command: n107 cd /parallel/30109udo/mag;time /opt/mvapich/intel_ud/bin/mpirun -np 24 -machinefile .machine1 /home/wien2k_09_ib//lapw1c_mpi uplapw1_1.def;rm -f .lock_n1071
rh: n107
cmd: cd /parallel/30109udo/mag;time /opt/mvapich/intel_ud/bin/mpirun -np 24 -machinefile .machine1 /home/wien2k_09_ib//lapw1c_mpi uplapw1_1.def;rm -f .lock_n1071
/opt/SGE/bin/lx24-amd64/qrsh -V -inherit n107 cd /parallel/30109udo/mag;time /opt/mvapich/intel_ud/bin/mpirun -np 24 -machinefile .machine1 /home/wien2k_09_ib//lapw1c_mpi uplapw1_1.def;rm -f .lock_n1071
**  LAPW1 crashed!
0.084u 0.152s 0:07.48 3.0%      0+0k 0+0io 12pf+0w
error: command   /home/wien2k_09_ib/lapw1cpara -up -c uplapw1.def   failed
 
>   stop error
 
 
 
>>> If we change the following line in the BLACS configuration:
      TRANSCOMM = -DUseMpich -DPOINTER_64_BITS=1  
      to
      TRANSCOMM = -DUseMpich 
      
it compiles and more or less works, but it usually crashes after the first iteration,
writing "NaN" into the vector files and the output2 files, which then causes LAPW2 to fail.
(A small sketch of what we think TRANSCOMM actually does is given after the traceback below.)
 
For example, for the InfiniBand version the output looks like this:
 
 LAPW0 END
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
 
lapw1c_mpi         000000000097ADB8  Unknown               Unknown  Unknown
lapw1c_mpi         00000000009793FD  Unknown               Unknown  Unknown
lapw1c_mpi         00000000004A8EE3  Unknown               Unknown  Unknown
lapw1c_mpi         00000000004E811C  Unknown               Unknown  Unknown
lapw1c_mpi         00000000004C31B2  Unknown               Unknown  Unknown
lapw1c_mpi         0000000000475538  Unknown               Unknown  Unknown
lapw1c_mpi         000000000048F18D  Unknown               Unknown  Unknown
lapw1c_mpi         00000000004549CC  seclr4_                       280  seclr4_tmp_.F
lapw1c_mpi         0000000000416BD0  calkpt_                       200  calkpt_tmp_.F
lapw1c_mpi         000000000043C77D  MAIN__                         60  lapw1_tmp_.F
lapw1c_mpi         000000000040F0CA  Unknown               Unknown  Unknown
libc.so.6          00002AFA1178FAE4  Unknown               Unknown  Unknown
lapw1c_mpi         000000000040F015  Unknown               Unknown  Unknown
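
As far as we understand it, the TRANSCOMM setting tells BLACS how to turn the Fortran
communicator handle (an INTEGER) coming from lapw1_mpi/ScaLAPACK into a C MPI_Comm.
Just to make clear what we mean, below is a minimal sketch of our own (it is NOT WIEN2k
or BLACS code, and the routine name is made up): it shows the portable MPI-2 conversion
MPI_Comm_f2c, whereas the -DUseMpich / -DPOINTER_64_BITS path reinterprets the handle in
an MPICH1-specific way; our suspicion is that this translation going wrong is what
produces the "Could not convert index ... into a pointer" abort and the NaNs.
-----------------------------------------------------------------------------
/* Minimal sketch, our own illustration only (NOT WIEN2k or BLACS code).
 * A Fortran code passes a communicator to C as an integer handle
 * (MPI_Fint); on the C side it must be converted into an MPI_Comm.
 * MPI_Comm_f2c is the portable MPI-2 way to do this conversion.        */
#include <mpi.h>
#include <stdio.h>

/* hypothetical routine standing in for the C side of BLACS */
static void use_fortran_comm(MPI_Fint f_handle)
{
    MPI_Comm comm = MPI_Comm_f2c(f_handle);   /* MPI-2 handle conversion */
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    printf("rank %d of %d\n", rank, size);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* round-trip MPI_COMM_WORLD through a Fortran handle just for the demo */
    use_fortran_comm(MPI_Comm_c2f(MPI_COMM_WORLD));
    MPI_Finalize();
    return 0;
}
-----------------------------------------------------------------------------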
 
2) We got a similar story with the Intel 10.1 compiler + MKL 10.2 + the MKL 10.2 BLACS and SCALAPACK + MPICH1.
 
We should mention that for smaller cases (a few atoms) with inversion symmetry (REAL version),
lapw1_mpi worked.
 
Yesterday we found the link below, where a similar problem is mentioned and claimed to be
an MPICH1 bug that can be patched to a certain extent, but ...
http://www.pgroup.com/resources/mpich/mpich126_pgi60.htm#ISSUES
(see the "Known issues" item).
 
============================================================================
S U M M A R Y :
 
What compilation options should we use to work with such big systems?
Will it be possible to use Gigabit Ethernet for such a computation, or is InfiniBand mandatory?
What are the hardware requirements for a system with something like 200-500 atoms?
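
For the last question, here is a rough back-of-envelope sketch of our own (the matrix
dimension NMAT below is only an assumed guess, not a measured value): in lapw1_mpi the
dominant memory consumers should be the complex Hamiltonian and overlap matrices, which
ScaLAPACK distributes over the MPI ranks, so the per-core memory would scale roughly as
2*NMAT^2*16 bytes divided by the number of ranks.
-----------------------------------------------------------------------------
/* Rough back-of-envelope, our own sketch with ASSUMED numbers (not measured). */
#include <stdio.h>

int main(void)
{
    double nmat  = 40000.0;  /* assumed matrix dimension (NMAT) for a >200-atom cell */
    double ranks = 24.0;     /* MPI processes per k-point, as in our test runs       */

    double total_gb    = 2.0 * nmat * nmat * 16.0 / 1.0e9;  /* H + S, complex*16 */
    double per_rank_gb = total_gb / ranks;

    printf("total matrices ~ %.0f GB, per rank ~ %.1f GB\n", total_gb, per_rank_gb);
    return 0;
}
-----------------------------------------------------------------------------
If an estimate like this is roughly right, our 1-3 GB per core is already tight with 24
ranks, which is part of why we ask whether Gigabit with more nodes can work at all or
whether InfiniBand (and more memory per node) is mandatory.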
 
We summarize the options for our WIEN2k-related compilations:
 
1) Gigabit compilation:
COMPILER :/opt/intel/fce/10.1/bin/ifort
COMPILERC: cc
COMPILERP: /opt/mpich-1.2.6-intel-10.1/bin/mpif90
 
FOPT:-FR -mp1 -w -prec_div -pc80 -pad -align -DINTEL_VML -traceback
FPOPT:$(FOPT)
LDFLAGS:$(FOPT) -L/opt/intel/mkl/10.0/lib/em64t -lpthread -i-static 
DPARALLEL:'-DParallel'
R_LIBS:-L/opt/intel/mkl/10.0/lib/em64t/ -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread 
-lmkl_em64t -lguide -lmkl_core -lpthread 
RP_LIBS:-L/opt/intel/mkl/10.0/lib/em64t/ -lmkl_intel_lp64 -lmkl_scalapack_lp64 -lmkl_blacs_lp64 
-lmkl_sequential $(R_LIBS) -L /home/lib64/fftw/lib/ -lfftw_mpi -lfftw 
MPIRUN:/opt/mpich-1.2.6-intel-10.1/bin/mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
 
 
2) Infiniband compilation
 
COMPILER   :/opt/intel/fce/9.1/bin/ifort
COMPILERC:cc
COMPILERP : /opt/mvapich/intel_ud/bin/mpif90
 
FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
LDFLAGS:$(FOPT) -L/opt/intel/mkl/9.0/lib/em64t -lpthread
DPARALLEL:'-DParallel'
R_LIBS:-L/opt/intel/mkl/9.0/lib/em64t -lmkl_lapack64 -lmkl_em64t -lguide -lvml -pthread
RP_LIBS:-L/home/lib64/ib/SCALAPACK -lscalapack /home/lib64/ib/BLACS/LIB/blacsCinit_MPI-LINUX-0.a /home/lib64/ib/BLACS/LIB/blacsF77init_MPI-LINUX-0.a /home/lib64/ib/BLACS/LIB/blacs_MPI-LINUX-0.a $(R_LIBS) -L/home/lib64/ib/fftw/lib -lfftw3 -lfftw_mpi -lfftw -lm -i-static
MPIRUN:/opt/mvapich/intel_ud/bin/mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
 
 
Relevant BLACS and SCALAPACK OPTIONS:
BLACS:
 
home=/home/lib64/ib
HOME=/home/lib64/ib
   COMMLIB = MPI
   PLAT = LINUX
 
   MPIdir =   /opt/mvapich/intel_ud
   MPILIBdir = $(MPIdir)/lib/
   MPIINCdir = $(MPIdir)/include
   MPILIB = $(MPILIBdir)/libmpich.a
 
   INTFACE = -DAdd_
   TRANSCOMM = -DUseMpich -DPOINTER_64_BITS=1 
 
   F77            = ifort
   F77NO_OPTFLAGS = -O0
   F77FLAGS       = $(F77NO_OPTFLAGS) -O
   F77LOADER      = $(F77)
   F77LOADFLAGS   = 
   CC            = gcc
   CCFLAGS        = -O4
   CCLOADER       = $(CC)
   CCLOADFLAGS    = 
 
 
SCALAPACK:
 
SHELL         = /bin/sh
home          = /home/lib64/ib/SCALAPACK
 
PLAT          = LINUX
BLACSDBGLVL   = 0
BLACSdir      = /home/lib64/ib/BLACS/LIB
 
USEMPI        = -DUsingMpiBlacs
SMPLIB        = /opt/mvapich/intel_ud/lib/libmpich.a
BLACSFINIT    = $(BLACSdir)/blacsF77init_MPI-LINUX-0.a
BLACSCINIT    = $(BLACSdir)/blacsCinit_MPI-LINUX-0.a
BLACSLIB      = $(BLACSdir)/blacs_MPI-LINUX-0.a
TESTINGdir    = $(home)/TESTING
 
F77           = /opt/mvapich/intel_ud/bin/mpif90
CC            = /opt/mvapich/intel_ud/bin/mpicc
F77FLAGS      = -O3 $(NOOPT)
CCFLAGS       = -O3
SRCFLAG       =
F77LOADER     = $(F77)
CCLOADER      = $(CC)
F77LOADFLAGS  =
CCLOADFLAGS   =
CDEFS         = -DAdd_ -DNO_IEEE $(USEMPI)
ARCH          = ar
ARCHFLAGS     = cr
RANLIB        = ranlib
SCALAPACKLIB  = $(home)/libscalapack.a
LAPACKLIB     =  -L/opt/intel/mkl/9.0/lib/em64t/ -lmkl_lapack64 -lmkl_em64t
-lguide -lvml -pthread  
 
Please help us!
Thank you very much for any suggestion, tip, or anything else that can help us resolve this issue.
 
Kyoo and Viktor.
Department of Physics.
Rutgers.
 
 
 
 
    
    