[Wien] Problem running WIEN2k in parallel

Wei Xie wxie4 at wisc.edu
Mon Oct 11 07:07:10 CEST 2010


Dear WIEN2k developers and users,

We are trying to install WIEN2k 10.1 on a computing cluster and plan to use it for large systems (over 60 atoms/cell). Compilation produced no error messages, and serial tests with the three examples (Fccni, TiC and TiO2) finished quickly and correctly. However, runs in parallel mode (k-point and/or MPI) fail, so we are writing to this mailing list hoping that someone can help. Below are the details of our system, compilers, libraries, compiler options, linker flags and testing.

1. System : SUSE Linux Enterprise Server 10 (x86_64), Intel Xeon X5355 quad core processors (Intel 64),  2 GB memory per core, DDR 4X InfiniBand, PBS Professional queuing system. 

2. Compilers/libraries: ifort and icc from Intel 11.1/046, mpiifort from Intel MPI 3.2.0.011, BLAS, LAPACK and ScaLAPACK from Intel MKL 10.2, and FFTW 2.1.5 (compiled with the "--enable-mpi" switch and installed at /home/user/fftw-2.1.5).
The environment was configured by sourcing the following scripts in bash_profile:
source /usr/local/intel/Compiler/11.1/046/bin/ifortvars.sh intel64           #ifort
source /usr/local/intel/Compiler/11.1/046/mkl/tools/environment/mklvarsem64t.sh         #mkl
source /usr/local/intel/impi/3.2.0.011/bin64/mpivars.sh             #mpi
Their bin, library, and include directories were all set in bash_profile as well.

3. Compiler options: 
For serial:
 O   Compiler options:        -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
 L   Linker Flags:            $(FOPT) -L/opt/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
 P   Preprocessor flags       '-DParallel'
 R   R_LIB (LAPACK+BLAS):     -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide

For parallel:
Shared Memory Architecture: no; 
Remote shell: ssh (password-less log-in enabled);
RP  -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -lmkl_scalapack_lp64 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread -L/home/user/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
 FP  FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
 MP  MPIRUN commando        : mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_

Note: We used all the WIEN2k recommended options/flags except for RP, where we used the output of the Intel MKL Link Line Advisor with dynamic linking, 32-bit integers (LP64), multi-threading, etc. We are not sure whether these are correct (especially the integer length) and would like to hear your suggestions. Our processor specifications are at http://ark.intel.com/Product.aspx?id=28035 .

4. Testing
4.1 Inputs
We used userconfig_lapw to set up the user environment (in particular, the scratch directory is set to /scratch), and then ran the tests with the Fccni example downloaded from the WIEN2k website.

We first ran a spin-polarized calculation in serial, using the parameters recommended in the User's Guide for the initialization. The calculation finished quickly without problems, and the results matched the downloaded outputs well. We then ran save_lapw and clean_lapw so that we could use the same set of input files to test parallelization. We wrote a submission script that creates the .machines file, calculates the number of allocated processors ($nprocs) on the fly, and starts the calculation with: mpirun -np $nprocs runsp_lapw -p -ec 0.0001 -cc 0.0001. We enabled hybrid parallelization (i.e., both k-point and MPI) in this case.

The .machines file created reads: 
1:r1i0n0:8
1:r1i0n1:8
lapw0: r1i0n0:8 r1i0n1:8 
lapw1: r1i0n0:8 r1i0n1:8 
lapw2: r1i0n0:8 r1i0n1:8 
granularity:1
extrafine:1

In this example PBS allocated us two nodes (r1i0n0 and r1i0n1), each with 8 cores (two quad-core CPUs per node). The first two lines are for k-point parallelization and the next three are for MPI (for lapw0, lapw1 and lapw2, respectively).
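
A minimal Python sketch of how a file with this layout could be generated from the PBS node list. This is not the actual submission script: the function name build_machines is hypothetical, and it assumes the node file (e.g. the one named by $PBS_NODEFILE) lists one hostname per allocated core. It only reproduces the layout quoted above and does not claim that this layout is what WIEN2k expects.

```python
from collections import Counter


def build_machines(nodefile_lines):
    """Return .machines text in the layout quoted in this post.

    nodefile_lines: iterable of hostnames, one entry per allocated core
    (an assumption about how the scheduler writes its node file).
    """
    counts = Counter()
    order = []  # hosts in order of first appearance
    for line in nodefile_lines:
        host = line.strip()
        if not host:
            continue  # skip blank lines
        if host not in counts:
            order.append(host)
        counts[host] += 1

    # One k-point line per host: weight:host:cores
    out = [f"1:{host}:{counts[host]}" for host in order]

    # One line per program listing every host with its core count
    hosts = " ".join(f"{host}:{counts[host]}" for host in order)
    for prog in ("lapw0", "lapw1", "lapw2"):
        out.append(f"{prog}: {hosts}")

    out += ["granularity:1", "extrafine:1"]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Two nodes with 8 cores each, as in the allocation described above
    lines = ["r1i0n0"] * 8 + ["r1i0n1"] * 8
    print(build_machines(lines), end="")
```

For the two-node allocation described above, this prints the same seven lines as the .machines file shown, apart from trailing whitespace.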

4.2 Outputs
The job was killed within one minute, with error messages such as:
~ cat aU_SOC.e799326
rm: cannot remove `fccni.vspup': No such file or directory
rm: cannot remove `fccni.vspdn': No such file or directory
rm: cannot remove `fccni.vnsup': No such file or directory
rm: cannot remove `fccni.vnsdn': No such file or directory
/tmp/pbs.799326.service2/sh.piTkRT: No such file or directory.
/tmp/pbs.799326.service2/sh.ygkvzW: No such file or directory.
/tmp/pbs.799326.service2/sh.i4xOi2: No such file or directory.
mv: cannot stat `.tmp': No such file or directory
foreach: No match.
/tmp/pbs.799326.service2/sh.m3zD88: No such file or directory.
/tmp/pbs.799326.service2/sh.xgo6Fb: No such file or directory.
/tmp/pbs.799326.service2/sh.zyICya: No such file or directory.
/tmp/pbs.799326.service2/sh.fI8qUa: No such file or directory.
/tmp/pbs.799326.service2/sh.cghNSa: No such file or directory.
foreach: No match.
mv: cannot stat `.tmp': No such file or directory
rm: No match.
rm: cannot remove `fccni.vns': No such file or directory
rm: cannot remove `fccni.vnsup': No such file or directory
rm: cannot remove `fccni.vnsdn': No such file or directory
rm: cannot remove `fccni.vsp': No such file or directory
rm: cannot remove `fccni.vspdn': No such file or directory
sed: can't read .machinetmp22: No such file or directory
rm: cannot remove `.machinetmp': No such file or directory
machine_i: Subscript out of range.
cut: .machine0: No such file or directory
rm: cannot remove `.machinetmp22': No such file or directory
sed: can't read .machinetmp: No such file or directory
rm: cannot remove `.machinetmp': No such file or directory
mv: cannot stat `.tmp': No such file or directory
 LAPW0 END
 LAPW0 END
@: Expression Syntax.

It seems that the job stopped while executing LAPW0 because WIEN2k could not find, move or delete some files.

We have tried a couple of different compilations (e.g., using exactly what WIEN2k recommends for RP), but these errors persist. We have also searched the WIEN2k mailing list but did not find any related posts.

Does anyone have any idea on this? Your comments will be highly appreciated! 

Thanks,
Wei 
-------------------------------------------
Computational Materials Group
University of Wisconsin-Madison 
209 MS&E Bldg, 1509 University Ave 
Madison, WI 53706-1595
Office: (608)262-2088 
Email: wxie4 at wisc.edu
Web: http://matmodel.engr.wisc.edu/
