[Wien] LAPW2 crashed when running in parallel

Wei Xie wxie4 at wisc.edu
Mon Nov 1 00:56:47 CET 2010


Dear all WIEN2k community members:

We have encountered a problem when running WIEN2k in parallel (k-point, MPI, or both): the calculations crash at LAPW2, while the same cases run fine in serial. We have tried to diagnose the problem, recompiled the code with different options, and tested different cases and parameters based on similar problems reported on the mailing list, but the problem persists. So we are writing here in the hope that someone can offer a suggestion. The relevant files are attached below for reference. Your replies are appreciated in advance!

This is a TiC example run with both k-point and MPI parallelization on two nodes, r1i0n0 and r1i0n1 (8 cores/node):

1. stdout (abridged) 
MPI: invalid option -machinefile
real	0m0.004s
user	0m0.000s
sys	0m0.000s
...
MPI: invalid option -machinefile
real	0m0.003s
user	0m0.000s
sys	0m0.004s
TiC.scf1up_1: No such file or directory.

LAPW2 - Error. Check file lapw2.error
cp: cannot stat `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory

2. TiC.dayfile (abridged) 
...
    start 	(Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
    cycle 1 	(Sun Oct 31 16:25:06 MDT 2010) 	(40/99 to go)

>   lapw0 -p	(16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 MDT 2010
-------- .machine0 : 16 processors
invalid "local" arg: -machinefile

0.436u 0.412s 0:04.63 18.1%	0+0k 2600+0io 1pf+0w
>   lapw1  -up -p   	(16:25:12) starting parallel lapw1 at Sun Oct 31 16:25:12 MDT 2010
->  starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
running LAPW1 in parallel mode (using .machines)
2 number_of_parallel_jobs
     r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
     r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)
     r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
   Summary of lapw1para:
   r1i0n0	 k=0	 user=0	 wallclock=0
   r1i0n1	 k=0	 user=0	 wallclock=0
...
0.116u 0.316s 0:10.48 4.0%	0+0k 0+0io 0pf+0w
>   lapw2 -up -p  	(16:25:34) running LAPW2 in parallel mode
**  LAPW2 crashed!
0.032u 0.104s 0:01.13 11.5%	0+0k 82304+0io 8pf+0w
error: command   /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def   failed

3. uplapw2.error 
Error in LAPW2
 'LAPW2' - can't open unit: 18                                                
 'LAPW2' -        filename: TiC.vspup                                         
 'LAPW2' -          status: old          form: formatted                      
**  testerror: Error in Parallel LAPW2

(We suspect the missing TiC.vspup is only a follow-on error: that file is written by lapw0, and parallel lapw0 had already hit the -machinefile problem above, so it may never have been created.)

4. .machines
#
1:r1i0n0:8
1:r1i0n1:8
lapw0:r1i0n0:8 r1i0n1:8 
granularity:1
extrafine:1
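
To take mpirun out of the picture entirely, we also plan a pure k-point parallel test. If we read the usersguide correctly, a .machines along these lines (one line per k-point job, no :8 processor counts and no lapw0 line, sketched here with only 2 jobs per node rather than the full 8) should never invoke MPI at all:

# pure k-point parallel sketch: one line per parallel job, no MPI
1:r1i0n0
1:r1i0n0
1:r1i0n1
1:r1i0n1
granularity:1
extrafine:1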

5. compilers, MPI and options
Intel Compilers and MKL 11.1.046
Intel MPI 3.2.0.011

current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
current:LDFLAGS:$(FOPT) -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -lmkl_scalapack_lp64 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
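
Given the "MPI: invalid option -machinefile" and invalid "local" arg messages above, we suspect the mpirun found first in PATH on the compute nodes is not Intel MPI's (the r1i0nX hostnames suggest an SGI machine, whose native mpirun does not understand -machinefile). As a quick check we intend to run something like the following on a compute node (a sketch: 'hostname' merely stands in for a WIEN2k binary, and .machine1 should be one of the host files generated by lapw1para):

which mpirun
mpirun -np 2 -machinefile .machine1 hostname

If this reproduces the invalid-option message, presumably MPIRUN (or PATH) needs to point at the Intel MPI launcher, rather than WIEN2k needing a rebuild.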

Best regards,
Wei Xie
Computational Materials Group
University of Wisconsin-Madison
