[Wien] LAPW2 crashed when running in parallel

Wei Xie wxie4 at wisc.edu
Mon Nov 1 05:06:26 CET 2010


Hi Maxim,

Thanks for your reply! 
We tried MPIRUN='mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_', but the problem persists; the only difference is that stdout now shows "... MPI: invalid option -hostfile ...".
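
For completeness, here is a quick way to check which launcher is actually being invoked and which options it understands (plain shell; the exact output of course depends on the MPI installation):

    which mpirun     # path of the launcher the WIEN2k scripts will call
    echo $PATH       # is the Intel MPI bin directory ahead of any system MPI?
    mpirun -h        # the option list this particular mpirun accepts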

Thanks,
Wei


On Oct 31, 2010, at 10:40 PM, Maxim Rakitin wrote:

> Hi,
> 
> It looks like Intel's mpirun doesn't have a '-machinefile' option; instead it has a '-hostfile' option (from here: http://downloadmirror.intel.com/18462/eng/nes_release_notes.txt).
> 
> Try 'mpirun -h' for information about the available options and apply the appropriate one.
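> 
> If the local launcher does turn out to accept '-hostfile', the change would go into $WIENROOT/parallel_options (or through the 'Parallel execution' step of siteconfig_lapw). A minimal sketch, assuming the csh 'setenv' syntax that file normally uses:
> 
>     setenv WIEN_MPIRUN "mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
> 
> _NP_, _HOSTS_ and _EXEC_ are the placeholders the WIEN2k parallel scripts fill in at run time.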
> Best regards,
>    Maxim Rakitin
>    email: rms85 at physics.susu.ac.ru
>    web: http://www.susu.ac.ru
> 
> 01.11.2010 4:56, Wei Xie wrote:
>> 
>> Dear all WIEN2k community members:
>> 
>> We encountered a problem when running in parallel (k-point, MPI, or both): the calculations crash at LAPW2, although the same case runs fine in serial. We have tried to diagnose the problem, recompiled the code with different options, and tested different cases and parameters based on similar problems reported on the mailing list, but the problem persists, so we are writing here in the hope that someone can offer a suggestion. The related files are attached below for your reference; your replies are appreciated in advance!
>> 
>> This is a TiC example run with both k-point and MPI parallelization on two nodes, r1i0n0 and r1i0n1 (8 cores/node):
>> 
>> 1. stdout (abridged) 
>> MPI: invalid option -machinefile
>> real 0m0.004s
>> user 0m0.000s
>> sys 0m0.000s
>> ...
>> MPI: invalid option -machinefile
>> real 0m0.003s
>> user 0m0.000s
>> sys 0m0.004s
>> TiC.scf1up_1: No such file or directory.
>> 
>> LAPW2 - Error. Check file lapw2.error
>> cp: cannot stat `.in.tmp': No such file or directory
>> rm: cannot remove `.in.tmp': No such file or directory
>> rm: cannot remove `.in.tmp1': No such file or directory
>> 
>> 2. TiC.dayfile (abridged) 
>> ...
>>     start  (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
>>     cycle 1  (Sun Oct 31 16:25:06 MDT 2010)  (40/99 to go)
>> 
>> >   lapw0 -p (16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 MDT 2010
>> -------- .machine0 : 16 processors
>> invalid "local" arg: -machinefile
>> 
>> 0.436u 0.412s 0:04.63 18.1% 0+0k 2600+0io 1pf+0w
>> >   lapw1  -up -p    (16:25:12) starting parallel lapw1 at Sun Oct 31 16:25:12 MDT 2010
>> ->  starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
>> running LAPW1 in parallel mode (using .machines)
>> 2 number_of_parallel_jobs
>>      r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)      r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)      r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)    Summary of lapw1para:
>>    r1i0n0  k=0  user=0  wallclock=0
>>    r1i0n1  k=0  user=0  wallclock=0
>> ...
>> 0.116u 0.316s 0:10.48 4.0% 0+0k 0+0io 0pf+0w
>> >   lapw2 -up -p   (16:25:34) running LAPW2 in parallel mode
>> **  LAPW2 crashed!
>> 0.032u 0.104s 0:01.13 11.5% 0+0k 82304+0io 8pf+0w
>> error: command   /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def   failed
>> 
>> 3. uplapw2.error 
>> Error in LAPW2
>>  'LAPW2' - can't open unit: 18                                                
>>  'LAPW2' -        filename: TiC.vspup                                         
>>  'LAPW2' -          status: old          form: formatted                      
>> **  testerror: Error in Parallel LAPW2
>> 
>> 4. .machines
>> #
>> 1:r1i0n0:8
>> 1:r1i0n1:8
>> lapw0:r1i0n0:8 r1i0n1:8 
>> granularity:1
>> extrafine:1
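>> 
>> (For reference: '1:r1i0n0:8' requests an 8-core MPI job for one group of k-points, and the 'lapw0:' line runs lapw0 itself under MPI. A k-point-only variant, one single-core job per line with no mpirun involved, can be used to check whether only the MPI launch is failing; a sketch:)
>> 
>> #
>> 1:r1i0n0
>> 1:r1i0n1
>> granularity:1
>> extrafine:1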
>> 
>> 5. compilers, MPI and options
>> Intel Compilers and MKL 11.1.046
>> Intel MPI 3.2.0.011
>> 
>> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>> current:LDFLAGS:$(FOPT) -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
>> current:DPARALLEL:'-DParallel'
>> current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
>> current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -lmkl_scalapack_lp64 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
>> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
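>> 
>> For the record, _NP_ is replaced by the number of cores for the job, _HOSTS_ by a generated .machineN file, and _EXEC_ by the MPI executable plus its .def file, so for the lapw0 step in the dayfile above this template should expand to roughly (illustrative values, reconstructed rather than captured):
>> 
>>     mpirun -np 16 -machinefile .machine0 /home/xiew/WIEN2k_10/lapw0_mpi lapw0.def
>> 
>> which is where the '-machinefile' complaints in the output come from.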
>> 
>> Best regards,
>> Wei Xie
>> Computational Materials Group
>> University of Wisconsin-Madison
>> 
>> 
