[Wien] LAPW2 crashed when running in parallel

Wei Xie wxie4 at wisc.edu
Mon Nov 1 17:35:58 CET 2010


Hi Maxim,

Thanks for the follow-up!

I think -machinefile is the appropriate option here. Here's the relevant line from the help:
-machinefile                 # file mapping procs to machine

No -hostfile option is mentioned in the help for my current version of MPI.
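(As a quick standalone check, mpirun can also be called by hand outside WIEN2k, e.g.

mpirun -np 2 -machinefile .machine1 hostname

where .machine1 is just one of the machine files WIEN2k generated. If this already prints "MPI: invalid option -machinefile", then the mpirun found in the PATH is the culprit rather than the WIEN2k scripts.)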

Yes, the .machine0/1/2 files are exactly as you described.
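For completeness, .machine1 contains eight lines of r1i0n0 and .machine2 eight lines of r1i0n1, i.e. something like

r1i0n0
r1i0n0
... (8 lines in total)

and .machine0 lists all 16 entries for lapw0.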

The content of parallel_options is:
setenv USE_REMOTE 1
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
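Since WIEN_MPIRUN simply calls whatever mpirun comes first in the PATH, a quick way to double-check that it is really the Intel MPI launcher is

which mpirun
mpirun -h

If the path does not point into the Intel MPI 3.2 installation, sourcing its environment script, e.g. (the installation path below is only a guess for our cluster)

source /usr/local/intel/impi/3.2.0.011/bin64/mpivars.csh

before starting WIEN2k should pick up the right mpirun.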

I think the problem is likely due to my MPI setup. However, even if I disable MPI parallelization, the problem persists (there is no evident difference in the output files, including case.dayfile, stdout and :log). Note that we can run with exactly the same set of input files in serial mode with no problem.
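For reference, by "disabling MPI parallelization" I mean a k-point-only .machines of roughly this form; if I read the syntax correctly, each line then starts one serial lapw1/lapw2 job and mpirun is never invoked:

#
1:r1i0n0
1:r1i0n1
granularity:1
extrafine:1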

Again, thanks for your help!

Cheers,
Wei


On Oct 31, 2010, at 11:27 PM, Maxim Rakitin wrote:

> Dear Wei,
> 
> Maybe -machinefile is OK for your mpirun. Which options are appropriate for it? What does the help say?
> 
> Try restoring your MPIRUN variable with -machinefile and rerun the calculation. Then see what is in the .machine0/1/2 files and let us know. They should contain 8 lines of the r1i0n0 node and 8 lines of the r1i0n1 node.
> 
> One more thing you should check is the $WIENROOT/parallel_options file. What is its content?
> Best regards,
>    Maxim Rakitin
>    email: rms85 at physics.susu.ac.ru
>    web: http://www.susu.ac.ru
> 
> 01.11.2010 9:06, Wei Xie wrote:
>> 
>> Hi Maxim,
>> 
>> Thanks for your reply! 
>> We tried MPIRUN=mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_, but the problem persists. The only difference is that stdout changes to "… MPI: invalid option -hostfile …".
>> 
>> Thanks,
>> Wei
>> 
>> 
>> On Oct 31, 2010, at 10:40 PM, Maxim Rakitin wrote:
>> 
>>> Hi,
>>> 
>>> It looks like Intel's mpirun doesn't have a '-machinefile' option. Instead, it has a '-hostfile' option (from here: http://downloadmirror.intel.com/18462/eng/nes_release_notes.txt).
>>> 
>>> Try 'mpirun -h' for information about the available options and apply the appropriate one.
>>> Best regards,
>>>    Maxim Rakitin
>>>    email: rms85 at physics.susu.ac.ru
>>>    web: http://www.susu.ac.ru
>>> 
>>> 01.11.2010 4:56, Wei Xie wrote:
>>>> 
>>>> Dear all WIEN2k community members:
>>>> 
>>>> We encountered a problem when running in parallel (k-point, MPI, or both): the calculations crash at LAPW2. Note that we had no problem running in serial. We have tried to diagnose the problem, recompiled the code with different options, and tested different cases and parameters based on similar problems reported on the mailing list, but the problem persists. So we are writing here hoping someone can offer a suggestion. We have attached the related files below for your reference. Your replies are appreciated in advance! 
>>>> 
>>>> This is a TiC example running in both k-point and MPI parallel mode on two nodes, r1i0n0 and r1i0n1 (8 cores/node):
>>>> 
>>>> 1. stdout (abridged) 
>>>> MPI: invalid option -machinefile
>>>> real 0m0.004s
>>>> user 0m0.000s
>>>> sys 0m0.000s
>>>> ...
>>>> MPI: invalid option -machinefile
>>>> real 0m0.003s
>>>> user 0m0.000s
>>>> sys 0m0.004s
>>>> TiC.scf1up_1: No such file or directory.
>>>> 
>>>> LAPW2 - Error. Check file lapw2.error
>>>> cp: cannot stat `.in.tmp': No such file or directory
>>>> rm: cannot remove `.in.tmp': No such file or directory
>>>> rm: cannot remove `.in.tmp1': No such file or directory
>>>> 
>>>> 2. TiC.dayfile (abridged) 
>>>> ...
>>>>     start  (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
>>>>     cycle 1  (Sun Oct 31 16:25:06 MDT 2010)  (40/99 to go)
>>>> 
>>>> >   lapw0 -p (16:25:06) starting parallel lapw0 at Sun Oct 31 16:25:07 MDT 2010
>>>> -------- .machine0 : 16 processors
>>>> invalid "local" arg: -machinefile
>>>> 
>>>> 0.436u 0.412s 0:04.63 18.1% 0+0k 2600+0io 1pf+0w
>>>> >   lapw1  -up -p    (16:25:12) starting parallel lapw1 at Sun Oct 31 16:25:12 MDT 2010
>>>> ->  starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
>>>> running LAPW1 in parallel mode (using .machines)
>>>> 2 number_of_parallel_jobs
>>>>      r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)      r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)      r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)    Summary of lapw1para:
>>>>    r1i0n0  k=0  user=0  wallclock=0
>>>>    r1i0n1  k=0  user=0  wallclock=0
>>>> ...
>>>> 0.116u 0.316s 0:10.48 4.0% 0+0k 0+0io 0pf+0w
>>>> >   lapw2 -up -p   (16:25:34) running LAPW2 in parallel mode
>>>> **  LAPW2 crashed!
>>>> 0.032u 0.104s 0:01.13 11.5% 0+0k 82304+0io 8pf+0w
>>>> error: command   /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def   failed
>>>> 
>>>> 3. uplapw2.error 
>>>> Error in LAPW2
>>>>  'LAPW2' - can't open unit: 18                                                
>>>>  'LAPW2' -        filename: TiC.vspup                                         
>>>>  'LAPW2' -          status: old          form: formatted                      
>>>> **  testerror: Error in Parallel LAPW2
>>>> 
>>>> 4. .machines
>>>> #
>>>> 1:r1i0n0:8
>>>> 1:r1i0n1:8
>>>> lapw0:r1i0n0:8 r1i0n1:8 
>>>> granularity:1
>>>> extrafine:1
>>>> 
>>>> 5. compilers, MPI and options
>>>> Intel Compilers  and MKL 11.1.046
>>>> Intel MPI 3.2.0.011
>>>> 
>>>> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>>>> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>>>> current:LDFLAGS:$(FOPT) -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
>>>> current:DPARALLEL:'-DParallel'
>>>> current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
>>>> current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -lmkl_scalapack_lp64 /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
>>>> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>>>> 
>>>> Best regards,
>>>> Wei Xie
>>>> Computational Materials Group
>>>> University of Wisconsin-Madison
>>>> 
>>>> 
