[Wien] LAPW2 crashed when running in parallel
Maxim Rakitin
rms85 at physics.susu.ac.ru
Mon Nov 1 18:15:30 CET 2010
Hi Wei,
The LAPW2 crash itself looks secondary: since mpirun rejects its option,
lapw0/lapw1 never run, so TiC.vspup (the unit 18 file in uplapw2.error)
is never written and lapw2 cannot open it. The parallel_options file
controls how the parallel programs are launched, so change the
following line in it:
setenv WIEN_MPIRUN "mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
to
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
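When the lapw[012]para scripts run, the placeholders are substituted
automatically (_NP_ -> number of processes, _HOSTS_ -> the .machineX
file, _EXEC_ -> the program plus its def file), so the command actually
launched would look roughly like this (paths here are just an
illustration):

mpirun -np 8 -machinefile .machine1 /home/xiew/WIEN2k_10/lapw1_mpi lapw1_1.def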
Your .machine0/1/2 files are correct.
Also, I believe the 'USE_REMOTE' variable, which is set to 1, makes the
parallel scripts (I mean lapw[012]para_lapw) launch jobs using
ssh/rsh, so switch it to '0'. I'm not sure about the 'MPI_REMOTE' option,
it's a new one; try different values (0 or 1) for it.
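With these changes your parallel_options would look something like this
(the MPI_REMOTE value is only a guess, as I said, so try both):

setenv USE_REMOTE 0
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"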
Hope this helps.
Best regards,
Maxim Rakitin
email: rms85 at physics.susu.ac.ru
web: http://www.susu.ac.ru
On 01.11.2010 21:35, Wei Xie wrote:
> Hi Maxim,
>
> Thanks for the follow-up!
>
> I think it should be -machinefile that's appropriate. Here's the help:
> -machinefile # file mapping procs to machine
>
> No -hostfile option is mentioned in the help for my current version of MPI.
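> (For reference, a quick check is something like:
>
> mpirun -h | grep -i machinefile
>
> which shows the line quoted above.)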
>
> Yes, the .machine0/1/2 files are exactly as you described.
>
> The content of parallel_options is:
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 1
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
>
> I think the problem is likely due to my MPI. However, even if I disable
> MPI parallelization, the problem persists (no evident difference
> in the output files, including case.dayfile, stdout and :log). Note we
> can run with exactly the same set of input files in serial mode with
> no problem.
>
> Again, thanks for your help!
>
> Cheers,
> Wei
>
>
> On Oct 31, 2010, at 11:27 PM, Maxim Rakitin wrote:
>
>> Dear Wei,
>>
>> Maybe -machinefile is OK for your mpirun. Which options are
>> appropriate for it? What does the help say?
>>
>> Try restoring your MPIRUN variable with -machinefile and rerun the
>> calculation. Then see what is in the .machine0/1/2 files and let us know.
>> It should contain 8 lines of r1i0n0 node and 8 lines of r1i0n1 node.
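>>
>> For instance, with your .machines file I would expect .machine1 to be
>> simply the node name repeated once per core, something like:
>>
>> r1i0n0
>> r1i0n0
>> r1i0n0
>> r1i0n0
>> r1i0n0
>> r1i0n0
>> r1i0n0
>> r1i0n0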
>>
>> One more thing you should check is $WIENROOT/parallel_options file.
>> What is its content?
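>> For example, you could simply paste the output of:
>>
>> cat $WIENROOT/parallel_options
>>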
>> Best regards,
>> Maxim Rakitin
>> email: rms85 at physics.susu.ac.ru
>> web: http://www.susu.ac.ru
>>
>> On 01.11.2010 9:06, Wei Xie wrote:
>>> Hi Maxim,
>>>
>>> Thanks for your reply!
>>> We tried MPIRUN=mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_, but the
>>> problem persists. The only difference is that stdout changes to
>>> "… MPI: invalid option -hostfile …".
>>>
>>> Thanks,
>>> Wei
>>>
>>>
>>> On Oct 31, 2010, at 10:40 PM, Maxim Rakitin wrote:
>>>
>>>> Hi,
>>>>
>>>> It looks like Intel's mpirun doesn't have a '-machinefile' option.
>>>> Instead it has a '-hostfile' option (from here:
>>>> http://downloadmirror.intel.com/18462/eng/nes_release_notes.txt).
>>>>
>>>> Try 'mpirun -h' for information about the options and apply the appropriate one.
>>>> Best regards,
>>>> Maxim Rakitin
>>>> email: rms85 at physics.susu.ac.ru
>>>> web: http://www.susu.ac.ru
>>>>
>>>> On 01.11.2010 4:56, Wei Xie wrote:
>>>>> Dear WIEN2k community members:
>>>>>
>>>>> We encountered a problem when running in parallel
>>>>> (k-point, MPI or both): the calculations crash at LAPW2. Note we
>>>>> have no problem running in serial. We have tried to diagnose the
>>>>> problem, recompiled the code with different options and tested
>>>>> different cases and parameters based on similar problems reported
>>>>> on the mailing list, but the problem persists. So we write here
>>>>> hoping someone can offer us a suggestion. We have attached the
>>>>> related files below for your reference. Your replies are
>>>>> appreciated in advance!
>>>>>
>>>>> This is a TiC example running in both k-point and MPI parallel mode
>>>>> on two nodes, r1i0n0 and r1i0n1 (8 cores/node):
>>>>>
>>>>> 1. stdout (abridged)
>>>>> MPI: invalid option -machinefile
>>>>> real 0m0.004s
>>>>> user 0m0.000s
>>>>> sys 0m0.000s
>>>>> ...
>>>>> MPI: invalid option -machinefile
>>>>> real 0m0.003s
>>>>> user 0m0.000s
>>>>> sys 0m0.004s
>>>>> TiC.scf1up_1: No such file or directory.
>>>>>
>>>>> LAPW2 - Error. Check file lapw2.error
>>>>> cp: cannot stat `.in.tmp': No such file or directory
>>>>> rm: cannot remove `.in.tmp': No such file or directory
>>>>> rm: cannot remove `.in.tmp1': No such file or directory
>>>>> 2. TiC.dayfile (abridged)
>>>>> ...
>>>>> start (Sun Oct 31 16:25:06 MDT 2010) with lapw0 (40/99 to go)
>>>>> cycle 1 (Sun Oct 31 16:25:06 MDT 2010) (40/99 to go)
>>>>>
>>>>> > lapw0 -p (16:25:06) starting parallel lapw0 at Sun Oct 31
>>>>> 16:25:07 MDT 2010
>>>>> -------- .machine0 : 16 processors
>>>>> invalid "local" arg: -machinefile
>>>>>
>>>>> 0.436u 0.412s 0:04.63 18.1% 0+0k 2600+0io 1pf+0w
>>>>> > lapw1 -up -p (16:25:12) starting parallel lapw1 at Sun Oct 31
>>>>> 16:25:12 MDT 2010
>>>>> -> starting parallel LAPW1 jobs at Sun Oct 31 16:25:12 MDT 2010
>>>>> running LAPW1 in parallel mode (using .machines)
>>>>> 2 number_of_parallel_jobs
>>>>> r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
>>>>> r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1 r1i0n1(1)
>>>>> r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0 r1i0n0(1)
>>>>> Summary of lapw1para:
>>>>> r1i0n0 k=0 user=0 wallclock=0
>>>>> r1i0n1 k=0 user=0 wallclock=0
>>>>> ...
>>>>> 0.116u 0.316s 0:10.48 4.0% 0+0k 0+0io 0pf+0w
>>>>> > lapw2 -up -p (16:25:34) running LAPW2 in parallel mode
>>>>> ** LAPW2 crashed!
>>>>> 0.032u 0.104s 0:01.13 11.5%0+0k 82304+0io 8pf+0w
>>>>> error: command /home/xiew/WIEN2k_10/lapw2para -up uplapw2.def
>>>>> failed
>>>>>
>>>>> 3. uplapw2.error
>>>>> Error in LAPW2
>>>>> 'LAPW2' - can't open unit: 18
>>>>> 'LAPW2' - filename: TiC.vspup
>>>>> 'LAPW2' - status: old form: formatted
>>>>> ** testerror: Error in Parallel LAPW2
>>>>>
>>>>> 4. .machines
>>>>> #
>>>>> 1:r1i0n0:8
>>>>> 1:r1i0n1:8
>>>>> lapw0:r1i0n0:8 r1i0n1:8
>>>>> granularity:1
>>>>> extrafine:1
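>>>>>
>>>>> (If I read the .machines syntax correctly, each '1:host:8' line
>>>>> requests one k-point-parallel job running 8 MPI processes on that
>>>>> node, and the lapw0 line runs lapw0_mpi on all 16 cores.)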
>>>>>
>>>>> 5. Compilers, MPI and options
>>>>> Intel Compilers and MKL 11.1.046
>>>>> Intel MPI 3.2.0.011
>>>>>
>>>>> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
>>>>> -traceback
>>>>> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
>>>>> -traceback
>>>>> current:LDFLAGS:$(FOPT)
>>>>> -L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t -pthread
>>>>> current:DPARALLEL:'-DParallel'
>>>>> current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread
>>>>> -lmkl_core -openmp -lpthread -lguide
>>>>> current:RP_LIBS:-L/usr/local/intel/Compiler/11.1/046/mkl/lib/em64t
>>>>> -lmkl_scalapack_lp64
>>>>> /usr/local/intel/Compiler/11.1/046/mkl/lib/em64t/libmkl_solver_lp64.a
>>>>> -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core
>>>>> -lmkl_blacs_intelmpi_lp64 -Wl,--end-group -openmp -lpthread
>>>>> -L/home/xiew/fftw-2.1.5/lib -lfftw_mpi -lfftw $(R_LIBS)
>>>>> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
>>>>>
>>>>> Best regards,
>>>>> Wei Xie
>>>>> Computational Materials Group
>>>>> University of Wisconsin-Madison