[Wien] Problem with wien2k 13.1 parallel for Slurm+intel mpi
Natalia Pavlenko
natalia.pavlenko at physik.uni-augsburg.de
Thu Nov 28 11:52:31 CET 2013
Dear Prof. Blaha, dear Prof. Marks,
thanks a lot for your help. The files case.output0000 and case.scf0 were
correct, and the density in scf0 was reasonable.
I have changed the BLACS library and recompiled all MPI programs; now
it works well.
Here are my compilation options from the OPTIONS file (Slurm
cluster + Intel MPI):
current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback -assume buffered_io
current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback -assume buffered_io
current:FFTW_OPT:-DFFTW3 -I/alcc/gpfs1/home/exp6/pavlenna/lib/fftw/fftw-3.3.3/include
current:FFTW_LIBS:-lfftw3_mpi -lfftw3 -L/alcc/gpfs1/home/exp6/pavlenna/lib/fftw/fftw-3.3.3/lib
current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
current:DPARALLEL:'-DParallel'
current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread
current:RP_LIBS:-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 $(R_LIBS)
current:MPIRUN:srun -n _NP_
current:MKL_TARGET_ARCH:intel64
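
For reference, a minimal sketch of a Slurm batch script that matches these settings (job name and core counts are placeholders, to be adapted to the cluster): with MPIRUN set to "srun -n _NP_", the script only needs to build a .machines file from the allocated nodes and then call runsp_lapw once with the -p switch.

-----------------
#!/bin/bash
#SBATCH --job-name=wien2k-test       # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16         # sets SLURM_NTASKS_PER_NODE; square count, see the remark on lapw1 below

cd $SLURM_SUBMIT_DIR

# build .machines from the Slurm allocation
rm -f .machines
echo '#' > .machines
first=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -1)
echo "lapw0:$first:$SLURM_NTASKS_PER_NODE" >> .machines
for h in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    # one mpi-parallel lapw1/lapw2 job per allocated node
    echo "1:$h:$SLURM_NTASKS_PER_NODE" >> .machines
done
echo 'granularity:1' >> .machines
echo 'extrafine:1' >> .machines

# a single call; WIEN2k starts the mpi jobs itself via MPIRUN
runsp_lapw -NI -cc 0.0001 -i 50 -p
-----------------
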
Best regards, Natalia Pavlenko
On 2013-11-26 17:35, Peter Blaha wrote:
> Have you checked case.output0000 or case.scf0?
> Do they look ok?
> Is there a reasonable :DEN line in scf0?
>
> If yes, it seems that lapw0_mpi (and thus mpi + fftw2/3) works.
>
> lapw1_mpi requires, besides MPI, also ScaLAPACK. This is included in
> Intel's MKL with your ifort compiler.
> The most crucial setting is the selection of the BLACS library: Intel
> supplies special BLACS libraries for Intel MPI, Open MPI, or MVAPICH,
> and you must be sure to have linked the correct one in lapw1.
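>
> As a sketch, in WIEN2k's OPTIONS notation the RP_LIBS link line would
> differ only in that one library (the exact names depend on the installed
> MKL version and should be checked against Intel's documentation):
>
> RP_LIBS for Intel MPI: -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 $(R_LIBS)
> RP_LIBS for Open MPI:  -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 $(R_LIBS)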
>
> PS: I assume you have been able to run this without
> mpi-parallelization, in sequential mode?
> And I also assume you could run it in k-parallel mode?
>
> PPS: 6 cores is not a good choice for lapw1! Try to use a square
> number of cores, like 16, 64, ...
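>
> For illustration only (the node alcc92 below has just 6 cores), on a
> hypothetical node with 16 cores the .machines file would become
>
> lapw0:alcc92:16
> 1:alcc92:16
> granularity:1
> extrafine:1
>
> so that ScaLAPACK can work on a square (4x4) process grid.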
>
> On 26.11.2013 14:03, Natalia Pavlenko wrote:
>> Dear Prof. Blaha,
>>
>> thanks a lot for your reply. I have corrected the .machines file
>> (the node with 6 cores is automatically chosen):
>> -----------------
>> lapw0: alcc92:6
>> 1:alcc92:6
>> granularity:1
>> extrafine:1
>> -----------------
>> but nevertheless got the following output in case.dayfile:
>> -------case.dayfile--------
>>
>> Calculating case in
>> /alcc/gpfs1/home/exp6/pavlenna/work/laosto/ovac/case
>> on alcc92 with PID 9804
>> using WIEN2k_13.1 (Release 17/6/2013) in
>> /alcc/gpfs1/home/exp6/pavlenna/wien
>>
>>
>> start (Tue Nov 26 13:41:14 CET 2013) with lapw0 (50/99 to
>> go)
>>
>> cycle 1 (Tue Nov 26 13:41:14 CET 2013) (50/99 to go)
>>
>>> lapw0 -p (13:41:15) starting parallel lapw0 at Tue Nov 26
>>> 13:41:15 CET 2013
>> -------- .machine0 : 6 processors
>> 0.024u 0.024s 0:12.00 0.3% 0+0k 1632+8io 6pf+0w
>>> lapw1 -up -p (13:41:27) starting parallel lapw1 at Tue Nov
>>> 26 13:41:27 CET 2013
>> -> starting parallel LAPW1 jobs at Tue Nov 26 13:41:27 CET 2013
>> running LAPW1 in parallel mode (using .machines)
>> 1 number_of_parallel_jobs
>> alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.016u 0.004s
>> 0:00.75 1.3% 0+0k 0+8io 0pf+0w
>> Summary of lapw1para:
>> alcc92 k=0 user=0 wallclock=0
>> 0.068u 0.020s 0:02.19 3.6% 0+0k 0+104io 0pf+0w
>>> lapw1 -dn -p (13:41:29) starting parallel lapw1 at Tue Nov
>>> 26 13:41:29 CET 2013
>> -> starting parallel LAPW1 jobs at Tue Nov 26 13:41:29 CET 2013
>> running LAPW1 in parallel mode (using .machines.help)
>> 1 number_of_parallel_jobs
>> alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.020u 0.004s
>> 0:00.42 4.7% 0+0k 0+8io 0pf+0w
>> Summary of lapw1para:
>> alcc92 k=0 user=0 wallclock=0
>> 0.072u 0.028s 0:02.11 4.2% 0+0k 0+104io 0pf+0w
>>> lapw2 -up -p (13:41:31) running LAPW2 in parallel mode
>> ** LAPW2 crashed!
>> 0.248u 0.012s 0:00.73 34.2% 0+0k 8+16io 0pf+0w
>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lapw2para -up
>> uplapw2.def failed
>>
>>> stop error
>> ---------------------------------
>> In the uplapw2.err I have the following error messages:
>>
>> Error in LAPW2
>> 'LAPW2' - can't open unit: 30
>> 'LAPW2' - filename: case.energyup_1
>> ** testerror: Error in Parallel LAPW2
>> -----------------
>> and the following messages in the error output:
>>
>> ------------------
>> starting on alcc92
>> LAPW0 END
>> LAPW0 END
>> LAPW0 END
>> LAPW0 END
>> LAPW0 END
>> LAPW0 END
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> case.scf1up_1: No such file or directory.
>> grep: No match.
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
>> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
>> PMPI_Comm_size(76).: Invalid communicator
>> case.scf1dn_1: No such file or directory.
>> grep: No match.
>> FERMI - Error
>> cp: cannot stat `.in.tmp': No such file or directory
>>
>>> stop error
>> -----------------------------------
>> Please let me know if something might be wrong in the MPI
>> configuration.
>> I have Intel MPI installed on the cluster.
>>
>>
>> Best regards, N.Pavlenko
>>
>>
>> On 2013-11-23 13:07, Peter Blaha wrote:
>>> You completely misunderstand how parallelization in WIEN2k works.
>>> Please read the UG carefully (parallelization); also note the
>>> k-parallel and mpi-parallel options and for which cases they are
>>> useful.
>>>
>>> I'm not familiar with "Slurm", but it looks as if you ran the same
>>> sequential job 6 times in parallel, overwriting the generated files
>>> all the time.
>>>
>>>> I have a problem with a parallel run of WIEN2k 13.1 on a cluster
>>>> with the Slurm environment + Intel MPI.
>>>> In a test run on 1 node with 6 CPU cores, I
>>>> generated the following .machines file:
>>>>
>>>> -------.machines
>>>> #
>>>> lapw0:alcc69
>>>> 1:alcc69:6
>>>> granularity:1
>>>> extrafine:1
>>>
>>> This is ok, except for the lapw0 line, which would not run in
>>> parallel. Use
>>>
>>> lapw0:alcc69:6
>>>
>>>> and used the following command in the script:
>>>> srun -n 6 runsp_lapw -NI -cc 0.0001 -i 50
>>>
>>> You are running "runsp_lapw ..." 6 times.
>>>
>>> WIEN2k spawns its parallelization itself (provided you have properly
>>> installed WIEN2k and specified the proper "mpirun ..." command during
>>> siteconfig), but you must add the -p flag.
>>>
>>> So the single command
>>>
>>> runsp_lapw -NI -cc 0.0001 -i 50 -p
>>>
>>> should itself start 6 parallel jobs (mpi-parallel, with your
>>> .machines file).
>>> (You only need to have permission to do so.)
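>>>
>>> As a sketch, the difference in the batch script is just this one line
>>> (with your flags kept as they are):
>>>
>>>   srun -n 6 runsp_lapw -NI -cc 0.0001 -i 50    # starts 6 independent copies
>>>   runsp_lapw -NI -cc 0.0001 -i 50 -p           # one call; WIEN2k launches the mpi jobs via MPIRUN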
>>>
>>>>
>>>> In the first cycle, lapw0, lapw1 and lapw2 finish successfully,
>>>> but after that lcore and mixer continue to run in parallel mode;
>>>> they intermix with lapw0 from the second cycle and cause a crash,
>>>> which can be seen from the output in case.dayfile:
>>>>
>>>> --------------------------------------------------------------
>>>> cycle 1 (Fri Nov 22 15:32:51 CET 2013) (50/99 to go)
>>>>
>>>>> lapw0 (15:32:51) > lapw0 (15:32:51) > lapw0
>>>>> (15:32:51) > lapw0 (15:32:51) > lapw0
>>>> (15:32:51) > lapw0 (15:32:51) 44.798u 0.244s 0:45.75
>>>> 98.4% 0+0k 0+0io 0pf+0w
>>>>> lapw1 -up (15:33:37)
>>>>> lapw1 -up (15:33:38)
>>>>> lapw1 -up (15:33:38)
>>>>> lapw1 -up (15:33:39)
>>>>> lapw1 -up (15:33:39)
>>>>> lapw1 -up (15:33:39) _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>>> lapw1 -dn (16:12:48)
>>>>> lapw1 -dn (16:13:25)
>>>>> lapw1 -dn (16:13:29)
>>>>> lapw1 -dn (16:13:30)
>>>>> lapw1 -dn (16:13:42)
>>>>> lapw1 -dn (16:13:47) _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>> _nb in dscgst.F 512 128
>>>>> lapw2 -up (17:07:01)
>>>>> lapw2 -up (17:07:57)
>>>>> lapw2 -up (17:08:44)
>>>>> lapw2 -up (17:08:52)
>>>>> lapw2 -dn (17:09:00)
>>>>> lapw2 -up (17:09:01)
>>>>> lapw2 -up (17:09:02)
>>>>> lapw2 -dn (17:09:52)
>>>>> lapw2 -dn (17:10:40)
>>>>> lapw2 -dn (17:10:56)
>>>>> lapw2 -dn (17:11:03)
>>>>> lapw2 -dn (17:11:13)
>>>>> lcore -up (17:11:40) 0.124u 0.024s 0:00.33 42.4% 0+0k 0+0io
>>>>> 0pf+0w
>>>>> lcore -dn (17:11:41) 0.120u 0.024s 0:00.30 46.6% 0+0k 0+0io
>>>>> 0pf+0w
>>>>> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io
>>>>> 0pf+0w
>>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer
>>>> mixer.def failed
>>>>
>>>>> stop error
>>>>> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io
>>>>> 0pf+0w
>>>>> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io
>>>>> 0pf+0w
>>>>> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io
>>>>> 0pf+0w
>>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer
>>>> mixer.def failed
>>>>
>>>>> stop error
>>>>> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io
>>>>> 0pf+0w
>>>>> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io
>>>>> 0pf+0w
>>>>> mixer (17:12:16) 0.680u 0.132s 0:02.28 35.5% 0+0k 0+0io
>>>>> 0pf+0w
>>>> :ENERGY convergence: 0 0 0
>>>> :CHARGE convergence: 0 0.0001 0
>>>>
>>>> cycle 2 (Fri Nov 22 17:12:18 CET 2013) (49/98 to go)
>>>>
>>>>> lapw0 (17:12:18)
>>>>> lcore -up (17:12:58) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io
>>>>> 0pf+0w
>>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore
>>>> uplcore.def failed
>>>>
>>>>> stop error
>>>>> lcore -up (17:13:02) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io
>>>>> 0pf+0w
>>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore
>>>> uplcore.def failed
>>>>
>>>>> stop error
>>>> ------------------------------------------------------------------------------
>>>>
>>>> It looks like the .machines file needs some additional details about
>>>> the calculation mode for lcore and mixer. How do I configure the
>>>> .machines file properly in this case?
>>>>
>>>>
>>>> Best regards, N.Pavlenko
>>>>
>>>>
>>