[Wien] Problem with wien2k 13.1 parallel for Slurm+intel mpi
Peter Blaha
pblaha at theochem.tuwien.ac.at
Tue Nov 26 17:35:16 CET 2013
Have you checked case.output0000 or case.scf0?
Do they look ok?
Is there a reasonable :DEN line in case.scf0?
If yes, it seems that lapw0_mpi (and thus MPI + fftw2/3) works.
lapw1_mpi requires, besides MPI, also ScaLAPACK. This is included in Intel's MKL
shipped with your ifort compiler.
The most crucial setting is the selection of the BLACS library: Intel
supplies dedicated BLACS libraries for Intel MPI, Open MPI, or MVAPICH,
and you must be sure to have linked the correct one into lapw1.
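A quick way to verify which BLACS library lapw1_mpi was actually linked against is to inspect the binary. A minimal sketch, assuming $WIENROOT points to your WIEN2k installation and the executable was linked dynamically (with static MKL linking, check the link line in the Makefile instead):

```shell
# Hypothetical diagnostic; assumes $WIENROOT is set and lapw1_mpi is
# dynamically linked. For static builds, inspect the RP_LIBS link line
# used by siteconfig instead.
ldd $WIENROOT/lapw1_mpi | grep -i -E 'blacs|scalapack|mpi'
# The BLACS library shown should match your MPI flavor (e.g. an
# *intelmpi* BLACS for Intel MPI). A mismatch, such as an Open MPI
# BLACS linked against Intel MPI, typically produces exactly the
# "Invalid communicator" crash in PMPI_Comm_size seen below.
```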
PS: I assume you have been able to run this without mpi-parallelization,
in sequential mode?
And I also assume you could run it in k-parallel mode?
PPS: 6 cores is not a good choice for lapw1! Try to use a square number of
cores like 16, 64, ....
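Putting the advice of this thread together, a Slurm submission could look like the sketch below: one batch script runs the single runsp_lapw command (not via srun -n 6), and WIEN2k itself spawns the mpi processes according to .machines. The helper logic and core count are hypothetical; adapt them to your cluster:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16   # a square number of cores for lapw1_mpi

# Build .machines for the node Slurm assigned us (hypothetical sketch;
# multi-node jobs would iterate over $SLURM_JOB_NODELIST instead).
node=$(hostname)
cat > .machines <<EOF
lapw0:$node:16
1:$node:16
granularity:1
extrafine:1
EOF

# Start the SCF cycle ONCE, with the -p flag; WIEN2k launches the
# mpi-parallel lapw0/lapw1/lapw2 jobs itself via the mpirun command
# configured in siteconfig.
runsp_lapw -NI -cc 0.0001 -i 50 -p
```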
On 26.11.2013 14:03, Natalia Pavlenko wrote:
> Dear Prof. Blaha,
>
> thanks a lot for your reply. I have corrected the .machines file
> (the node with 6 cores is automatically chosen):
> -----------------
> lapw0: alcc92:6
> 1:alcc92:6
> granularity:1
> extrafine:1
> -----------------
> but nevertheless got the following output in case.dayfile:
> -------case.dayfile--------
>
> Calculating case in /alcc/gpfs1/home/exp6/pavlenna/work/laosto/ovac/case
> on alcc92 with PID 9804
> using WIEN2k_13.1 (Release 17/6/2013) in /alcc/gpfs1/home/exp6/pavlenna/wien
>
>
> start (Tue Nov 26 13:41:14 CET 2013) with lapw0 (50/99 to go)
>
> cycle 1 (Tue Nov 26 13:41:14 CET 2013) (50/99 to go)
>
>> lapw0 -p (13:41:15) starting parallel lapw0 at Tue Nov 26 13:41:15 CET 2013
> -------- .machine0 : 6 processors
> 0.024u 0.024s 0:12.00 0.3% 0+0k 1632+8io 6pf+0w
>> lapw1 -up -p (13:41:27) starting parallel lapw1 at Tue Nov 26 13:41:27 CET 2013
> -> starting parallel LAPW1 jobs at Tue Nov 26 13:41:27 CET 2013
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.016u 0.004s 0:00.75 1.3% 0+0k 0+8io 0pf+0w
> Summary of lapw1para:
> alcc92 k=0 user=0 wallclock=0
> 0.068u 0.020s 0:02.19 3.6% 0+0k 0+104io 0pf+0w
>> lapw1 -dn -p (13:41:29) starting parallel lapw1 at Tue Nov 26 13:41:29 CET 2013
> -> starting parallel LAPW1 jobs at Tue Nov 26 13:41:29 CET 2013
> running LAPW1 in parallel mode (using .machines.help)
> 1 number_of_parallel_jobs
> alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.020u 0.004s 0:00.42 4.7% 0+0k 0+8io 0pf+0w
> Summary of lapw1para:
> alcc92 k=0 user=0 wallclock=0
> 0.072u 0.028s 0:02.11 4.2% 0+0k 0+104io 0pf+0w
>> lapw2 -up -p (13:41:31) running LAPW2 in parallel mode
> ** LAPW2 crashed!
> 0.248u 0.012s 0:00.73 34.2% 0+0k 8+16io 0pf+0w
> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lapw2para -up uplapw2.def failed
>
>> stop error
> ---------------------------------
> In the uplapw2.err I have the following error messages:
>
> Error in LAPW2
> 'LAPW2' - can't open unit: 30
> 'LAPW2' - filename: case.energyup_1
> ** testerror: Error in Parallel LAPW2
> -----------------
> and the following error output messages:
>
> ------------------
> starting on alcc92
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> case.scf1up_1: No such file or directory.
> grep: No match.
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> case.scf1dn_1: No such file or directory.
> grep: No match.
> FERMI - Error
> cp: cannot stat `.in.tmp': No such file or directory
>
>> stop error
> -----------------------------------
> Please let me know whether something might be wrong in the MPI configuration;
> Intel MPI is installed on the cluster.
>
>
> Best regards, N.Pavlenko
>
>
> On 2013-11-23 13:07, Peter Blaha wrote:
>> You completely misunderstand how parallelization in wien2k works.
>> Please read the UG (parallelization section) carefully, and note the
>> k-parallel and mpi-parallel options and for which cases each is useful.
>>
>> I'm not familiar with "Slurm", but it looks as if you ran the same
>> sequential job 6 times in parallel, overwriting the generated files each time.
>>
>>> I have a problem with parallel run of Wien2k 13.1 on a cluster
>>> with Slurm Environment+ Intel mpi.
>>> In a test run for 1 node with 6 cpu cores, I
>>> generated the following .machines file:
>>>
>>> -------.machines
>>> #
>>> lapw0:alcc69
>>> 1:alcc69:6
>>> granularity:1
>>> extrafine:1
>>
>> This is ok, except for the lapw0 line, which would not run in parallel. Use
>>
>> lapw0:alcc69:6
>>
>>> and used the following command in the script:
>>> srun -n 6 runsp_lapw -NI -cc 0.0001 -i 50
>>
>> You are running "runsp_lapw ..." 6 times.
>>
>> wien2k spawns its parallelization itself (provided you have properly
>> installed wien2k and specified the proper "mpirun ..." command during
>> siteconfig), but you must add the -p flag.
>>
>> So the single command
>>
>> runsp_lapw -NI -cc 0.0001 -i 50 -p
>>
>> should itself start 6 parallel jobs (mpi-parallel, with your .machines file).
>> (You only need to have permission to do so.)
>>
>>>
>>> In the first cycle, the lapw0,lapw1 and lapw2 are successfully
>>> finished, but after that lcore and mixer continue to run in parallel mode,
>>> they intermix with lapw0 from the second cycle and cause a crash,
>>> which can be seen from the output in case.dayfile:
>>>
>>> --------------------------------------------------------------
>>> cycle 1 (Fri Nov 22 15:32:51 CET 2013) (50/99 to go)
>>>
>>>> lapw0 (15:32:51) > lapw0 (15:32:51) > lapw0 (15:32:51) > lapw0 (15:32:51) > lapw0
>>> (15:32:51) > lapw0 (15:32:51) 44.798u 0.244s 0:45.75 98.4% 0+0k 0+0io 0pf+0w
>>>> lapw1 -up (15:33:37)
>>>> lapw1 -up (15:33:38)
>>>> lapw1 -up (15:33:38)
>>>> lapw1 -up (15:33:39)
>>>> lapw1 -up (15:33:39)
>>>> lapw1 -up (15:33:39) _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>>> lapw1 -dn (16:12:48)
>>>> lapw1 -dn (16:13:25)
>>>> lapw1 -dn (16:13:29)
>>>> lapw1 -dn (16:13:30)
>>>> lapw1 -dn (16:13:42)
>>>> lapw1 -dn (16:13:47) _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>>> lapw2 -up (17:07:01)
>>>> lapw2 -up (17:07:57)
>>>> lapw2 -up (17:08:44)
>>>> lapw2 -up (17:08:52)
>>>> lapw2 -dn (17:09:00)
>>>> lapw2 -up (17:09:01)
>>>> lapw2 -up (17:09:02)
>>>> lapw2 -dn (17:09:52)
>>>> lapw2 -dn (17:10:40)
>>>> lapw2 -dn (17:10:56)
>>>> lapw2 -dn (17:11:03)
>>>> lapw2 -dn (17:11:13)
>>>> lcore -up (17:11:40) 0.124u 0.024s 0:00.33 42.4% 0+0k 0+0io 0pf+0w
>>>> lcore -dn (17:11:41) 0.120u 0.024s 0:00.30 46.6% 0+0k 0+0io 0pf+0w
>>>> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io 0pf+0w
>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
>>>
>>>> stop error
>>>> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io 0pf+0w
>>>> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io 0pf+0w
>>>> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io 0pf+0w
>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
>>>
>>>> stop error
>>>> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io 0pf+0w
>>>> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io 0pf+0w
>>>> mixer (17:12:16) 0.680u 0.132s 0:02.28 35.5% 0+0k 0+0io 0pf+0w
>>> :ENERGY convergence: 0 0 0
>>> :CHARGE convergence: 0 0.0001 0
>>>
>>> cycle 2 (Fri Nov 22 17:12:18 CET 2013) (49/98 to go)
>>>
>>>> lapw0 (17:12:18)
>>>> lcore -up (17:12:58) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
>>>
>>>> stop error
>>>> lcore -up (17:13:02) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io 0pf+0w
>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
>>>
>>>> stop error
>>> ------------------------------------------------------------------------------
>>>
>>> It looks like the .machines file needs some additional details about
>>> the calculation mode for lcore and mixer. How do I configure
>>> .machines properly in this case?
>>>
>>>
>>> Best regards, N.Pavlenko
>>>
>>>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
--
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------