[Wien] Problem with wien2k 13.1 parallel for Slurm+intel mpi
Laurence Marks
L-marks at northwestern.edu
Tue Nov 26 15:50:47 CET 2013
The "PMPI_Comm_size: Invalid communicator, error stack" is almost
always due to issues with how the mpi version was compiled and linked.
Common issues include:
1) Not using the ifort/icc mpi compilers.
2) Not using the correct linking options for the flavor of mpi that
you are using
3) Problems with the infiniband or similar drivers on the system
(rare, but not unknown).
For 1), please check that the mpif77 (or mpif90) you used is the Intel
one -- you may need to source the scripts in the Intel bin directory
to set these up correctly.
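For example (a rough sketch only -- the install paths and version numbers
below are placeholders for whatever is on your cluster), you can verify the
wrappers and set up the environment with something like:

  which mpif90 mpiifort
  mpif90 -show     # prints the underlying compiler; for an ifort-built
  mpiifort -show   # wien2k it should be ifort, not gfortran

  # if the wrappers are missing or wrap gcc/gfortran, source the Intel
  # environment scripts first (adjust paths/versions to your installation):
  source /opt/intel/composerxe/bin/compilervars.sh intel64
  source /opt/intel/impi/4.1.1/bin64/mpivars.sh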
For 2), the Intel MKL Link Line Advisor at
http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
is useful.
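For instance, for ifort + Intel MPI with ScaLAPACK (LP64, threaded MKL) the
advisor produces a link line roughly like the following (the exact names
depend on your MKL version and MKLROOT):

  -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 \
    -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 \
    -liomp5 -lpthread -lm

Note in particular that the BLACS library must match your mpi flavor
(mkl_blacs_intelmpi_lp64 for Intel MPI, mkl_blacs_openmpi_lp64 for Open
MPI); linking the wrong one is a classic way to get exactly this
"Invalid communicator" failure.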
N.B., I suspect that lapw0_mpi did not run, and neither did lapw1_mpi.
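For reference, a minimal Slurm batch sketch along the lines Peter Blaha
describes in the quoted message below (build .machines from the Slurm
allocation, then call runsp_lapw ONCE with -p, not via srun; the #SBATCH
options are only an illustration for a single 6-core node) could look like:
-----------------
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6

# build a .machines file for one mpi-parallel job on the allocated node
node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -1)
echo "lapw0:$node:6" >  .machines
echo "1:$node:6"     >> .machines
echo "granularity:1" >> .machines
echo "extrafine:1"   >> .machines

# wien2k launches the mpi processes itself via the mpirun command set in
# siteconfig, so runsp_lapw is started only once, with the -p flag
runsp_lapw -NI -cc 0.0001 -i 50 -p
-----------------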
On Tue, Nov 26, 2013 at 7:03 AM, Natalia Pavlenko
<natalia.pavlenko at physik.uni-augsburg.de> wrote:
> Dear Prof. Blaha,
>
> thanks a lot for your reply. I have corrected the .machines file
> (the node with 6 cores is automatically chosen):
> -----------------
> lapw0: alcc92:6
> 1:alcc92:6
> granularity:1
> extrafine:1
> -----------------
> but nevertheless got the following output in case.dayfile:
> -------case.dayfile--------
>
> Calculating case in
> /alcc/gpfs1/home/exp6/pavlenna/work/laosto/ovac/case
> on alcc92 with PID 9804
> using WIEN2k_13.1 (Release 17/6/2013) in
> /alcc/gpfs1/home/exp6/pavlenna/wien
>
>
> start (Tue Nov 26 13:41:14 CET 2013) with lapw0 (50/99 to go)
>
> cycle 1 (Tue Nov 26 13:41:14 CET 2013) (50/99 to go)
>
>> lapw0 -p (13:41:15) starting parallel lapw0 at Tue Nov 26
>> 13:41:15 CET 2013
> -------- .machine0 : 6 processors
> 0.024u 0.024s 0:12.00 0.3% 0+0k 1632+8io 6pf+0w
>> lapw1 -up -p (13:41:27) starting parallel lapw1 at Tue Nov 26
>> 13:41:27 CET 2013
> -> starting parallel LAPW1 jobs at Tue Nov 26 13:41:27 CET 2013
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.016u 0.004s 0:00.75
> 1.3% 0+0k 0+8io 0pf+0w
> Summary of lapw1para:
> alcc92 k=0 user=0 wallclock=0
> 0.068u 0.020s 0:02.19 3.6% 0+0k 0+104io 0pf+0w
>> lapw1 -dn -p (13:41:29) starting parallel lapw1 at Tue Nov 26
>> 13:41:29 CET 2013
> -> starting parallel LAPW1 jobs at Tue Nov 26 13:41:29 CET 2013
> running LAPW1 in parallel mode (using .machines.help)
> 1 number_of_parallel_jobs
> alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.020u 0.004s 0:00.42
> 4.7% 0+0k 0+8io 0pf+0w
> Summary of lapw1para:
> alcc92 k=0 user=0 wallclock=0
> 0.072u 0.028s 0:02.11 4.2% 0+0k 0+104io 0pf+0w
>> lapw2 -up -p (13:41:31) running LAPW2 in parallel mode
> ** LAPW2 crashed!
> 0.248u 0.012s 0:00.73 34.2% 0+0k 8+16io 0pf+0w
> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lapw2para -up
> uplapw2.def failed
>
>> stop error
> ---------------------------------
> In the uplapw2.err I have the following error messages:
>
> Error in LAPW2
> 'LAPW2' - can't open unit: 30
> 'LAPW2' - filename: case.energyup_1
> ** testerror: Error in Parallel LAPW2
> -----------------
> and the following error output messages:
>
> ------------------
> starting on alcc92
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> case.scf1up_1: No such file or directory.
> grep: No match.
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
> PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
> PMPI_Comm_size(76).: Invalid communicator
> case.scf1dn_1: No such file or directory.
> grep: No match.
> FERMI - Error
> cp: cannot stat `.in.tmp': No such file or directory
>
>> stop error
> -----------------------------------
> Please let me know if something might be wrong in the mpi configuration.
> Intel MPI is installed on the cluster.
>
>
> Best regards, N.Pavlenko
>
>
> Am 2013-11-23 13:07, schrieb Peter Blaha:
>> You completely misunderstand how parallelization in wien2k works.
>> Please read the UG (parallelization section) carefully, and note the
>> k-parallel and mpi-parallel options and the cases for which each is
>> useful.
>>
>> I'm not familiar with "Slurm", but it looks as if you ran the same
>> sequential job 6 times in parallel, overwriting the generated files
>> each time.
>>
>>> I have a problem with a parallel run of Wien2k 13.1 on a cluster
>>> with the Slurm environment + Intel MPI.
>>> In a test run on 1 node with 6 cpu cores, I
>>> generated the following .machines file:
>>>
>>> -------.machines
>>> #
>>> lapw0:alcc69
>>> 1:alcc69:6
>>> granularity:1
>>> extrafine:1
>>
>> this is ok, except for the lapw0 line, which would not run in parallel.
>> Use
>>
>> lapw0:alcc69:6
>>
>>> and used the following command in the script:
>>> srun -n 6 runsp_lapw -NI -cc 0.0001 -i 50
>>
>> You are running "runsp_lapw ..." 6 times.
>>
>> wien2k spawns its parallelization itself (provided you have properly
>> installed wien2k and specified the proper "mpirun ... command" during
>> siteconfig), but you must add the -p flag.
>>
>> So the single command
>>
>> runsp_lapw -NI -cc 0.0001 -i 50 -p
>>
>> should start 6 parallel jobs (with your machines file mpi-parallel)
>> itself.
>> (You only need to have permission to do so).
>>
>>>
>>> In the first cycle, lapw0, lapw1 and lapw2 finish successfully,
>>> but after that lcore and mixer continue to run in parallel mode;
>>> they intermix with lapw0 from the second cycle and cause a crash,
>>> as can be seen from the output in case.dayfile:
>>>
>>> --------------------------------------------------------------
>>> cycle 1 (Fri Nov 22 15:32:51 CET 2013) (50/99 to go)
>>>
>>>> lapw0 (15:32:51) > lapw0 (15:32:51) > lapw0
>>>> (15:32:51) > lapw0 (15:32:51) > lapw0
>>> (15:32:51) > lapw0 (15:32:51) 44.798u 0.244s 0:45.75 98.4%
>>> 0+0k 0+0io 0pf+0w
>>>> lapw1 -up (15:33:37)
>>>> lapw1 -up (15:33:38)
>>>> lapw1 -up (15:33:38)
>>>> lapw1 -up (15:33:39)
>>>> lapw1 -up (15:33:39)
>>>> lapw1 -up (15:33:39) _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>>> lapw1 -dn (16:12:48)
>>>> lapw1 -dn (16:13:25)
>>>> lapw1 -dn (16:13:29)
>>>> lapw1 -dn (16:13:30)
>>>> lapw1 -dn (16:13:42)
>>>> lapw1 -dn (16:13:47) _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>> _nb in dscgst.F 512 128
>>>> lapw2 -up (17:07:01)
>>>> lapw2 -up (17:07:57)
>>>> lapw2 -up (17:08:44)
>>>> lapw2 -up (17:08:52)
>>>> lapw2 -dn (17:09:00)
>>>> lapw2 -up (17:09:01)
>>>> lapw2 -up (17:09:02)
>>>> lapw2 -dn (17:09:52)
>>>> lapw2 -dn (17:10:40)
>>>> lapw2 -dn (17:10:56)
>>>> lapw2 -dn (17:11:03)
>>>> lapw2 -dn (17:11:13)
>>>> lcore -up (17:11:40) 0.124u 0.024s 0:00.33 42.4% 0+0k 0+0io
>>>> 0pf+0w
>>>> lcore -dn (17:11:41) 0.120u 0.024s 0:00.30 46.6% 0+0k 0+0io
>>>> 0pf+0w
>>>> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io
>>>> 0pf+0w
>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def
>>> failed
>>>
>>>> stop error
>>>> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io
>>>> 0pf+0w
>>>> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io
>>>> 0pf+0w
>>>> mixer (17:11:42) 0.172u 0.092s 0:00.58 44.8% 0+0k 0+0io
>>>> 0pf+0w
>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def
>>> failed
>>>
>>>> stop error
>>>> lcore -up (17:12:15) 0.132u 0.012s 0:00.20 70.0% 0+0k 0+0io
>>>> 0pf+0w
>>>> lcore -dn (17:12:15) 0.128u 0.012s 0:00.20 65.0% 0+0k 0+0io
>>>> 0pf+0w
>>>> mixer (17:12:16) 0.680u 0.132s 0:02.28 35.5% 0+0k 0+0io
>>>> 0pf+0w
>>> :ENERGY convergence: 0 0 0
>>> :CHARGE convergence: 0 0.0001 0
>>>
>>> cycle 2 (Fri Nov 22 17:12:18 CET 2013) (49/98 to go)
>>>
>>>> lapw0 (17:12:18)
>>>> lcore -up (17:12:58) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io
>>>> 0pf+0w
>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore
>>> uplcore.def failed
>>>
>>>> stop error
>>>> lcore -up (17:13:02) 0.000u 0.008s 0:00.00 0.0% 0+0k 0+0io
>>>> 0pf+0w
>>> error: command /alcc/gpfs1/home/exp6/pavlenna/wien/lcore
>>> uplcore.def failed
>>>
>>>> stop error
>>> ------------------------------------------------------------------------------
>>>
>>> It looks like the .machines file needs some additional details about
>>> the calculation mode for lcore and mixer. How should the .machines
>>> file be configured properly in this case?
>>>
>>>
>>> Best regards, N.Pavlenko
>>>
>>>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgi