[Wien] Problem with wien2k 13.1 parallel for Slurm+intel mpi

Natalia Pavlenko natalia.pavlenko at physik.uni-augsburg.de
Tue Nov 26 14:03:38 CET 2013


Dear Prof. Blaha,

thanks a lot for your reply. I have corrected the .machines file
(the node with 6 cores is automatically chosen):
-----------------
lapw0: alcc92:6
1:alcc92:6
granularity:1
extrafine:1
-----------------
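For reference, a minimal Slurm batch sketch (the #SBATCH settings are
hypothetical) that builds such a .machines file from the node Slurm
assigns and then starts a single driver command with -p, as suggested
below:
-----------------
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6

# build .machines from the first (and only) node of the allocation
node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
cat > .machines <<EOF
lapw0:$node:6
1:$node:6
granularity:1
extrafine:1
EOF

# one single driver command; WIEN2k spawns the MPI processes itself
runsp_lapw -NI -cc 0.0001 -i 50 -p
-----------------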
Nevertheless, I got the following output in case.dayfile:
-------case.dayfile--------

Calculating case in /alcc/gpfs1/home/exp6/pavlenna/work/laosto/ovac/case
on alcc92 with PID 9804
using WIEN2k_13.1 (Release 17/6/2013) in /alcc/gpfs1/home/exp6/pavlenna/wien


     start       (Tue Nov 26 13:41:14 CET 2013) with lapw0 (50/99 to go)

     cycle 1     (Tue Nov 26 13:41:14 CET 2013)  (50/99 to go)

>   lapw0 -p    (13:41:15) starting parallel lapw0 at Tue Nov 26 13:41:15 CET 2013
-------- .machine0 : 6 processors
0.024u 0.024s 0:12.00 0.3%      0+0k 1632+8io 6pf+0w
>   lapw1  -up -p       (13:41:27) starting parallel lapw1 at Tue Nov 26 13:41:27 CET 2013
->  starting parallel LAPW1 jobs at Tue Nov 26 13:41:27 CET 2013
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
      alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.016u 0.004s 0:00.75 1.3%    0+0k 0+8io 0pf+0w
    Summary of lapw1para:
    alcc92        k=0     user=0  wallclock=0
0.068u 0.020s 0:02.19 3.6%      0+0k 0+104io 0pf+0w
>   lapw1  -dn -p       (13:41:29) starting parallel lapw1 at Tue Nov 26 13:41:29 CET 2013
->  starting parallel LAPW1 jobs at Tue Nov 26 13:41:29 CET 2013
running LAPW1 in parallel mode (using .machines.help)
1 number_of_parallel_jobs
      alcc92 alcc92 alcc92 alcc92 alcc92 alcc92(6) 0.020u 0.004s 0:00.42 4.7%    0+0k 0+8io 0pf+0w
    Summary of lapw1para:
    alcc92        k=0     user=0  wallclock=0
0.072u 0.028s 0:02.11 4.2%      0+0k 0+104io 0pf+0w
>   lapw2 -up -p        (13:41:31) running LAPW2 in parallel mode
**  LAPW2 crashed!
0.248u 0.012s 0:00.73 34.2%     0+0k 8+16io 0pf+0w
error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/lapw2para -up uplapw2.def   failed

>   stop error
---------------------------------
In uplapw2.err I get the following error messages:

Error in LAPW2
  'LAPW2' - can't open unit: 30
  'LAPW2' -        filename: case.energyup_1
**  testerror: Error in Parallel LAPW2
-----------------
and the following messages in the job's error output:

------------------
starting on alcc92
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
[the above error appears six times in total]
case.scf1up_1: No such file or directory.
grep: No match.
Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7e356c) failed
PMPI_Comm_size(76).: Invalid communicator
[the above error appears six times in total]
case.scf1dn_1: No such file or directory.
grep: No match.
FERMI - Error
cp: cannot stat `.in.tmp': No such file or directory

>   stop error
-----------------------------------
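To see at which step the files go missing, one can check directly
after the lapw1 step whether the parallel output files were created
at all (the _1 suffix is WIEN2k's naming for the first parallel job):
-----------------
# in the case directory, after the lapw1 step (e.g. x lapw1 -up -p):
ls -l *.energyup_* *.scf1up_*
-----------------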
Please let me know whether something might be wrong in the MPI
configuration; Intel MPI is installed on the cluster.
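For completeness, one way to check which MPI library the WIEN2k
parallel binaries were actually linked against, and which MPI
launcher the batch environment provides ($WIENROOT should point to
the WIEN2k installation):
-----------------
# which MPI do the parallel binaries link against?
ldd $WIENROOT/lapw1_mpi | grep -i mpi
ldd $WIENROOT/lapw2_mpi | grep -i mpi

# which launcher and version does the job pick up?
which mpirun
mpirun -V
-----------------
A mismatch there (binaries linked against one MPI implementation but
launched with another) is one known cause of "Invalid communicator"
errors.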


Best regards, N.Pavlenko


On 2013-11-23 13:07, Peter Blaha wrote:
> You completely misunderstand how the parallelization in WIEN2k works.
> Please read the UG carefully (parallelization), and note the
> k-parallel and mpi-parallel options and the cases for which each is
> useful.
> 
> I'm not familiar with Slurm, but it looks as if you ran the same
> sequential job 6 times in parallel, overwriting the generated files
> all the time.
> 
>> I have a problem with a parallel run of WIEN2k 13.1 on a cluster
>> with the Slurm environment + Intel MPI. In a test run on 1 node
>> with 6 CPU cores, I generated the following .machines file:
>> 
>> -------.machines
>> #
>> lapw0:alcc69
>> 1:alcc69:6
>> granularity:1
>> extrafine:1
> 
> this is ok, except for the lapw0 line, which would not run in
> parallel. Use
> 
> lapw0:alcc69:6
> 
>> and used the following command in the script:
>> srun -n 6  runsp_lapw -NI -cc 0.0001 -i 50
> 
> You are running "runsp_lapw ..." 6 times.
> 
> WIEN2k spawns its parallel processes itself (provided you have
> properly installed WIEN2k and specified the proper "mpirun ..."
> command during siteconfig), but you must add the -p flag.
> 
> So the single command
> 
> runsp_lapw -NI -cc 0.0001 -i 50 -p
> 
> should itself start 6 parallel jobs (mpi-parallel, with your
> .machines file).
> (You only need to have permission to do so.)
> 
>> 
>> In the first cycle, lapw0, lapw1, and lapw2 finish successfully,
>> but after that lcore and mixer continue to run in parallel mode;
>> they intermix with lapw0 from the second cycle and cause a crash,
>> as can be seen from the output in case.dayfile:
>> 
>> --------------------------------------------------------------
>>      cycle 1     (Fri Nov 22 15:32:51 CET 2013)  (50/99 to go)
>> 
>>>   lapw0       (15:32:51) >   lapw0    (15:32:51) >   lapw0    (15:32:51) >   lapw0    (15:32:51) >   lapw0    (15:32:51) >   lapw0    (15:32:51) 44.798u 0.244s 0:45.75 98.4%  0+0k 0+0io 0pf+0w
>>>   lapw1  -up          (15:33:37)
>>>   lapw1  -up          (15:33:38)
>>>   lapw1  -up          (15:33:38)
>>>   lapw1  -up          (15:33:39)
>>>   lapw1  -up          (15:33:39)
>>>   lapw1  -up          (15:33:39)  _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>>   lapw1  -dn          (16:12:48)
>>>   lapw1  -dn          (16:13:25)
>>>   lapw1  -dn          (16:13:29)
>>>   lapw1  -dn          (16:13:30)
>>>   lapw1  -dn          (16:13:42)
>>>   lapw1  -dn          (16:13:47)  _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>   _nb in dscgst.F         512         128
>>>   lapw2 -up           (17:07:01)
>>>   lapw2 -up           (17:07:57)
>>>   lapw2 -up           (17:08:44)
>>>   lapw2 -up           (17:08:52)
>>>   lapw2 -dn           (17:09:00)
>>>   lapw2 -up           (17:09:01)
>>>   lapw2 -up           (17:09:02)
>>>   lapw2 -dn           (17:09:52)
>>>   lapw2 -dn           (17:10:40)
>>>   lapw2 -dn           (17:10:56)
>>>   lapw2 -dn           (17:11:03)
>>>   lapw2 -dn           (17:11:13)
>>>   lcore -up   (17:11:40) 0.124u 0.024s 0:00.33 42.4%  0+0k 0+0io 0pf+0w
>>>   lcore -dn   (17:11:41) 0.120u 0.024s 0:00.30 46.6%  0+0k 0+0io 0pf+0w
>>>   mixer       (17:11:42) 0.172u 0.092s 0:00.58 44.8%  0+0k 0+0io 0pf+0w
>> error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
>> 
>>>   stop error
>>>   lcore -up   (17:12:15) 0.132u 0.012s 0:00.20 70.0%  0+0k 0+0io 0pf+0w
>>>   lcore -dn   (17:12:15) 0.128u 0.012s 0:00.20 65.0%  0+0k 0+0io 0pf+0w
>>>   mixer       (17:11:42) 0.172u 0.092s 0:00.58 44.8%  0+0k 0+0io 0pf+0w
>> error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/mixer mixer.def failed
>> 
>>>   stop error
>>>   lcore -up   (17:12:15) 0.132u 0.012s 0:00.20 70.0%  0+0k 0+0io 0pf+0w
>>>   lcore -dn   (17:12:15) 0.128u 0.012s 0:00.20 65.0%  0+0k 0+0io 0pf+0w
>>>   mixer       (17:12:16) 0.680u 0.132s 0:02.28 35.5%  0+0k 0+0io 0pf+0w
>> :ENERGY convergence:  0 0 0
>> :CHARGE convergence:  0 0.0001 0
>> 
>>      cycle 2     (Fri Nov 22 17:12:18 CET 2013)  (49/98 to go)
>> 
>>>   lapw0       (17:12:18)
>>>   lcore -up   (17:12:58) 0.000u 0.008s 0:00.00 0.0%   0+0k 0+0io 0pf+0w
>> error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
>> 
>>>   stop error
>>>   lcore -up   (17:13:02) 0.000u 0.008s 0:00.00 0.0%   0+0k 0+0io 0pf+0w
>> error: command   /alcc/gpfs1/home/exp6/pavlenna/wien/lcore uplcore.def failed
>> 
>>>   stop error
>> ------------------------------------------------------------------------------
>> 
>> It looks like the .machines file needs some additional details about
>> the calculation mode for lcore and mixer. How should the .machines
>> file be configured in this case?
>> 
>> 
>> Best regards, N.Pavlenko
>> 
>> 


