[Wien] [WIEN2k] abort of CPU core parallel jobs in NMR calculations of the current

Peter Blaha peter.blaha at tuwien.ac.at
Sun May 12 16:01:55 CEST 2024


This makes sense.
Please let me know if it shows

  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
current    -green         -scratch /scratch/WIEN2k/ -noco

or only    nmr -case ...

In any case, it is running correctly.

PS: I know that the current step also needs a lot of memory; after all, 
it has to read the eigenvectors for all eigenvalues, ...
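
To see how much memory the k-parallel current jobs hold in total, the process list can be summed up. The following is only a minimal sketch: the function name sum_nmr_mem is hypothetical, and it assumes a Linux procps-style ps that reports RSS in KiB.

```shell
#!/bin/sh
# Summarise memory held by running "nmr ... -mode current" processes.
# Input (stdin): one process per line as "PID %MEM RSS_KiB COMMAND ARGS...".
# sum_nmr_mem is a hypothetical helper name, not part of WIEN2k.
sum_nmr_mem() {
    awk '
        # Field 4 is the executable: accept a path ending in "/nmr" or a bare "nmr".
        # This skips nmr_mpi and the awk process itself.
        $4 ~ /(^|\/)nmr$/ && /-mode +current/ {
            n++; mem += $2; rss += $3
        }
        END {
            printf "%d processes, %.1f%% MEM, %.1f GiB RSS\n", n, mem, rss / 1048576
        }'
}
```

A live check would then be "ps -eo pid=,pmem=,rss=,args= | sum_nmr_mem". ps lists processes rather than threads, so OpenMP threads that htop may show as separate rows are counted only once here.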

PPS:   -quota 8 (or 24)  might help while still utilizing all cores, but 
I'm not sure if it would save enough memory in the current step.
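
As a sketch of the .machines change suggested earlier in this thread (one nmr_integ line with ":8" instead of eight repeated lines), the file could be generated as below. The output name .machines.sketch is a hypothetical choice so that an existing .machines is not overwritten, the keyword is spelled granularity here, and all other lines are taken from the posted file.

```shell
#!/bin/sh
# Write a trimmed .machines variant to a scratch file.
# Assumption: ".machines.sketch" is a hypothetical name used for illustration only.
out=.machines.sketch
{
    echo "granularity:1"
    echo "omp_lapw0:8"
    echo "omp_global:2"
    # eight k-parallel jobs on the local machine, as in the posted file
    for i in 1 2 3 4 5 6 7 8; do
        echo "1:localhost"
    done
    # a single nmr_integ line; ":8" as suggested earlier in the thread
    echo "nmr_integ:localhost:8"
} > "$out"
cat "$out"
```

Each "1:localhost" line starts one k-parallel nmr process, so using fewer of these lines should lower the peak memory of the current step, at the price of a longer run time.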



On 12.05.2024 at 10:09, Michael Fechtelkord via Wien wrote:
> Hello all, hello Peter,
> 
> 
> This is what is actually running in the background (from htop; this is a 
> new job with 4 nodes, but it was the same with 8 nodes, -p 1 to 8), so no 
> nmr_mpi.
> 
> 
> CPU% MEM% TIME+ Command
> 
> 96.0 14.9 19h06:05 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current 
> -green -scratch /scratch/WIEN2k/ -noco -p 3
> 
> 95.8 14.9 19h05:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current 
> -green -scratch /scratch/WIEN2k/ -noco -p 1
> 
> 95.1 14.9 19h06:00 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current 
> -green -scratch /scratch/WIEN2k/ -noco -p 2
> 
> 95.5 15.4 19h08:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current 
> -green -scratch /scratch/WIEN2k/ -noco -p 4
> 
> 94.6 14.9 18h35:33 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current 
> -green -scratch /scratch/WIEN2k/ -noco -p 3
> 
> 93.3 15.4 18h36:24 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current 
> -green -scratch /scratch/WIEN2k/ -noco -p 4
> 
> 93.3 14.9 18h33:02 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current 
> -green -scratch /scratch/WIEN2k/ -noco -p 2
> 
> 94.0 14.9 18h38:44 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current 
> -green -scratch /scratch/WIEN2k/ -noco -p 1
> 
> 
> Regards,
> 
> Michael
> 
> 
> On 11.05.2024 at 20:10, Michael Fechtelkord via Wien wrote:
>> Hello Peter,
>>
>>
>> I just use "x_nmr_lapw -p" and the rest is initiated by the nmr 
>> script. The line "/usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
>> current -green         -scratch /scratch/WIEN2k/ -noco " is just part 
>> of the whole procedure and was not initiated by me manually. (I only 
>> copied the last lines of the calculation.)
>>
>>
>> Best regards,
>>
>> Michael
>>
>>
>> On 11.05.2024 at 18:08, Peter Blaha wrote:
>>> Hello Michael,
>>>
>>> I don't understand the line:
>>>
>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode current 
>>> -green         -scratch /scratch/WIEN2k/ -noco
>>>
>>> The mode current should run only k-parallel, not in mpi ??
>>>
>>> PS: The repetition of
>>>
>>> nmr_integ:localhost    is useless.
>>>
>>> nmr mode integ runs only once (not k-parallel, sumpara has already 
>>> summed up the currents)
>>>
>>> But one can use       nmr_integ:localhost:8
>>>
>>>
>>> Best regards
>>>
>>> On 11.05.2024 at 16:19, Michael Fechtelkord via Wien wrote:
>>>> Hello Peter,
>>>>
>>>> this is the .machines file content:
>>>>
>>>> granulartity:1
>>>> omp_lapw0:8
>>>> omp_global:2
>>>> 1:localhost
>>>> 1:localhost
>>>> 1:localhost
>>>> 1:localhost
>>>> 1:localhost
>>>> 1:localhost
>>>> 1:localhost
>>>> 1:localhost
>>>> nmr_integ:localhost
>>>> nmr_integ:localhost
>>>> nmr_integ:localhost
>>>> nmr_integ:localhost
>>>> nmr_integ:localhost
>>>> nmr_integ:localhost
>>>> nmr_integ:localhost
>>>> nmr_integ:localhost
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Michael
>>>>
>>>>
>>>> On 11.05.2024 at 14:58, Peter Blaha wrote:
>>>>> Hmm. ?
>>>>>
>>>>> Are you using   k-parallel  AND  mpi-parallel ??  This could 
>>>>> overload the machine.
>>>>>
>>>>> What does the .machines file look like?
>>>>>
>>>>>
>>>>> On 10.05.2024 at 18:15, Michael Fechtelkord via Wien wrote:
>>>>>> Dear all,
>>>>>>
>>>>>>
>>>>>> I have run into the following problem using the NMR part of WIEN2k 
>>>>>> (23.2) on an openSUSE Leap 15.5 Intel platform. WIEN2k was compiled 
>>>>>> using oneAPI 2024.1 ifort and gcc 13.2.1. I am using ELPA 
>>>>>> 2024.03.01, Libxc 6.22, fftw 3.3.10 and MPICH 4.2.1 and the 
>>>>>> oneAPI 2024.1 MKL libraries. The CPU is an i9-14900K with 24 cores, 
>>>>>> of which I use eight for the calculations. The RAM is 130 GB, plus a 
>>>>>> swap file of 16 GB on a Samsung PCIe 4.0 NVMe SSD. The memory 
>>>>>> transfer rate is 5600 MT/s.
>>>>>>
>>>>>> The structure is a layer silicate, and to simulate the ratio 
>>>>>> Si:Al = 3:1 I currently use a 1:1:2 supercell. The monoclinic 
>>>>>> symmetry of the new structure (the original is C 2/c) is P 2/c, and 
>>>>>> it contains 40 atoms (K, Al, Si, O, and F).
>>>>>>
>>>>>> I use 3 NMR LOs for K and O and 10 for Si, Al, and F (where I need 
>>>>>> the chemical shifts). The k-mesh has 40 k-points.
>>>>>>
>>>>>> The interesting thing is that the RAM is sufficient during the NMR 
>>>>>> vector calculations (always under 100 GB RAM occupied) and at the 
>>>>>> beginning of the electron current calculation. However, RAM use 
>>>>>> increases to a critical point during the calculation, and more and 
>>>>>> more data is moved to the swap file, which is sometimes 80% 
>>>>>> occupied.
>>>>>>
>>>>>> As you can see, this time only one core failed because of memory 
>>>>>> overflow. But with 48 k-points, 3 cores crashed and with them the 
>>>>>> whole current calculation. The reason for the crash is clear to me. 
>>>>>> But I do not understand why the current calculation is so sensitive 
>>>>>> with so few atoms and a small k-mesh. I have run calculations with 
>>>>>> more atoms and a 1000 k-point mesh on 4 cores, and they worked fine. 
>>>>>> So could the Intel MKL library be the source of the failure? 
>>>>>> Or should I go back to 4 cores, even with longer calculation times?
>>>>>>
>>>>>> Have a nice weekend, everyone!
>>>>>>
>>>>>>
>>>>>> Best wishes from
>>>>>>
>>>>>> Michael Fechtelkord
>>>>>>
>>>>>> -----------------------------------------------
>>>>>>
>>>>>> cd ./  ...  x lcore  -f MS_2M1_Al2
>>>>>>  CORE  END
>>>>>> 0.685u 0.028s 0:00.71 98.5%     0+0k 2336+16168io 5pf+0w
>>>>>>
>>>>>> lcore        ....  ready
>>>>>>
>>>>>>
>>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
>>>>>> current    -green         -scratch /scratch/WIEN2k/ -noco
>>>>>>
>>>>>> [1] 20253
>>>>>> [2] 20257
>>>>>> [3] 20261
>>>>>> [4] 20265
>>>>>> [5] 20269
>>>>>> [6] 20273
>>>>>> [7] 20277
>>>>>> [8] 20281
>>>>>> [8]  + Aborted                       ( cd $dir; $exec2 >> 
>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>> [7]  + Done                          ( cd $dir; $exec2 >> 
>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>> [6]  + Done                          ( cd $dir; $exec2 >> 
>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>> [5]  + Done                          ( cd $dir; $exec2 >> 
>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>> [4]  + Done                          ( cd $dir; $exec2 >> 
>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>> [3]  + Done                          ( cd $dir; $exec2 >> 
>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>> [2]  + Done                          ( cd $dir; $exec2 >> 
>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>> [1]  + Done                          ( cd $dir; $exec2 >> 
>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>
>>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>>>> sumpara  -p 8    -green -scratch /scratch/WIEN2k/
>>>>>>
>>>>>>
>>>>>> current        ....  ready
>>>>>>
>>>>>>
>>>>>>  EXECUTING:     mpirun -np 1 -machinefile .machine_nmrinteg 
>>>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode integ -green
>>>>>>
>>>>>>
>>>>>> nmr:  integration  ... done in   4032.3s
>>>>>>
>>>>>>
>>>>>> stop
>>>>>>

-- 
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300
Email: peter.blaha at tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------

