[Wien] [WIEN2k] abort of CPU core parallel jobs in NMR calculations of the current

Michael Fechtelkord Michael.Fechtelkord at ruhr-uni-bochum.de
Sun May 12 18:02:20 CEST 2024


It shows  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 
-mode current    -green         -scratch /scratch/WIEN2k/ -noco

in all cases, and htop shows the values I provided below.


Best regards,

Michael


On 12.05.2024 at 16:01, Peter Blaha wrote:
> This makes sense.
> Please let me know if it shows
>
>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
> current    -green         -scratch /scratch/WIEN2k/ -noco
>
> or only    nmr -case ...
>
> In any case, it is running correctly.
>
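
The question of whether the running binary is nmr_mpi or plain nmr can be checked from the process command lines. Below is a minimal sketch; the helper names are hypothetical and not part of WIEN2k, and it assumes the command lines were collected beforehand, for example with `ps -eo args`.

```python
import os


def executable_name(cmdline):
    """Return the basename of the first token of a command line ("" if empty)."""
    tokens = cmdline.split()
    return os.path.basename(tokens[0]) if tokens else ""


def count_nmr_binaries(cmdlines):
    """Count how many command lines run nmr and how many run nmr_mpi."""
    counts = {"nmr": 0, "nmr_mpi": 0}
    for line in cmdlines:
        name = executable_name(line)
        if name in counts:  # ignore every other process
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Illustrative input only; in practice, pass the lines printed by `ps -eo args`.
    sample = [
        "/usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -noco -p 3",
        "/usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode integ -green",
    ]
    print(count_nmr_binaries(sample))
```

Matching on the basename rather than on a substring avoids counting nmr_mpi as nmr.
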
> PS: I know that the current step also needs a lot of memory; after 
> all, it needs to read the eigenvectors of all eigenvalues, ...
>
> PPS:   -quota 8 (or 24)  might help while still utilizing all cores, 
> but I'm not sure whether it would save enough memory in the current 
> step.
>
>
>
> On 12.05.2024 at 10:09, Michael Fechtelkord via Wien wrote:
>> Hello all, hello Peter,
>>
>>
>> This is what is really running in the background (from htop; this is 
>> a new job with 4 nodes, but it was the same with 8 nodes, -p 1 to 8), 
>> so no nmr_mpi.
>>
>>
>> CPU% MEM% TIME+ Command
>>
>> 96.0 14.9 19h06:05 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>> current -green -scratch /scratch/WIEN2k/ -noco -p 3
>>
>> 95.8 14.9 19h05:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>> current -green -scratch /scratch/WIEN2k/ -noco -p 1
>>
>> 95.1 14.9 19h06:00 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>> current -green -scratch /scratch/WIEN2k/ -noco -p 2
>>
>> 95.5 15.4 19h08:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>> current -green -scratch /scratch/WIEN2k/ -noco -p 4
>>
>> 94.6 14.9 18h35:33 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>> current -green -scratch /scratch/WIEN2k/ -noco -p 3
>>
>> 93.3 15.4 18h36:24 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>> current -green -scratch /scratch/WIEN2k/ -noco -p 4
>>
>> 93.3 14.9 18h33:02 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>> current -green -scratch /scratch/WIEN2k/ -noco -p 2
>>
>> 94.0 14.9 18h38:44 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>> current -green -scratch /scratch/WIEN2k/ -noco -p 1
>>
>>
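
The memory figures in the htop listing above can be turned into a rough estimate of the total RAM demand. The sketch below uses the first two numbers of each row and the -p index. It assumes that the first two columns are CPU% and MEM%, that two rows with the same -p index are threads of one process (consistent with omp_global:2) and so share their memory, and that the memory per job stays the same when more jobs run in parallel. None of this is confirmed in the thread, so the result is only an order-of-magnitude check.

```python
# Rows copied from the htop listing: (CPU%, MEM%, value of -p).
# Assumption: rows with the same -p index are threads of one process,
# so their MEM% is counted only once.
rows = [
    (96.0, 14.9, 3), (95.8, 14.9, 1), (95.1, 14.9, 2), (95.5, 15.4, 4),
    (94.6, 14.9, 3), (93.3, 15.4, 4), (93.3, 14.9, 2), (94.0, 14.9, 1),
]


def mem_per_job(rows):
    """Return {p_index: MEM%}, keeping one entry per k-parallel job."""
    jobs = {}
    for _cpu, mem, p in rows:
        jobs[p] = max(mem, jobs.get(p, 0.0))
    return jobs


def total_mem_percent(rows):
    """Sum MEM% over distinct jobs."""
    return sum(mem_per_job(rows).values())


def extrapolate(rows, n_jobs):
    """Scale the average per-job MEM% to n_jobs jobs (assumes linear scaling)."""
    jobs = mem_per_job(rows)
    return n_jobs * sum(jobs.values()) / len(jobs)


if __name__ == "__main__":
    print(f"4 jobs: {total_mem_percent(rows):.1f} % of RAM")
    print(f"8 jobs: {extrapolate(rows, 8):.1f} % of RAM (estimate)")
```

Under these assumptions, four jobs occupy roughly 60 % of the RAM, and eight jobs would need roughly 120 %, which would be in line with the swapping and the aborted job reported for the 8-job run.
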
>> Regards,
>>
>> Michael
>>
>>
>> On 11.05.2024 at 20:10, Michael Fechtelkord via Wien wrote:
>>> Hello Peter,
>>>
>>>
>>> I just use "x_nmr_lapw -p" and the rest is initiated by the nmr 
>>> script. The line "/usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
>>> current -green         -scratch /scratch/WIEN2k/ -noco " is just 
>>> part of the whole procedure and was not initiated by me manually. (I 
>>> only copied the last lines of the calculation.)
>>>
>>>
>>> Best regards,
>>>
>>> Michael
>>>
>>>
>>> On 11.05.2024 at 18:08, Peter Blaha wrote:
>>>> Hello Michael,
>>>>
>>>> I don't understand the line:
>>>>
>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode current 
>>>> -green         -scratch /scratch/WIEN2k/ -noco
>>>>
>>>> Mode current should run only k-parallel, not in MPI?
>>>>
>>>> PS: The repetition of
>>>>
>>>> nmr_integ:localhost    is useless.
>>>>
>>>> nmr mode integ runs only once (not k-parallel, sumpara has already 
>>>> summed up the currents)
>>>>
>>>> But one can use       nmr_integ:localhost:8
>>>>
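
The advice above (a single nmr_integ line with a core count instead of eight repeated lines) can be sketched as a small generator for the .machines text. The function name and its parameters are hypothetical. The default of four k-parallel jobs only mirrors the original poster's idea of going back to 4 cores and is not a recommendation from the thread.

```python
def build_machines(n_kpar=4, omp_global=2, omp_lapw0=8, integ_cores=8,
                   host="localhost"):
    """Return .machines text with n_kpar k-parallel jobs and one nmr_integ line."""
    lines = [
        "granularity:1",
        f"omp_lapw0:{omp_lapw0}",
        f"omp_global:{omp_global}",
    ]
    # One "1:host" line per k-parallel job.
    lines += [f"1:{host}"] * n_kpar
    # Mode integ runs only once, so a single line with a core count is enough.
    lines.append(f"nmr_integ:{host}:{integ_cores}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_machines(), end="")
```

Fewer "1:localhost" lines mean fewer nmr processes holding eigenvectors in memory at the same time, at the cost of a longer run.
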
>>>>
>>>> Best regards
>>>>
>>>> On 11.05.2024 at 16:19, Michael Fechtelkord via Wien wrote:
>>>>> Hello Peter,
>>>>>
>>>>> this is the .machines file content:
>>>>>
>>>>> granularity:1
>>>>> omp_lapw0:8
>>>>> omp_global:2
>>>>> 1:localhost
>>>>> 1:localhost
>>>>> 1:localhost
>>>>> 1:localhost
>>>>> 1:localhost
>>>>> 1:localhost
>>>>> 1:localhost
>>>>> 1:localhost
>>>>> nmr_integ:localhost
>>>>> nmr_integ:localhost
>>>>> nmr_integ:localhost
>>>>> nmr_integ:localhost
>>>>> nmr_integ:localhost
>>>>> nmr_integ:localhost
>>>>> nmr_integ:localhost
>>>>> nmr_integ:localhost
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> On 11.05.2024 at 14:58, Peter Blaha wrote:
>>>>>> Hmm. ?
>>>>>>
>>>>>> Are you using k-parallel AND mpi-parallel? This could 
>>>>>> overload the machine.
>>>>>>
>>>>>> What does the .machines file look like?
>>>>>>
>>>>>>
>>>>>> On 10.05.2024 at 18:15, Michael Fechtelkord via Wien wrote:
>>>>>>> Dear all,
>>>>>>>
>>>>>>>
>>>>>>> the following problem occurs when I use the NMR part of WIEN2k 
>>>>>>> (23.2) on an openSUSE Leap 15.5 Intel platform. WIEN2k was 
>>>>>>> compiled using oneAPI 2024.1 ifort and gcc 13.2.1. I am using 
>>>>>>> ELPA 2024.03.01, Libxc 6.22, fftw 3.3.10 and MPICH 4.2.1 and the 
>>>>>>> oneAPI 2024.1 MKL libraries. The CPU is an i9-14900K with 24 
>>>>>>> cores, of which I use eight for the calculations. The RAM is 
>>>>>>> 130 GB, with a swap file of 16 GB on a Samsung PCIe 4.0 NVMe SSD. 
>>>>>>> The memory transfer rate is 5600 MT/s.
>>>>>>>
>>>>>>> The structure is a layer silicate, and to simulate the ratio of 
>>>>>>> Si:Al = 3:1 I currently use a 1:1:2 supercell. The monoclinic 
>>>>>>> symmetry of the new structure (the original is C 2/c) is P 2/c, 
>>>>>>> and it contains 40 atoms (K, Al, Si, O, and F).
>>>>>>>
>>>>>>> I use 3 NMR LOs for K and O and 10 for Si, Al, and F (where I 
>>>>>>> need the chemical shifts). The k-mesh has 40 k-points.
>>>>>>>
>>>>>>> The interesting thing is that the RAM is sufficient during the 
>>>>>>> NMR vector calculations (always under 100 GB of RAM occupied) and 
>>>>>>> at the beginning of the electron current calculation. However, 
>>>>>>> the RAM use rises to a critical point during the calculation, and 
>>>>>>> more and more data is swapped out to the swap file, which is 
>>>>>>> sometimes 80% occupied.
>>>>>>>
>>>>>>> As you can see, this time only one core failed because of memory 
>>>>>>> overflow. But with 48 k-points, 3 cores crashed and with them the 
>>>>>>> whole current calculation. The reason for the crash is clear to 
>>>>>>> me. But I do not understand why the current calculation reacts so 
>>>>>>> sensitively with so few atoms and such a small k-mesh. I have run 
>>>>>>> calculations with more atoms and a 1000 k-point mesh on 4 cores, 
>>>>>>> and they worked fine. So can it be that the Intel MKL library is 
>>>>>>> the source of the failure? Or should I rather go back to 4 cores, 
>>>>>>> even with longer calculation times?
>>>>>>>
>>>>>>> Have a nice weekend, everyone!
>>>>>>>
>>>>>>>
>>>>>>> Best wishes from
>>>>>>>
>>>>>>> Michael Fechtelkord
>>>>>>>
>>>>>>> -----------------------------------------------
>>>>>>>
>>>>>>> cd ./  ...  x lcore  -f MS_2M1_Al2
>>>>>>>  CORE  END
>>>>>>> 0.685u 0.028s 0:00.71 98.5%     0+0k 2336+16168io 5pf+0w
>>>>>>>
>>>>>>> lcore        ....  ready
>>>>>>>
>>>>>>>
>>>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
>>>>>>> current    -green         -scratch /scratch/WIEN2k/ -noco
>>>>>>>
>>>>>>> [1] 20253
>>>>>>> [2] 20257
>>>>>>> [3] 20261
>>>>>>> [4] 20265
>>>>>>> [5] 20269
>>>>>>> [6] 20273
>>>>>>> [7] 20277
>>>>>>> [8] 20281
>>>>>>> [8]  + Abgebrochen                   ( cd $dir; $exec2 >> 
>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>> [7]  + Fertig                        ( cd $dir; $exec2 >> 
>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>> [6]  + Fertig                        ( cd $dir; $exec2 >> 
>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>> [5]  + Fertig                        ( cd $dir; $exec2 >> 
>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>> [4]  + Fertig                        ( cd $dir; $exec2 >> 
>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>> [3]  + Fertig                        ( cd $dir; $exec2 >> 
>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>> [2]  + Fertig                        ( cd $dir; $exec2 >> 
>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>> [1]  + Fertig                        ( cd $dir; $exec2 >> 
>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>
>>>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>>>>> sumpara  -p 8    -green -scratch /scratch/WIEN2k/
>>>>>>>
>>>>>>>
>>>>>>> current        ....  ready
>>>>>>>
>>>>>>>
>>>>>>>  EXECUTING:     mpirun -np 1 -machinefile .machine_nmrinteg 
>>>>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode integ -green
>>>>>>>
>>>>>>>
>>>>>>> nmr:  integration  ... done in   4032.3s
>>>>>>>
>>>>>>>
>>>>>>> stop
>>>>>>>
>

