[Wien] [WIEN2k] abort of CPU core parallel jobs in NMR calculations of the current

Michael Fechtelkord Michael.Fechtelkord at ruhr-uni-bochum.de
Tue May 14 07:46:36 CEST 2024


Hello all,


just a short final note on the "-quota 8" option when running on 8 
nodes (from Peter: "PPS:   -quota 8 (or 24)  might help and still 
utilizing all cores, but I'm not sure if it would save enough memory in 
the current steps.").

I ran the NMR calculation with "x_nmr_lapw -p -quota 8". Regarding RAM 
usage in the "-mode current" step, there is no real difference from the 
previous runs without -quota. The calculation occupies 122 GB of the 
128 GB of RAM and 20 GB of the 32 GB of swap.
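
As a rough plausibility check of these figures, the sketch below sums the 
htop MEM% values reported further down in this thread for the 4-job run 
and scales them to 8 jobs. It assumes that htop rows with the same -p 
value are threads of one process (so each job is counted once), that the 
memory per job does not depend on the number of jobs, and that the total 
RAM is 128 GB. The function names and the 8 GB reserve are illustrative 
choices, not part of WIEN2k.

```python
# Per-job MEM% taken from the htop listing of the 4-job run (-p 1 .. -p 4).
# Assumption: rows with the same -p value are threads sharing one address space.
MEM_PERCENT_PER_JOB = {1: 14.9, 2: 14.9, 3: 14.9, 4: 15.4}
TOTAL_RAM_GB = 128.0  # total RAM as reported in the message above


def job_memory_gb(mem_percent, total_ram_gb=TOTAL_RAM_GB):
    """Convert an htop MEM% value into GB of resident memory."""
    return mem_percent / 100.0 * total_ram_gb


def estimate_total_gb(n_jobs, per_job_gb):
    """Estimate memory for n_jobs, assuming the same footprint per job."""
    return n_jobs * per_job_gb


def max_jobs_in_ram(per_job_gb, total_ram_gb=TOTAL_RAM_GB, reserve_gb=8.0):
    """Largest number of jobs that fits in RAM while keeping reserve_gb free.

    reserve_gb is a hypothetical safety margin for the OS and other processes.
    """
    return int((total_ram_gb - reserve_gb) // per_job_gb)


per_job = [job_memory_gb(p) for p in MEM_PERCENT_PER_JOB.values()]
mean_per_job = sum(per_job) / len(per_job)

print(f"mean per job       : {mean_per_job:.1f} GB")
print(f"4 jobs (from htop) : {sum(per_job):.1f} GB")
print(f"8 jobs (estimated) : {estimate_total_gb(8, mean_per_job):.1f} GB")
print(f"jobs fitting in RAM: {max_jobs_in_ram(mean_per_job)}")
```

Under these assumptions the 8-job estimate comes to roughly 154 GB, which 
exceeds the 128 GB of RAM and is at least consistent with the swap usage 
reported above. The htop values are a single snapshot and may not reflect 
the peak, so this is an order-of-magnitude check only.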

I will use only 4 nodes for further NMR calculations.
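
For reference, a .machines file for such a 4-job run could be generated 
as sketched below. The layout follows the file quoted further down in 
this thread, with the repeated nmr_integ lines replaced by the single 
"nmr_integ:localhost:8" line that Peter suggested. The helper name 
machines_file and its parameters are hypothetical, and the omp_lapw0 and 
omp_global values are simply carried over from the original file.

```python
def machines_file(n_jobs, host="localhost", omp_lapw0=8, omp_global=2,
                  integ_mpi=8):
    """Return the text of a .machines file with n_jobs k-parallel jobs.

    Hypothetical helper: the keywords mirror the file shown in this thread;
    the default values are taken from it and are not a recommendation.
    """
    lines = [
        "granularity:1",
        f"omp_lapw0:{omp_lapw0}",
        f"omp_global:{omp_global}",
    ]
    # One "1:host" line per k-parallel job.
    lines += [f"1:{host}"] * n_jobs
    # The integ step runs only once, so a single line is enough.
    lines.append(f"nmr_integ:{host}:{integ_mpi}")
    return "\n".join(lines) + "\n"


# Print the file for 4 k-parallel jobs on localhost.
print(machines_file(4), end="")
```

Redirecting the output into .machines in the case directory would give 4 
k-parallel jobs and one integration line. Whether omp_global should be 
adjusted at the same time is not discussed in this thread.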


Best regards,

Michael


On 13.05.2024 at 10:00, Michael Fechtelkord via Wien wrote:
> Hello all,
>
>
> as far as I can see, a job with 8 cores may be faster, but it uses 
> twice the scratch space (8 partial NMR vectors per direction, e.g. 
> nmr_mqx, instead of 4, each with a size depending on the k-mesh). It 
> also doubles the RAM usage of the NMR current calculation, because 8 
> partial vectors per direction are used.
>
> I will try the -quota 8 option, but currently it seems that 
> calculations on eight cores are at high risk of crashing because of the 
> memory and scratch space they need, and that already for 40 k-points. I 
> never had problems with calculations on 4 cores, even with only 64 GB 
> RAM and 1000 k-points.
>
>
> Best regards,
>
> Michael
>
>
> On 12.05.2024 at 18:02, Michael Fechtelkord via Wien wrote:
>> It shows  EXECUTING: /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
>> current -green         -scratch /scratch/WIEN2k/ -noco
>>
>> in all cases and in htop the values I provided below.
>>
>>
>> Best regards,
>>
>> Michael
>>
>>
>> On 12.05.2024 at 16:01, Peter Blaha wrote:
>>> This makes sense.
>>> Please let me know if it shows
>>>
>>>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
>>> current    -green         -scratch /scratch/WIEN2k/ -noco
>>>
>>> or only    nmr -case ...
>>>
>>> In any case, it is running correctly.
>>>
>>> PS: I know that the current step also needs a lot of memory; after 
>>> all, it needs to read the eigenvectors of all eigenvalues, ...
>>>
>>> PPS:   -quota 8 (or 24)  might help and still utilizing all cores, 
>>> but I'm not sure if it would save enough memory in the current steps.
>>>
>>>
>>>
>>> On 12.05.2024 at 10:09, Michael Fechtelkord via Wien wrote:
>>>> Hello all, hello Peter,
>>>>
>>>>
>>>> That is what is really running in the background (from htop: this 
>>>> is a new job with 4 nodes but it was the same with 8 nodes -p 1 - 
>>>> 8), so no nmr_mpi.
>>>>
>>>>
>>>> CPU% MEM% TIME+ Command
>>>>
>>>> 96.0 14.9 19h06:05 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>> current -green -scratch /scratch/WIEN2k/ -noco -p 3
>>>>
>>>> 95.8 14.9 19h05:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>> current -green -scratch /scratch/WIEN2k/ -noco -p 1
>>>>
>>>> 95.1 14.9 19h06:00 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>> current -green -scratch /scratch/WIEN2k/ -noco -p 2
>>>>
>>>> 95.5 15.4 19h08:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>> current -green -scratch /scratch/WIEN2k/ -noco -p 4
>>>>
>>>> 94.6 14.9 18h35:33 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>> current -green -scratch /scratch/WIEN2k/ -noco -p 3
>>>>
>>>> 93.3 15.4 18h36:24 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>> current -green -scratch /scratch/WIEN2k/ -noco -p 4
>>>>
>>>> 93.3 14.9 18h33:02 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>> current -green -scratch /scratch/WIEN2k/ -noco -p 2
>>>>
>>>> 94.0 14.9 18h38:44 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>> current -green -scratch /scratch/WIEN2k/ -noco -p 1
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Michael
>>>>
>>>>
>>>> On 11.05.2024 at 20:10, Michael Fechtelkord via Wien wrote:
>>>>> Hello Peter,
>>>>>
>>>>>
>>>>> I just use "x_nmr_lapw -p" and the rest is initiated by the nmr 
>>>>> script. The line "/usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
>>>>> current -green         -scratch /scratch/WIEN2k/ -noco " is just 
>>>>> part of the whole procedure and was not started by me manually. (I 
>>>>> only copied the last lines of the calculation.)
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> On 11.05.2024 at 18:08, Peter Blaha wrote:
>>>>>> Hello Michael,
>>>>>>
>>>>>> I don't understand the line:
>>>>>>
>>>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode current 
>>>>>> -green         -scratch /scratch/WIEN2k/ -noco
>>>>>>
>>>>>> The mode current should run only k-parallel, not in mpi ??
>>>>>>
>>>>>> PS: The repetition of
>>>>>>
>>>>>> nmr_integ:localhost    is useless.
>>>>>>
>>>>>> nmr mode integ runs only once (not k-parallel, sumpara has 
>>>>>> already summed up the currents)
>>>>>>
>>>>>> But one can use       nmr_integ:localhost:8
>>>>>>
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> On 11.05.2024 at 16:19, Michael Fechtelkord via Wien wrote:
>>>>>>> Hello Peter,
>>>>>>>
>>>>>>> this is the .machines file content:
>>>>>>>
>>>>>>> granularity:1
>>>>>>> omp_lapw0:8
>>>>>>> omp_global:2
>>>>>>> 1:localhost
>>>>>>> 1:localhost
>>>>>>> 1:localhost
>>>>>>> 1:localhost
>>>>>>> 1:localhost
>>>>>>> 1:localhost
>>>>>>> 1:localhost
>>>>>>> 1:localhost
>>>>>>> nmr_integ:localhost
>>>>>>> nmr_integ:localhost
>>>>>>> nmr_integ:localhost
>>>>>>> nmr_integ:localhost
>>>>>>> nmr_integ:localhost
>>>>>>> nmr_integ:localhost
>>>>>>> nmr_integ:localhost
>>>>>>> nmr_integ:localhost
>>>>>>>
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> On 11.05.2024 at 14:58, Peter Blaha wrote:
>>>>>>>> Hmm. ?
>>>>>>>>
>>>>>>>> Are you using   k-parallel  AND  mpi-parallel ?? This could 
>>>>>>>> overload the machine.
>>>>>>>>
>>>>>>>> What does the .machines file look like?
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10.05.2024 at 18:15, Michael Fechtelkord via Wien wrote:
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> the following problem occurs when I use the NMR part of 
>>>>>>>>> WIEN2k (23.2) on an openSUSE Leap 15.5 Intel platform. WIEN2k 
>>>>>>>>> was compiled using oneAPI 2024.1 ifort and gcc 13.2.1. I am 
>>>>>>>>> using ELPA 2024.03.01, Libxc 6.22, fftw 3.3.10 and MPICH 4.2.1 
>>>>>>>>> and the oneAPI 2024.1 MKL libraries. The CPU is an i9-14900K 
>>>>>>>>> with 24 cores, of which I use eight for the calculations. The 
>>>>>>>>> RAM is 130 GB, with a swap file of 16 GB on a Samsung PCIe 4.0 
>>>>>>>>> NVMe SSD. The memory speed is 5600 MT/s.
>>>>>>>>>
>>>>>>>>> The structure is a layer silicate, and to simulate the ratio 
>>>>>>>>> Si:Al = 3:1 I currently use a 1:1:2 supercell. The new 
>>>>>>>>> structure has monoclinic symmetry P 2/c (the original is 
>>>>>>>>> C 2/c) and contains 40 atoms (K, Al, Si, O, and F).
>>>>>>>>>
>>>>>>>>> I use 3 NMR LOs for K and O and 10 for Si, Al, and F (where I 
>>>>>>>>> need the chemical shifts). The k-mesh has 40 k-points.
>>>>>>>>>
>>>>>>>>> Interestingly, the RAM is sufficient during the NMR vector 
>>>>>>>>> calculations (always under 100 GB of RAM occupied) and at the 
>>>>>>>>> beginning of the electron current calculation. However, RAM 
>>>>>>>>> use then increases to a critical point during the calculation, 
>>>>>>>>> and more and more data is moved to the swap file, which is 
>>>>>>>>> sometimes 80% occupied.
>>>>>>>>>
>>>>>>>>> As you can see, this time only one core failed because of 
>>>>>>>>> memory overflow. But with 48 k-points, 3 cores crashed and 
>>>>>>>>> with them the whole current calculation. The reason for the 
>>>>>>>>> crash is clear to me. But I do not understand why the current 
>>>>>>>>> calculation reacts so sensitively with so few atoms and such a 
>>>>>>>>> small k-mesh. I have run calculations with more atoms and a 
>>>>>>>>> 1000 k-point mesh on 4 cores, and they worked fine. So could 
>>>>>>>>> the Intel MKL library be the source of the failure? Should I 
>>>>>>>>> rather go back to 4 cores, even with longer calculation times?
>>>>>>>>>
>>>>>>>>> Have a nice weekend, everyone!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best wishes from
>>>>>>>>>
>>>>>>>>> Michael Fechtelkord
>>>>>>>>>
>>>>>>>>> -----------------------------------------------
>>>>>>>>>
>>>>>>>>> cd ./  ...  x lcore  -f MS_2M1_Al2
>>>>>>>>>  CORE  END
>>>>>>>>> 0.685u 0.028s 0:00.71 98.5%     0+0k 2336+16168io 5pf+0w
>>>>>>>>>
>>>>>>>>> lcore        ....  ready
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 
>>>>>>>>> -mode current    -green -scratch /scratch/WIEN2k/ -noco
>>>>>>>>>
>>>>>>>>> [1] 20253
>>>>>>>>> [2] 20257
>>>>>>>>> [3] 20261
>>>>>>>>> [4] 20265
>>>>>>>>> [5] 20269
>>>>>>>>> [6] 20273
>>>>>>>>> [7] 20277
>>>>>>>>> [8] 20281
>>>>>>>>> [8]  + Aborted                       ( cd $dir; $exec2 >> 
>>>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>>> [7]  + Done                          ( cd $dir; $exec2 >> 
>>>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>>> [6]  + Done                          ( cd $dir; $exec2 >> 
>>>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>>> [5]  + Done                          ( cd $dir; $exec2 >> 
>>>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>>> [4]  + Done                          ( cd $dir; $exec2 >> 
>>>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>>> [3]  + Done                          ( cd $dir; $exec2 >> 
>>>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>>> [2]  + Done                          ( cd $dir; $exec2 >> 
>>>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>>> [1]  + Done                          ( cd $dir; $exec2 >> 
>>>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>>>>>
>>>>>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>>>>>>> sumpara  -p 8    -green -scratch /scratch/WIEN2k/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> current        ....  ready
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  EXECUTING:     mpirun -np 1 -machinefile .machine_nmrinteg 
>>>>>>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode integ -green
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> nmr:  integration  ... done in   4032.3s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> stop
>>>>>>>>>
-- 
Dr. Michael Fechtelkord

Institut für Geologie, Mineralogie und Geophysik
Ruhr-Universität Bochum
Universitätsstr. 150
D-44780 Bochum

Phone: +49 (234) 32-24380
Fax:  +49 (234) 32-04380
Email: Michael.Fechtelkord at ruhr-uni-bochum.de
Web Page: https://www.ruhr-uni-bochum.de/kristallographie/kc/mitarbeiter/fechtelkord/
