[Wien] [WIEN2k] abort of CPU core parallel jobs in NMR calculations of the current

Michael Fechtelkord Michael.Fechtelkord at ruhr-uni-bochum.de
Sun May 12 10:09:26 CEST 2024


Hello all, hello Peter,


This is what is actually running in the background (from htop). This is a 
new run with 4 nodes, but it was the same with 8 nodes (-p 1 to -p 8), so 
there is no nmr_mpi.


CPU%  MEM%  TIME+     Command
96.0  14.9  19h06:05  /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -scratch /scratch/WIEN2k/ -noco -p 3
95.8  14.9  19h05:10  /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -scratch /scratch/WIEN2k/ -noco -p 1
95.1  14.9  19h06:00  /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -scratch /scratch/WIEN2k/ -noco -p 2
95.5  15.4  19h08:10  /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -scratch /scratch/WIEN2k/ -noco -p 4
94.6  14.9  18h35:33  /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -scratch /scratch/WIEN2k/ -noco -p 3
93.3  15.4  18h36:24  /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -scratch /scratch/WIEN2k/ -noco -p 4
93.3  14.9  18h33:02  /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -scratch /scratch/WIEN2k/ -noco -p 2
94.0  14.9  18h38:44  /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode current -green -scratch /scratch/WIEN2k/ -noco -p 1
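
To put the MEM% column in perspective, here is a rough Python sketch (not 
part of WIEN2k) that adds up the memory share of the running nmr jobs. It 
assumes that the two rows with the same -p index are threads of one process 
sharing the same memory (omp_global:2 in the .machines file quoted below), 
and it takes the 130 GB of RAM from my first mail. The names mem_per_job and 
TOTAL_RAM_GB are made up for illustration, and the commands are shortened.

```python
import re

TOTAL_RAM_GB = 130  # assumption: RAM size as stated in the first mail

# CPU%, MEM%, TIME+ and the (shortened) command from the htop listing above
HTOP_ROWS = """\
96.0 14.9 19h06:05 nmr -mode current -p 3
95.8 14.9 19h05:10 nmr -mode current -p 1
95.1 14.9 19h06:00 nmr -mode current -p 2
95.5 15.4 19h08:10 nmr -mode current -p 4
94.6 14.9 18h35:33 nmr -mode current -p 3
93.3 15.4 18h36:24 nmr -mode current -p 4
93.3 14.9 18h33:02 nmr -mode current -p 2
94.0 14.9 18h38:44 nmr -mode current -p 1
"""


def mem_per_job(rows):
    """Map each k-parallel index (-p N) to its MEM%, counting it only once."""
    jobs = {}
    for line in rows.strip().splitlines():
        mem_percent = float(line.split()[1])
        match = re.search(r"-p\s*(\d+)\s*$", line)
        if match is None:
            continue  # not a k-parallel nmr row
        index = int(match.group(1))
        # Assumption: rows with the same index are threads sharing one memory image.
        jobs[index] = max(mem_percent, jobs.get(index, 0.0))
    return jobs


jobs = mem_per_job(HTOP_ROWS)
total_percent = sum(jobs.values())
total_gb = total_percent / 100 * TOTAL_RAM_GB
print(f"{len(jobs)} jobs: {total_percent:.1f}% of RAM, about {total_gb:.0f} GB")

# Naive extrapolation, assuming each job needs the same share with 8 jobs.
percent_for_8 = total_percent / len(jobs) * 8
print(f"8 jobs would need about {percent_for_8:.0f}% of RAM")
```

If the thread assumption holds, the four jobs occupy roughly 60% of the RAM, 
and eight jobs of the same size would not fit into 130 GB, which would be 
consistent with the swapping I saw with 8 nodes.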


Regards,

Michael
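
P.S. Following Peter's remark in the quoted mail below that repeating 
nmr_integ:localhost is useless and that nmr_integ:localhost:8 can be used 
instead, a .machines file for such a run could be generated with a small 
Python helper like the one below. This is only a sketch: build_machines is a 
made-up helper and not part of WIEN2k, and the omp_lapw0 and omp_global 
values are simply taken over from my file.

```python
def build_machines(n_jobs, host="localhost", omp_lapw0=8, omp_global=2, integ_cores=8):
    """Return the text of a .machines file with n_jobs k-parallel lines.

    Hypothetical helper; the keys mirror the .machines file quoted in this thread.
    """
    lines = [
        "granularity:1",
        f"omp_lapw0:{omp_lapw0}",
        f"omp_global:{omp_global}",
    ]
    lines += [f"1:{host}"] * n_jobs  # one line per k-parallel job
    lines.append(f"nmr_integ:{host}:{integ_cores}")  # single entry, as Peter suggested
    return "\n".join(lines) + "\n"


machines_text = build_machines(n_jobs=4)
print(machines_text, end="")
```

The defaults reproduce the settings of my file with four k-parallel lines 
and a single nmr_integ entry.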


On 11.05.2024 at 20:10, Michael Fechtelkord via Wien wrote:
> Hello Peter,
>
>
> I just use "x_nmr_lapw -p" and the rest is started by the nmr 
> script. The line "/usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
> current -green -scratch /scratch/WIEN2k/ -noco" is just part 
> of the whole procedure and was not started by me manually (I only 
> copied the last lines of the calculation).
>
>
> Best regards,
>
> Michael
>
>
> On 11.05.2024 at 18:08, Peter Blaha wrote:
>> Hello Michael,
>>
>> I don't understand the line:
>>
>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode current 
>> -green         -scratch /scratch/WIEN2k/ -noco
>>
>> Mode current should run only k-parallel, not with MPI?
>>
>> PS: Repeating the line
>>
>> nmr_integ:localhost
>>
>> is useless. The nmr mode integ runs only once (it is not k-parallel; 
>> sumpara has already summed up the currents).
>>
>> But one can use nmr_integ:localhost:8
>>
>>
>> Best regards
>>
>> On 11.05.2024 at 16:19, Michael Fechtelkord via Wien wrote:
>>> Hello Peter,
>>>
>>> this is the .machines file content:
>>>
>>> granularity:1
>>> omp_lapw0:8
>>> omp_global:2
>>> 1:localhost
>>> 1:localhost
>>> 1:localhost
>>> 1:localhost
>>> 1:localhost
>>> 1:localhost
>>> 1:localhost
>>> 1:localhost
>>> nmr_integ:localhost
>>> nmr_integ:localhost
>>> nmr_integ:localhost
>>> nmr_integ:localhost
>>> nmr_integ:localhost
>>> nmr_integ:localhost
>>> nmr_integ:localhost
>>> nmr_integ:localhost
>>>
>>>
>>> Best regards,
>>>
>>> Michael
>>>
>>>
>>> On 11.05.2024 at 14:58, Peter Blaha wrote:
>>>> Hmm. ?
>>>>
>>>> Are you using k-parallel AND mpi-parallel? This could 
>>>> overload the machine.
>>>>
>>>> What does the .machines file look like?
>>>>
>>>>
>>>> On 10.05.2024 at 18:15, Michael Fechtelkord via Wien wrote:
>>>>> Dear all,
>>>>>
>>>>>
>>>>> I am running into the following problem with the NMR part of WIEN2k 
>>>>> (23.2) on an openSUSE Leap 15.5 Intel platform. WIEN2k was compiled 
>>>>> using oneAPI 2024.1 ifort and gcc 13.2.1. I am using ELPA 
>>>>> 2024.03.01, Libxc 6.22, fftw 3.3.10, MPICH 4.2.1 and the 
>>>>> oneAPI 2024.1 MKL libraries. The CPU is an i9-14900K with 24 cores, 
>>>>> of which I use eight for the calculations. There is 130 GB of RAM and 
>>>>> a 16 GB swap file on a Samsung PCIe 4.0 NVMe SSD. The bus transfer 
>>>>> rate is 5600 MT/s.
>>>>>
>>>>> The structure is a layer silicate, and to simulate a ratio of 
>>>>> Si:Al = 3:1 I currently use a 1:1:2 supercell. The new structure has 
>>>>> the monoclinic symmetry P 2/c (the original is C 2/c) and 
>>>>> contains 40 atoms (K, Al, Si, O, and F).
>>>>>
>>>>> I use 3 NMR LOs for K and O and 10 for Si, Al, and F (for which I 
>>>>> need the chemical shifts). The k-mesh has 40 k-points.
>>>>>
>>>>> The interesting thing is that the RAM is sufficient during the NMR 
>>>>> vector calculations (always less than 100 GB of RAM in use) and at 
>>>>> the beginning of the electron current calculation. However, RAM 
>>>>> use then grows to a critical point during the calculation, and more 
>>>>> and more data is moved to the swap file, which is sometimes 80% 
>>>>> full.
>>>>>
>>>>> As you can see, this time only one core failed because of memory 
>>>>> overflow. With 48 k-points, however, 3 cores crashed, and with them 
>>>>> the whole current calculation. The reason for the crash is clear to 
>>>>> me, but I do not understand why the current calculation is so 
>>>>> sensitive with so few atoms and such a small k-mesh. I have run 
>>>>> calculations with more atoms and a 1000 k-point mesh on 4 cores, and 
>>>>> they worked fine. Could the Intel MKL library be the source of the 
>>>>> failure? Or should I go back to 4 cores, even with longer 
>>>>> calculation times?
>>>>>
>>>>> Have a nice weekend, everyone!
>>>>>
>>>>>
>>>>> Best wishes from
>>>>>
>>>>> Michael Fechtelkord
>>>>>
>>>>> -----------------------------------------------
>>>>>
>>>>> cd ./  ...  x lcore  -f MS_2M1_Al2
>>>>>  CORE  END
>>>>> 0.685u 0.028s 0:00.71 98.5%     0+0k 2336+16168io 5pf+0w
>>>>>
>>>>> lcore        ....  ready
>>>>>
>>>>>
>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode 
>>>>> current    -green         -scratch /scratch/WIEN2k/ -noco
>>>>>
>>>>> [1] 20253
>>>>> [2] 20257
>>>>> [3] 20261
>>>>> [4] 20265
>>>>> [5] 20269
>>>>> [6] 20273
>>>>> [7] 20277
>>>>> [8] 20281
>>>>> [8]  + Aborted   ( cd $dir; $exec2 >> nmr.out.${loop} ) >& nmr.err.$loop
>>>>> [7]  + Done      ( cd $dir; $exec2 >> nmr.out.${loop} ) >& nmr.err.$loop
>>>>> [6]  + Done      ( cd $dir; $exec2 >> nmr.out.${loop} ) >& nmr.err.$loop
>>>>> [5]  + Done      ( cd $dir; $exec2 >> nmr.out.${loop} ) >& nmr.err.$loop
>>>>> [4]  + Done      ( cd $dir; $exec2 >> nmr.out.${loop} ) >& nmr.err.$loop
>>>>> [3]  + Done      ( cd $dir; $exec2 >> nmr.out.${loop} ) >& nmr.err.$loop
>>>>> [2]  + Done      ( cd $dir; $exec2 >> nmr.out.${loop} ) >& nmr.err.$loop
>>>>> [1]  + Done      ( cd $dir; $exec2 >> nmr.out.${loop} ) >& nmr.err.$loop
>>>>>
>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode 
>>>>> sumpara  -p 8    -green -scratch /scratch/WIEN2k/
>>>>>
>>>>>
>>>>> current        ....  ready
>>>>>
>>>>>
>>>>>  EXECUTING:     mpirun -np 1 -machinefile .machine_nmrinteg 
>>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode integ -green
>>>>>
>>>>>
>>>>> nmr:  integration  ... done in   4032.3s
>>>>>
>>>>>
>>>>> stop
>>>>>
-- 
Dr. Michael Fechtelkord

Institut für Geologie, Mineralogie und Geophysik
Ruhr-Universität Bochum
Universitätsstr. 150
D-44780 Bochum

Phone: +49 (234) 32-24380
Fax:  +49 (234) 32-04380
Email:Michael.Fechtelkord at ruhr-uni-bochum.de
Web Page:https://www.ruhr-uni-bochum.de/kristallographie/kc/mitarbeiter/fechtelkord/


