[Wien] [WIEN2k] abort of CPU core parallel jobs in NMR calculations of the current

Michael Fechtelkord Michael.Fechtelkord at ruhr-uni-bochum.de
Mon May 13 10:32:19 CEST 2024


Dear Laurence,


I used 40 k-points.


The integration part (-mode integ) causes no problems; the memory-consuming 
part is the current part (-mode current).

Your hint about lapw1 shows even more clearly that it would be safer to use 4 
parallel calculations instead of eight without losing much performance 
(the 14900K has only 8 performance cores; the other 16 efficiency cores 
are slower).
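
For reference, a .machines file along those lines (a sketch only: four 
single-core k-parallel jobs instead of eight, with the omp settings carried 
over unchanged from the file quoted further down in this thread, and a single 
nmr_integ line with 8 cores as Peter suggests) could look like:

```
granularity:1
omp_lapw0:8
omp_global:2
1:localhost
1:localhost
1:localhost
1:localhost
nmr_integ:localhost:8
```

With omp_global:2, these four k-parallel jobs would occupy 8 cores for lapw1, 
matching the 8 performance cores of the 14900K.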


Best regards,

Michael


On 13.05.2024 at 10:14, Laurence Marks wrote:
> For my own curiosity, is it 40,000 k-points or 40 k-points?
>
> N.B., as Peter suggested, did you try using mpi, which would be four 
> of nmr_integ:localhost:2?
> I suspect (but might be wrong) that this will reduce your memory usage 
> by a factor of 2, and will only be slightly slower than what you have. 
> If needed you can also go to 4 mpi. Of course you have to have 
> compiled it...
>
> N.N.B., you presumably realise that you are using 16 cores for lapw1, 
> as each k-point has 2 cores.
>
>
>
> On Mon, May 13, 2024 at 4:00 PM Michael Fechtelkord via Wien 
> <wien at zeus.theochem.tuwien.ac.at> wrote:
>
>     Hello all,
>
>
>     as far as I can see it, a job with 8 cores may be faster, but uses
>     double the scratch space (8 partial nmr vectors per direction, with
>     size depending on the k-mesh, e.g. nmr_mqx, instead of 4 partial
>     vectors), which also doubles the RAM usage of the NMR current
>     calculation because 8 partial vectors per direction are used.
>
>     I will try the -quota 8 option, but currently it seems that
>     calculations on eight cores are at high risk of crashing because of
>     the memory and scratch space they need, and that already at 40
>     k-points. I never had problems with calculations on 4 cores, even
>     with only 64 GB RAM and 1000 k-points.
>
>
>     Best regards,
>
>     Michael
>
>
>     On 12.05.2024 at 18:02, Michael Fechtelkord via Wien wrote:
>     > It shows  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2
>     > -mode current    -green         -scratch /scratch/WIEN2k/ -noco
>     >
>     > in all cases and in htop the values I provided below.
>     >
>     >
>     > Best regards,
>     >
>     > Michael
>     >
>     >
>     > On 12.05.2024 at 16:01, Peter Blaha wrote:
>     >> This makes sense.
>     >> Please let me know if it shows
>     >>
>     >>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode
>     >> current    -green         -scratch /scratch/WIEN2k/ -noco
>     >>
>     >> or only    nmr -case ...
>     >>
>     >> In any case, it is running correctly.
>     >>
>     >> PS: I know that also the current step needs a lot of memory, after
>     >> all it needs to read the eigenvectors of all eigenvalues, ...
>     >>
>     >> PPS:   -quota 8 (or 24) might help while still utilizing all
>     >> cores, but I'm not sure whether it would save enough memory in
>     >> the current step.
>     >>
>     >>
>     >>
>     >> On 12.05.2024 at 10:09, Michael Fechtelkord via Wien wrote:
>     >>> Hello all, hello Peter,
>     >>>
>     >>>
>     >>> This is what is really running in the background (from htop;
>     >>> this is a new job with 4 parallel jobs, but it was the same with
>     >>> 8 jobs, -p 1 to 8), so no nmr_mpi.
>     >>>
>     >>>
>     >>> CPU% MEM% TIME+ Command
>     >>>
>     >>> 96.0 14.9 19h06:05 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>> current -green -scratch /scratch/WIEN2k/ -noco -p 3
>     >>>
>     >>> 95.8 14.9 19h05:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>> current -green -scratch /scratch/WIEN2k/ -noco -p 1
>     >>>
>     >>> 95.1 14.9 19h06:00 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>> current -green -scratch /scratch/WIEN2k/ -noco -p 2
>     >>>
>     >>> 95.5 15.4 19h08:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>> current -green -scratch /scratch/WIEN2k/ -noco -p 4
>     >>>
>     >>> 94.6 14.9 18h35:33 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>> current -green -scratch /scratch/WIEN2k/ -noco -p 3
>     >>>
>     >>> 93.3 15.4 18h36:24 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>> current -green -scratch /scratch/WIEN2k/ -noco -p 4
>     >>>
>     >>> 93.3 14.9 18h33:02 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>> current -green -scratch /scratch/WIEN2k/ -noco -p 2
>     >>>
>     >>> 94.0 14.9 18h38:44 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>> current -green -scratch /scratch/WIEN2k/ -noco -p 1
>     >>>
>     >>>
>     >>> Regards,
>     >>>
>     >>> Michael
>     >>>
>     >>>
>     >>> On 11.05.2024 at 20:10, Michael Fechtelkord via Wien wrote:
>     >>>> Hello Peter,
>     >>>>
>     >>>>
>     >>>> I just use "x_nmr_lapw -p" and the rest is initiated by the nmr
>     >>>> script. The line "/usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2
>     >>>> -mode current -green         -scratch /scratch/WIEN2k/ -noco" is
>     >>>> just part of the whole procedure and not initiated by me
>     >>>> manually. (I only copied the last lines of the calculation.)
>     >>>>
>     >>>>
>     >>>> Best regards,
>     >>>>
>     >>>> Michael
>     >>>>
>     >>>>
>     >>>> On 11.05.2024 at 18:08, Peter Blaha wrote:
>     >>>>> Hello Michael,
>     >>>>>
>     >>>>> I don't understand the line:
>     >>>>>
>     >>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode current
>     >>>>> -green         -scratch /scratch/WIEN2k/ -noco
>     >>>>>
>     >>>>> The mode current should run only k-parallel, not in mpi ??
>     >>>>>
>     >>>>> PS: The repetition of
>     >>>>>
>     >>>>> nmr_integ:localhost    is useless.
>     >>>>>
>     >>>>> nmr mode integ runs only once (not k-parallel, sumpara has
>     already
>     >>>>> summed up the currents)
>     >>>>>
>     >>>>> But one can use nmr_integ:localhost:8
>     >>>>>
>     >>>>>
>     >>>>> Best regards
>     >>>>>
>     >>>>> On 11.05.2024 at 16:19, Michael Fechtelkord via Wien wrote:
>     >>>>>> Hello Peter,
>     >>>>>>
>     >>>>>> this is the .machines file content:
>     >>>>>>
>     >>>>>> granularity:1
>     >>>>>> omp_lapw0:8
>     >>>>>> omp_global:2
>     >>>>>> 1:localhost
>     >>>>>> 1:localhost
>     >>>>>> 1:localhost
>     >>>>>> 1:localhost
>     >>>>>> 1:localhost
>     >>>>>> 1:localhost
>     >>>>>> 1:localhost
>     >>>>>> 1:localhost
>     >>>>>> nmr_integ:localhost
>     >>>>>> nmr_integ:localhost
>     >>>>>> nmr_integ:localhost
>     >>>>>> nmr_integ:localhost
>     >>>>>> nmr_integ:localhost
>     >>>>>> nmr_integ:localhost
>     >>>>>> nmr_integ:localhost
>     >>>>>> nmr_integ:localhost
>     >>>>>>
>     >>>>>>
>     >>>>>> Best regards,
>     >>>>>>
>     >>>>>> Michael
>     >>>>>>
>     >>>>>>
>     >>>>>> On 11.05.2024 at 14:58, Peter Blaha wrote:
>     >>>>>>> Hmm. ?
>     >>>>>>>
>     >>>>>>> Are you using   k-parallel  AND mpi-parallel ??  This could
>     >>>>>>> overload the machine.
>     >>>>>>>
>     >>>>>>> What does the .machines file look like?
>     >>>>>>>
>     >>>>>>>
>     >>>>>>> On 10.05.2024 at 18:15, Michael Fechtelkord via Wien wrote:
>     >>>>>>>> Dear all,
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>> the following problem occurs for me when using the NMR part
>     >>>>>>>> of WIEN2k (23.2) on an openSUSE Leap 15.5 Intel platform.
>     >>>>>>>> WIEN2k was compiled using oneAPI 2024.1 ifort and gcc
>     >>>>>>>> 13.2.1. I am using ELPA 2024.03.01, Libxc 6.22, fftw 3.3.10,
>     >>>>>>>> MPICH 4.2.1, and the oneAPI 2024.1 MKL libraries. The CPU is
>     >>>>>>>> an i9-14900K with 24 cores, of which I use eight for the
>     >>>>>>>> calculations. The RAM is 130 GB, with a 16 GB swap file on a
>     >>>>>>>> Samsung PCIe 4.0 NVMe SSD. The memory transfer rate is
>     >>>>>>>> 5600 MT/s.
>     >>>>>>>>
>     >>>>>>>> The structure is a layer silicate; to simulate the ratio
>     >>>>>>>> Si:Al = 3:1 I currently use a 1:1:2 supercell. The
>     >>>>>>>> monoclinic symmetry of the new structure (the original is
>     >>>>>>>> C 2/c) is P 2/c, and the cell contains 40 atoms (K, Al, Si,
>     >>>>>>>> O, and F).
>     >>>>>>>>
>     >>>>>>>> I use 3 NMR LOs for K and O, and 10 for Si, Al, and F
>     >>>>>>>> (where I need the chemical shifts). The k-mesh has 40
>     >>>>>>>> k-points.
>     >>>>>>>>
>     >>>>>>>> The interesting thing is that the RAM is sufficient during
>     >>>>>>>> the NMR vector calculations (always under 100 GB occupied)
>     >>>>>>>> and at the beginning of the electron current calculation.
>     >>>>>>>> However, the RAM usage increases to a critical point during
>     >>>>>>>> the calculation, and more and more data is swapped out to
>     >>>>>>>> the swap file, which is sometimes 80% full.
>     >>>>>>>>
>     >>>>>>>> As you can see, this time only one core failed because of a
>     >>>>>>>> memory overflow. But with 48 k-points, 3 cores crashed, and
>     >>>>>>>> with them the whole current calculation. The reason for the
>     >>>>>>>> crash is clear to me, but I do not understand why the
>     >>>>>>>> current calculation reacts so sensitively with so few atoms
>     >>>>>>>> and such a small k-mesh. I have run calculations with more
>     >>>>>>>> atoms and a 1000-k-point mesh on 4 cores; they worked fine.
>     >>>>>>>> So could the Intel MKL library be the source of the failure?
>     >>>>>>>> Should I rather go back to 4 cores, even with longer
>     >>>>>>>> calculation times?
>     >>>>>>>>
>     >>>>>>>> Have all a nice weekend!
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>> Best wishes from
>     >>>>>>>>
>     >>>>>>>> Michael Fechtelkord
>     >>>>>>>>
>     >>>>>>>> -----------------------------------------------
>     >>>>>>>>
>     >>>>>>>> cd ./  ...  x lcore  -f MS_2M1_Al2
>     >>>>>>>>  CORE  END
>     >>>>>>>> 0.685u 0.028s 0:00.71 98.5%     0+0k 2336+16168io 5pf+0w
>     >>>>>>>>
>     >>>>>>>> lcore        ....  ready
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>>  EXECUTING: /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2
>     >>>>>>>> -mode current -green         -scratch /scratch/WIEN2k/ -noco
>     >>>>>>>>
>     >>>>>>>> [1] 20253
>     >>>>>>>> [2] 20257
>     >>>>>>>> [3] 20261
>     >>>>>>>> [4] 20265
>     >>>>>>>> [5] 20269
>     >>>>>>>> [6] 20273
>     >>>>>>>> [7] 20277
>     >>>>>>>> [8] 20281
>     >>>>>>>> [8]  + Aborted                       ( cd $dir; $exec2 >>
>     >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>     >>>>>>>> [7]  + Done                          ( cd $dir; $exec2 >>
>     >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>     >>>>>>>> [6]  + Done                          ( cd $dir; $exec2 >>
>     >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>     >>>>>>>> [5]  + Done                          ( cd $dir; $exec2 >>
>     >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>     >>>>>>>> [4]  + Done                          ( cd $dir; $exec2 >>
>     >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>     >>>>>>>> [3]  + Done                          ( cd $dir; $exec2 >>
>     >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>     >>>>>>>> [2]  + Done                          ( cd $dir; $exec2 >>
>     >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>     >>>>>>>> [1]  + Done                          ( cd $dir; $exec2 >>
>     >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
>     >>>>>>>>
>     >>>>>>>>  EXECUTING: /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
>     >>>>>>>> sumpara  -p 8    -green -scratch /scratch/WIEN2k/
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>> current        ....  ready
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>>  EXECUTING:     mpirun -np 1 -machinefile .machine_nmrinteg
>     >>>>>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode integ -green
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>> nmr:  integration  ... done in   4032.3s
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>> stop
>     >>>>>>>>
>     >>
>     > _______________________________________________
>     > Wien mailing list
>     > Wien at zeus.theochem.tuwien.ac.at
>     > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>     > SEARCH the MAILING-LIST at:
>     >
>     http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
>     -- 
>     Dr. Michael Fechtelkord
>
>     Institut für Geologie, Mineralogie und Geophysik
>     Ruhr-Universität Bochum
>     Universitätsstr. 150
>     D-44780 Bochum
>
>     Phone: +49 (234) 32-24380
>     Fax:  +49 (234) 32-04380
>     Email: Michael.Fechtelkord at ruhr-uni-bochum.de
>     Web Page:
>     https://www.ruhr-uni-bochum.de/kristallographie/kc/mitarbeiter/fechtelkord/
>
>
>
>
> -- 
> Professor Laurence Marks (Laurie)
> Northwestern University
> Webpage <http://www.numis.northwestern.edu> and Google Scholar link 
> <http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en>
> "Research is to see what everybody else has seen, and to think what 
> nobody else has thought", Albert Szent-Györgyi
>

-- 
Dr. Michael Fechtelkord

Institut für Geologie, Mineralogie und Geophysik
Ruhr-Universität Bochum
Universitätsstr. 150
D-44780 Bochum

Phone: +49 (234) 32-24380
Fax:  +49 (234) 32-04380
Email: Michael.Fechtelkord at ruhr-uni-bochum.de
Web Page: https://www.ruhr-uni-bochum.de/kristallographie/kc/mitarbeiter/fechtelkord/

