[Wien] [WIEN2k] abort of CPU core parallel jobs in NMR calculations of the current

Laurence Marks laurence.marks at gmail.com
Mon May 13 10:14:40 CEST 2024


For my own curiosity, is it 40,000 k-points or 40 k-points?

N.B., as Peter suggested, did you try using mpi, which would be four of
nmr_integ:localhost:2
I suspect (but might be wrong) that this will reduce your memory usage by a
factor of 2, and will only be slightly slower than what you have. If needed
you can also go to 4 mpi. Of course you have to have compiled it...
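
To put rough numbers on the memory question, here is a back-of-envelope
sketch in Python. It is not part of WIEN2k. It assumes that the second
column of the htop listing quoted further down (about 15 for each nmr
process) is the memory share of one k-parallel job, that the repeated -p
values there are threads of the same four jobs, and that each job keeps
the same footprint when going from 4 to 8 jobs. The 130 GB of RAM and the
16 GB swap file are taken from the first message of the thread; the
function and variable names are mine.

```python
# Back-of-envelope estimate only; all inputs are assumptions read off the thread.
RAM_GB = 130.0            # RAM reported in the first message
SWAP_GB = 16.0            # swap file reported in the first message
PER_JOB_FRACTION = 0.15   # ~15 % of RAM per nmr job (assumed to be htop MEM%)


def total_memory_gb(n_jobs, fraction_per_job, ram_gb):
    """Memory needed if every k-parallel job holds the same share of RAM."""
    return n_jobs * fraction_per_job * ram_gb


def fits_in_memory(n_jobs, fraction_per_job, ram_gb, swap_gb):
    """True if the estimated total stays within RAM plus swap."""
    return total_memory_gb(n_jobs, fraction_per_job, ram_gb) <= ram_gb + swap_gb


for n_jobs in (4, 8):
    need = total_memory_gb(n_jobs, PER_JOB_FRACTION, RAM_GB)
    ok = fits_in_memory(n_jobs, PER_JOB_FRACTION, RAM_GB, SWAP_GB)
    print(f"{n_jobs} jobs: about {need:.0f} GB, fits in RAM+swap: {ok}")
```

Under these assumptions 4 jobs need about 78 GB and 8 jobs about 156 GB,
which is more than the 146 GB of RAM plus swap, so it is at least
consistent with the crashes reported for 8 jobs.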

N.N.B., you presumably realise that you are using 16 cores for lapw1, as
each k-point has 2 cores.
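
The count behind that remark, as a small Python sketch. It reads a
.machines file in a deliberately simplified way (this is not how the
WIEN2k scripts parse it): every line of the form weight:host or
weight:host:n counts as one k-parallel job with n mpi processes (1 if n
is absent), and each process gets omp_global threads. An omp_lapw1 line,
which would override omp_global for lapw1, is ignored here. The file
content follows the .machines file quoted below; the function name is
made up.

```python
def lapw1_threads(machines_text):
    """Estimate the threads lapw1 would use from .machines text (simplified)."""
    omp_global = 1
    processes = 0
    for raw in machines_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition(":")
        if key == "omp_global":
            omp_global = int(rest)
        elif key.isdigit():
            # k-parallel line: weight:host or weight:host:n_mpi
            fields = rest.split(":")
            processes += int(fields[1]) if len(fields) > 1 else 1
        # other keys (granularity, omp_lapw0, nmr_integ, ...) are skipped
    return processes * omp_global


MACHINES = "\n".join(
    ["granularity:1", "omp_lapw0:8", "omp_global:2"]
    + ["1:localhost"] * 8
    + ["nmr_integ:localhost"] * 8
)

# 8 k-parallel jobs x 2 OpenMP threads each = 16
print(lapw1_threads(MACHINES))
```

Four lines of 1:localhost:2 with omp_global:1 would give 8 by the same
count.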



On Mon, May 13, 2024 at 4:00 PM Michael Fechtelkord via Wien <
wien at zeus.theochem.tuwien.ac.at> wrote:

> Hello all,
>
>
> as far as I can see, a job with 8 cores may be faster, but it uses
> double the space on scratch (8 partial nmr vectors per direction, e.g.
> nmr_mqx, with size depending on the k-mesh, instead of 4 partial
> vectors) and that also doubles the RAM usage of the NMR current
> calculation because 8 partial vectors per direction are used.
>
> I will try the -quota 8 option, but currently it seems that calculations
> on eight cores are at high risk of crashing because of the memory and
> scratch space they need, and that already for 40k points. I never had
> problems with calculations on 4 cores, even with only 64 GB RAM and 1000k
> points.
>
>
> Best regards,
>
> Michael
>
>
> On 12.05.2024 at 18:02, Michael Fechtelkord via Wien wrote:
> > It shows  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2
> > -mode current    -green         -scratch /scratch/WIEN2k/ -noco
> >
> > in all cases and in htop the values I provided below.
> >
> >
> > Best regards,
> >
> > Michael
> >
> >
> > On 12.05.2024 at 16:01, Peter Blaha wrote:
> >> This makes sense.
> >> Please let me know if it shows
> >>
> >>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode
> >> current    -green         -scratch /scratch/WIEN2k/ -noco
> >>
> >> or only    nmr -case ...
> >>
> >> In any case, it is running correctly.
> >>
> >> PS: I know that also the current step needs a lot of memory, after
> >> all it needs to read the eigenvectors of all eigenvalues, ...
> >>
> >> PPS:   -quota 8 (or 24)  might help while still utilizing all cores,
> >> but I'm not sure if it would save enough memory in the current steps.
> >>
> >>
> >>
> >> On 12.05.2024 at 10:09, Michael Fechtelkord via Wien wrote:
> >>> Hello all, hello Peter,
> >>>
> >>>
> >>> That is what is really running in the background (from htop: this is
> >>> a new job with 4 nodes but it was the same with 8 nodes -p 1 - 8),
> >>> so no nmr_mpi.
> >>>
> >>>
> >>> CPU% MEM% TIME+ Command
> >>>
> >>> 96.0 14.9 19h06:05 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>> current -green -scratch /scratch/WIEN2k/ -noco -p 3
> >>>
> >>> 95.8 14.9 19h05:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>> current -green -scratch /scratch/WIEN2k/ -noco -p 1
> >>>
> >>> 95.1 14.9 19h06:00 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>> current -green -scratch /scratch/WIEN2k/ -noco -p 2
> >>>
> >>> 95.5 15.4 19h08:10 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>> current -green -scratch /scratch/WIEN2k/ -noco -p 4
> >>>
> >>> 94.6 14.9 18h35:33 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>> current -green -scratch /scratch/WIEN2k/ -noco -p 3
> >>>
> >>> 93.3 15.4 18h36:24 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>> current -green -scratch /scratch/WIEN2k/ -noco -p 4
> >>>
> >>> 93.3 14.9 18h33:02 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>> current -green -scratch /scratch/WIEN2k/ -noco -p 2
> >>>
> >>> 94.0 14.9 18h38:44 /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>> current -green -scratch /scratch/WIEN2k/ -noco -p 1
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Michael
> >>>
> >>>
> >>> On 11.05.2024 at 20:10, Michael Fechtelkord via Wien wrote:
> >>>> Hello Peter,
> >>>>
> >>>>
> >>>> I just use "x_nmr_lapw -p" and the rest is initiated by the nmr
> >>>> script. The line "/usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode
> >>>> current -green         -scratch /scratch/WIEN2k/ -noco " is just
> >>>> part of the whole procedure and was not initiated by me manually (I
> >>>> only copied the last lines of the calculation).
> >>>>
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Michael
> >>>>
> >>>>
> >>>> On 11.05.2024 at 18:08, Peter Blaha wrote:
> >>>>> Hello Michael,
> >>>>>
> >>>>> I don't understand the line:
> >>>>>
> >>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode current
> >>>>> -green         -scratch /scratch/WIEN2k/ -noco
> >>>>>
> >>>>> The current mode should run only k-parallel, not with mpi ??
> >>>>>
> >>>>> PS: The repetition of
> >>>>>
> >>>>> nmr_integ:localhost    is useless.
> >>>>>
> >>>>> nmr mode integ runs only once (not k-parallel, sumpara has already
> >>>>> summed up the currents)
> >>>>>
> >>>>> But one can use       nmr_integ:localhost:8
> >>>>>
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> On 11.05.2024 at 16:19, Michael Fechtelkord via Wien wrote:
> >>>>>> Hello Peter,
> >>>>>>
> >>>>>> this is the .machines file content:
> >>>>>>
> >>>>>> granularity:1
> >>>>>> omp_lapw0:8
> >>>>>> omp_global:2
> >>>>>> 1:localhost
> >>>>>> 1:localhost
> >>>>>> 1:localhost
> >>>>>> 1:localhost
> >>>>>> 1:localhost
> >>>>>> 1:localhost
> >>>>>> 1:localhost
> >>>>>> 1:localhost
> >>>>>> nmr_integ:localhost
> >>>>>> nmr_integ:localhost
> >>>>>> nmr_integ:localhost
> >>>>>> nmr_integ:localhost
> >>>>>> nmr_integ:localhost
> >>>>>> nmr_integ:localhost
> >>>>>> nmr_integ:localhost
> >>>>>> nmr_integ:localhost
> >>>>>>
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Michael
> >>>>>>
> >>>>>>
> >>>>>> On 11.05.2024 at 14:58, Peter Blaha wrote:
> >>>>>>> Hmm. ?
> >>>>>>>
> >>>>>>> Are you using   k-parallel  AND  mpi-parallel ??  This could
> >>>>>>> overload the machine.
> >>>>>>>
> >>>>>>> What does the .machines file look like ?
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10.05.2024 at 18:15, Michael Fechtelkord via Wien wrote:
> >>>>>>>> Dear all,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I have the following problem with the NMR part of WIEN2k
> >>>>>>>> (23.2) on an openSUSE Leap 15.5 Intel platform. WIEN2k was
> >>>>>>>> compiled using oneAPI 2024.1 ifort and gcc 13.2.1. I am using
> >>>>>>>> ELPA 2024.03.01, Libxc 6.22, fftw 3.3.10 and MPICH 4.2.1 and
> >>>>>>>> the oneAPI 2024.1 MKL libraries. The CPU is an i9-14900K with
> >>>>>>>> 24 cores, of which I use eight for the calculations. The RAM
> >>>>>>>> is 130 GB, with a swap file of 16 GB on a Samsung PCIe 4.0
> >>>>>>>> NVMe SSD. The memory bus runs at 5600 MT/s.
> >>>>>>>>
> >>>>>>>> The structure is a layer silicate, and to simulate the ratio
> >>>>>>>> Si:Al = 3:1 I currently use a 1:1:2 supercell. The new
> >>>>>>>> structure has monoclinic symmetry P 2/c (the original is
> >>>>>>>> C 2/c) and contains 40 atoms (K, Al, Si, O, and F).
> >>>>>>>>
> >>>>>>>> I use 3 NMR LOs for K and O and 10 for Si, Al, and F (where I
> >>>>>>>> need the chemical shifts). The k mesh is 40k points.
> >>>>>>>>
> >>>>>>>> The interesting thing is that the RAM is sufficient during the
> >>>>>>>> NMR vector calculations (always under 100 GB of RAM occupied)
> >>>>>>>> and at the beginning of the electron current calculation.
> >>>>>>>> However, the RAM use then increases to a critical point, and
> >>>>>>>> more and more data is moved to the swap file, which is
> >>>>>>>> sometimes 80% occupied.
> >>>>>>>>
> >>>>>>>> As you can see, this time only one core failed because of
> >>>>>>>> memory overflow. But with 48k points 3 cores crashed, and with
> >>>>>>>> them the whole current calculation. The reason for the crash
> >>>>>>>> is clear to me. But I do not understand why the current
> >>>>>>>> calculation is so sensitive with so few atoms and a small
> >>>>>>>> k mesh. I have done calculations with more atoms and a 1000k
> >>>>>>>> point mesh on 4 cores, and they worked fine. So can it be that
> >>>>>>>> the Intel MKL library is the source of the failure? Should I
> >>>>>>>> go back to 4 cores, even with longer calculation times?
> >>>>>>>>
> >>>>>>>> Have a nice weekend, everyone!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Best wishes from
> >>>>>>>>
> >>>>>>>> Michael Fechtelkord
> >>>>>>>>
> >>>>>>>> -----------------------------------------------
> >>>>>>>>
> >>>>>>>> cd ./  ...  x lcore  -f MS_2M1_Al2
> >>>>>>>>  CORE  END
> >>>>>>>> 0.685u 0.028s 0:00.71 98.5%     0+0k 2336+16168io 5pf+0w
> >>>>>>>>
> >>>>>>>> lcore        ....  ready
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2
> >>>>>>>> -mode current    -green         -scratch /scratch/WIEN2k/ -noco
> >>>>>>>>
> >>>>>>>> [1] 20253
> >>>>>>>> [2] 20257
> >>>>>>>> [3] 20261
> >>>>>>>> [4] 20265
> >>>>>>>> [5] 20269
> >>>>>>>> [6] 20273
> >>>>>>>> [7] 20277
> >>>>>>>> [8] 20281
> >>>>>>>> [8]  + Aborted                       ( cd $dir; $exec2 >>
> >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
> >>>>>>>> [7]  + Done                          ( cd $dir; $exec2 >>
> >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
> >>>>>>>> [6]  + Done                          ( cd $dir; $exec2 >>
> >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
> >>>>>>>> [5]  + Done                          ( cd $dir; $exec2 >>
> >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
> >>>>>>>> [4]  + Done                          ( cd $dir; $exec2 >>
> >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
> >>>>>>>> [3]  + Done                          ( cd $dir; $exec2 >>
> >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
> >>>>>>>> [2]  + Done                          ( cd $dir; $exec2 >>
> >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
> >>>>>>>> [1]  + Done                          ( cd $dir; $exec2 >>
> >>>>>>>> nmr.out.${loop} ) >& nmr.err.$loop
> >>>>>>>>
> >>>>>>>>  EXECUTING:     /usr/local/WIEN2k/nmr -case MS_2M1_Al2 -mode
> >>>>>>>> sumpara  -p 8    -green -scratch /scratch/WIEN2k/
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> current        ....  ready
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>  EXECUTING:     mpirun -np 1 -machinefile .machine_nmrinteg
> >>>>>>>> /usr/local/WIEN2k/nmr_mpi -case MS_2M1_Al2 -mode integ -green
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> nmr:  integration  ... done in   4032.3s
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> stop
> >>>>>>>>
> >>
> > _______________________________________________
> > Wien mailing list
> > Wien at zeus.theochem.tuwien.ac.at
> > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > SEARCH the MAILING-LIST at:
> > http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
> --
> Dr. Michael Fechtelkord
>
> Institut für Geologie, Mineralogie und Geophysik
> Ruhr-Universität Bochum
> Universitätsstr. 150
> D-44780 Bochum
>
> Phone: +49 (234) 32-24380
> Fax:  +49 (234) 32-04380
> Email: Michael.Fechtelkord at ruhr-uni-bochum.de
> Web Page:
> https://www.ruhr-uni-bochum.de/kristallographie/kc/mitarbeiter/fechtelkord/
>


-- 
Professor Laurence Marks (Laurie)
Northwestern University
Webpage <http://www.numis.northwestern.edu> and Google Scholar link
<http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en>
"Research is to see what everybody else has seen, and to think what nobody
else has thought", Albert Szent-Györgyi