[Wien] Segmentation error - Memory allocation for each processor

Peter Blaha peter.blaha at tuwien.ac.at
Mon Jan 27 18:17:03 CET 2025


i) You have a 16-core processor, so more than 16 parallel processes are 
useless and will even reduce performance.

ii) You should really read the parallelization section in the user's guide.

iii) You have to learn what the syntax in .machines really means:

A single line
47:localhost:24
means you are running one mpi-parallel job on 24 cores (did you link 
with ELPA? Otherwise the mpi job is pretty slow). For your processor, 
you should use at most 16 cores.

The "47" has NOTHING to do here (nothing to do with k-points,...), but 
is a relative speed indicator when you would run k-parallel on 2 nodes 
of different speed. Leave it at 1 for you.
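
For illustration only (node1 and node2 are hypothetical hostnames): if 
node1 were about twice as fast as node2, a k-parallel .machines could 
weight them as

2:node1
1:node2

so that node1 receives roughly twice as many k-points.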

So the .machines file for your processor should look like:

1:localhost:16
omp_lapw1:1
omp_lapw2:1


iv) You have matrix size 248, which means you have to set up and solve a 
248x248 matrix. In mpi-parallel mode this is decomposed into 16 
sub-matrices (when you have 16 cores), each with dimensions 62x62. This 
is MUCH TOO SMALL to run efficiently in mpi mode.
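(A sketch of the arithmetic, assuming the usual square 4x4 processor 
grid of a ScaLAPACK-style distribution on 16 cores: each core holds a 
local block of roughly 248/4 x 248/4 = 62x62 elements, so communication 
overhead dominates the actual computation.)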
For such a case one should use a mixed k-point and OpenMP 
parallelization with a .machines file like:
1:localhost
1:localhost
1:localhost
1:localhost
omp_global:4

It will spawn 4 k-parallel jobs, and each one will use 4 OpenMP cores 
(4 x 4 = 16 cores in total, matching your processor).

This is also the best configuration for your bigger case, which with its 
matrix size of 968 is still too small for mpi.

The segmentation fault in your case has nothing to do with memory, as 
you can see from your output1 files (16 MB, ...). The crash happens in 
lapw2, probably due to overloading the processor or an mpi problem 
because the case is too small.

For 64- and 128-atom supercells, mpi parallelization may become useful.
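
A possible sketch for such a supercell on this 16-core machine 
(assuming lapw1/lapw2 are linked mpi-parallel; whether 2 k-parallel jobs 
with 8 mpi cores each or a single 16-core mpi job is faster has to be 
tested on the actual case):

1:localhost:8
1:localhost:8
omp_lapw1:1
omp_lapw2:1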

-------------
v) Checking your runs:
While a calculation is running, use:    top
It shows the memory usage and how many cores each job uses (with the 
example above you should see 4 lapw1 executables, each at 400% cpu 
usage).
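
For example (standard top options; pressing "1" inside top toggles the 
per-core display):

top -u $USER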

Also while running, view case.dayfile; you should see the cpu and wall 
time of each job step.
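
For example, to follow the dayfile live (FVA_1 is the case name from 
your runs; replace it with the actual one):

tail -f FVA_1.dayfile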


> using 1. Fe2VAl symmetry based structure (Fm-3m, 225, 3 equivalent 
> atomic positions), 2. Fe2VAl conventional unit cell (P1, with 16 
> non-equivalent atomic positions). I have used the following system 
> configuration for these calculations,

> ================   1. Fe2VAl symmetry based structure (Fm-3m, 225, 3 
> equivalent atomic positions) ================
> 
> runsp_lapw -p -ec 0.00001
> 
> Converged at 26th cycle
> 
> Total time taken for 26 cycles : 9 mins
> 
> 47 k-points
> 
> ++++  .machines (run at 24 processors)++++++
> 
> 47:localhost:24
> granularity:1
> extrafine:1
> 
> venkatesh@venkatesh-PC:~/wiendata/FVA$ grep "Matrix size" *output1* -A18
> 
> FVA.output1up_1: Matrix size          248
> FVA.output1up_1-Optimum Blocksize for setup**** Excess %  0.100D+03
> FVA.output1up_1-Optimum Blocksize for diag  22 Excess %  0.115D+02
> FVA.output1up_1-Base Blocksize   64 Diagonalization   32
> FVA.output1up_1-          allocate H         0.0 MB          dimensions    64    64
> FVA.output1up_1-          allocate S         0.0 MB          dimensions    64    64
> FVA.output1up_1-     allocate spanel         0.0 MB          dimensions    64    64
> FVA.output1up_1-     allocate hpanel         0.0 MB          dimensions    64    64
> FVA.output1up_1-   allocate spanelus         0.0 MB          dimensions    64    64
> FVA.output1up_1-       allocate slen         0.0 MB          dimensions    64    64
> FVA.output1up_1-         allocate x2         0.0 MB          dimensions    64    64
> FVA.output1up_1-   allocate legendre         0.4 MB          dimensions    64    13    64
> FVA.output1up_1-allocate al,bl (row)         0.0 MB          dimensions    64    11
> FVA.output1up_1-allocate al,bl (col)         0.0 MB          dimensions    64    11
> FVA.output1up_1-         allocate YL         0.0 MB          dimensions    15    64     2
> FVA.output1up_1- number of local orbitals, nlo (hamilt)       44
> FVA.output1up_1-       allocate YL           0.1 MB          dimensions    15   248     2
> FVA.output1up_1-       allocate phsc         0.0 MB          dimensions   248
> FVA.output1up_1-Time for al,bl    (hamilt, cpu/wall) :         0.00         0.00
> 
> 
> 
> ================   2. A. Fe2VAl conventional unit cell (P1, with 16 
> non-equivalent atomic positions) FAILED ================
> 
> runsp_lapw -p -ec 0.00001
> 
> 32 k-points
> 
> ++++  .machines (run at 24 processors)++++++
> 
> 
> 32:localhost:24
> granularity:1
> extrafine:1
> 
> +++++++++++++++++++
> in cycle 15    ETEST: .0071660350000000   CTEST: .2687245   STRTEST 2.59
>   LAPW0 END
>   LAPW1 END
> [2]  - Done                          ( cd $PWD; $t $exe ${def}_$loop.def; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>   LAPW1 END
> [1]    Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>   LAPW1 END
> [2]  - Done                          ( cd $PWD; $t $exe ${def}_$loop.def; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>   LAPW1 END
> [1]    Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> LAPW2 - FERMI; weights written
>   LAPW2 END
>   LAPW2 END
> [2]  + Done                          ( cd $PWD; $t $exe ${def}_${loop}.def $loop; rm -f .lock_$lockfile[$p] ) >> .time2_$loop
> [1]  + Done                          ( cd $PWD; $t $ttt $vector_split; rm -f .lock_$lockfile[$p] ) >> .time2_$loop
>   SUMPARA END
> LAPW2 - FERMI; weights written
> Segmentation fault
> 
> +++++++++++++++++++
> 
> FVA_1.output1up_1: Matrix size          968
> FVA_1.output1up_1-Optimum Blocksize for setup  82 Excess %  0.291D+01
> FVA_1.output1up_1-Optimum Blocksize for diag  18 Excess %  0.413D+01
> FVA_1.output1up_1-Base Blocksize   64 Diagonalization   32
> FVA_1.output1up_1-          allocate H         0.8 MB          dimensions   192   256
> FVA_1.output1up_1-          allocate S         0.8 MB          dimensions   192   256
> FVA_1.output1up_1-     allocate spanel         0.2 MB          dimensions   192    64
> FVA_1.output1up_1-     allocate hpanel         0.2 MB          dimensions   192    64
> FVA_1.output1up_1-   allocate spanelus         0.2 MB          dimensions   192    64
> FVA_1.output1up_1-       allocate slen         0.1 MB          dimensions   192    64
> FVA_1.output1up_1-         allocate x2         0.1 MB          dimensions   192    64
> FVA_1.output1up_1-   allocate legendre         1.2 MB          dimensions   192    13    64
> FVA_1.output1up_1-allocate al,bl (row)         0.1 MB          dimensions   192    11
> FVA_1.output1up_1-allocate al,bl (col)         0.0 MB          dimensions    64    11
> FVA_1.output1up_1-         allocate YL         0.0 MB          dimensions    15   192     1
> FVA_1.output1up_1- number of local orbitals, nlo (hamilt)      176
> FVA_1.output1up_1-       allocate YL           0.2 MB          dimensions    15   968     1
> FVA_1.output1up_1-       allocate phsc         0.0 MB          dimensions   968
> FVA_1.output1up_1-Time for al,bl    (hamilt, cpu/wall) :         0.00         0.00
> 
> 
> 
> ================   2. B. Fe2VAl conventional unit cell (P1, with 16 
> non-equivalent atomic positions) SUCCESSFULLY COMPLETED ================
> 
> runsp_lapw -p -ec 0.00001
> 
> Converged at 43rd cycle
> 
> Total time taken for 26 cycles :  61 mins
> 
> 32 k-points
> 
> ++++  .machines (run at 4 processors)++++++
> 
> 8:localhost
> 8:localhost
> 8:localhost
> 8:localhost
> granularity:1
> 
> +++++++++++++++++++++++
> 
> venkatesh@venkatesh-PC:~/wiendata/FVA_1$ grep "Matrix size" *output1* -A18
> 
> 
> 
> FVA_1.output1up_4: Matrix size          968
> FVA_1.output1up_4-         allocate HS        14.3 MB
> FVA_1.output1up_4-         allocate Z         14.3 MB
> FVA_1.output1up_4-     allocate spanel         1.9 MB          dimensions   968   128
> FVA_1.output1up_4-     allocate hpanel         1.9 MB          dimensions   968   128
> FVA_1.output1up_4-   allocate spanelus         1.9 MB          dimensions   968   128
> FVA_1.output1up_4-       allocate slen         0.9 MB          dimensions   968   128
> FVA_1.output1up_4-         allocate x2         0.9 MB          dimensions   968   128
> FVA_1.output1up_4-   allocate legendre        12.3 MB          dimensions   968    13   128
> FVA_1.output1up_4-allocate al,bl (row)         0.3 MB          dimensions   968    11
> FVA_1.output1up_4-allocate al,bl (col)         0.0 MB          dimensions   128    11
> FVA_1.output1up_4-         allocate YL         0.2 MB          dimensions    15   968     1
> FVA_1.output1up_4- number of local orbitals, nlo (hamilt)      176
> FVA_1.output1up_4-       allocate YL           0.2 MB          dimensions    15   968     1
> FVA_1.output1up_4-       allocate phsc         0.0 MB          dimensions   968
> FVA_1.output1up_4-Time for al,bl    (hamilt, cpu/wall) :         0.01         0.01
> FVA_1.output1up_4-Time for legendre (hamilt, cpu/wall) :         0.03         0.03
> FVA_1.output1up_4-Time for phase    (hamilt, cpu/wall) :         0.08         0.08
> FVA_1.output1up_4-Time for us       (hamilt, cpu/wall) :         0.11         0.12
> 
> 
> 
> I need a few clarifications on the memory used by each processor in 
> order to avoid the segmentation fault error (as shown in case 2.A).
> 
> 1. I got a segmentation error for the 16-atom calculation (2.A) with 
> 24 processors, and repeating the calculation with the same .machines 
> file sometimes leads to lapw1 hanging in a given cycle for more than 
> 50 minutes (I killed the process manually to stop the calculation). I 
> suspect this is because the memory allocated to each processor is not 
> sufficient while the calculation is running. However, I am using 128 GB 
> of RAM, so why is the memory not properly allocated in this case? Can I 
> get any clue from the Matrix size details shown for case 2.A?
> 
> 
> 2. Now, as shown in case 2.B, a change in the .machines file worked 
> without any segmentation error using only 4 processors. By comparing 
> the Matrix size details of the 3 cases (1, 2.A & 2.B), can someone 
> suggest how I can tune the .machines file so that each processor gets 
> more memory and I can use more cores to speed up the calculation?
> 
> 
> 3. My goal is to run the calculations for 64- and 128-atom 
> conventional unit cells (even if they take more time) without 
> segmentation errors. Therefore, I need clarification on how to increase 
> the memory allocated to each processor using the 128 GB of RAM 
> available on my PC. Please suggest how to improve the memory allocation 
> per processor so that I can run calculations for bigger unit cells.
> 
> 
> Thanks in advance for your help, and let me know if you need any 
> further information on the details of the calculations.
> 
> Regards,
> Venkat
> Physics Department,
> IISc Bangalore, India.
> 
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html

-- 
-----------------------------------------------------------------------
Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.blaha at tuwien.ac.at
WWW:   http://www.imc.tuwien.ac.at      WIEN2k: http://www.wien2k.at
-------------------------------------------------------------------------


