[Wien] Wien post from pascal.boulet at univ-amu.fr (Errors with DGEMM and hybrid calculations)
Tran, Fabien
fabien.tran at tuwien.ac.at
Sat Aug 15 11:40:13 CEST 2020
Hi,
For calculations on small cells with many k-points it is preferable (for speed) to use k-point parallelization instead of MPI parallelization. And, as mentioned by PB, MPI applied to small matrices may not work.
Of course, if the number of cores at your disposal is larger (twice as large, for instance) than the number of k-points in the IBZ, then you can combine the k-point and MPI parallelizations (two cores for each k-point; a sketch for that case is given after the example below).
An example of a .machines file for k-point parallelization (supposing that you have 6 k-points in the IBZ and want to use one machine with 6 cores) is:
lapw0: n1071 n1071 n1071 n1071 n1071 n1071
dstart: n1071 n1071 n1071 n1071 n1071 n1071
nlvdw: n1071 n1071 n1071 n1071 n1071 n1071
1:n1071
1:n1071
1:n1071
1:n1071
1:n1071
1:n1071
granularity:1
extrafine:1
The six lines "1:n1071" mean that lapw1, lapw2 and hf are k-point (not MPI) parallelized (one line for each core), while lapw0, dstart and nlvdw are MPI parallelized. In this example, the OMP parallelization is ignored by supposing that OMP_NUM_THREADS is set to 1.
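For the combined k-point + MPI case mentioned above, a rough sketch (assuming the same single machine n1071, now with 12 cores, still 6 k-points in the IBZ and OMP_NUM_THREADS=1) could be:
lapw0: n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071
dstart: n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071
nlvdw: n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071 n1071
1:n1071:2
1:n1071:2
1:n1071:2
1:n1071:2
1:n1071:2
1:n1071:2
granularity:1
extrafine:1
Here each "1:n1071:2" line runs one k-point job MPI-parallel on two cores of n1071, while lapw0, dstart and nlvdw run MPI-parallel on all 12 cores.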
Besides, I find your computational time with HF very large. What is the number of atoms in the cell, the number of k-points (please specify the n1xn2xn3 k-mesh), RKmax, etc.?
Best,
FT
From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of pboulet <pascal.boulet at univ-amu.fr>
Sent: Saturday, August 15, 2020 10:35 AM
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] Wien post from pascal.boulet at univ-amu.fr (Errors with DGEMM and hybrid calculations)
Dear Peter,
Thank you for your response. It clarifies some points for me.
I have run another calculation for Mg2Si, which is a small system, on 12 cores (no MKL errors!). The job ran for 12 hours (the CPU time limit I set) and completed only 3 SCF cycles without converging.
The .machines file I use looks like this:
1:n1071:12
lapw0: n1071 n1071
dstart: n1071 n1071
nlvdw: n1071 n1071
granularity:1
extrafine:1
I guess I am not optimising the number of cores w.r.t. the size of the problem (72 k-points, 14 HF bands: 12 occupied + 2 unoccupied).
I changed the number of processors to 72, hoping for a 1 k-point/core parallelisation, and commented out all the lines of .machines except granularity and extrafine. I got less than 1 cycle in 12 hours.
What should I do to run the HF part with k-point parallelisation only (no MPI)? This point is not clear to me from the manual.
Thank you
Best regards
Pascal
Pascal Boulet
—
Professor in computational materials - DEPARTMENT OF CHEMISTRY
University of Aix-Marseille - Avenue Escadrille Normandie Niemen - F-13013 Marseille - FRANCE
Tel: +33(0)4 13 55 18 10 - Fax: +33(0)4 13 55 18 50
Email : pascal.boulet at univ-amu.fr
On 12 August 2020 at 14:12, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:
Your message is too big to be accepted.
Anyway, the DGEMM messages seem to be a relic of the MKL you are using and are most likely related to the use of too many MPI cores for such a small matrix. At least when I continue your Mg2Si calculations (in k-point parallel mode) the :DIS and :ENE are continuous, which means that the previous results are ok.
Concerning hf, I don't know. Again, running this in sequential (k-point parallel) mode gives no problems and converges quickly.
I suggest that you change your setup to a k-parallel run for such small systems.
Best regards
Peter Blaha
---------------------------------------------------------------------
Subject: Errors with DGEMM and hybrid calculations
From: pboulet <pascal.boulet at univ-amu.fr>
Date: 8/11/20, 6:31 PM
To: A Mailing list for WIEN2k users <wien at zeus.theochem.tuwien.ac.at>
Dear all,
I have a strange problem with LAPACK. I get an error message about wrong parameters sent to DGEMM, but WIEN2k (19.2) still seems to converge the SCF. Is that possible? What could be the "problem"?
I have attached an archive containing the summary of the SCF + compilation options + SLURM output file. The error message is in the dayfile.
The same error shows up with Wien2k 18.1.
Actually, this is a test case for hybrid calculations, as I have problems with my real system, which is found to be metallic with PBE. At least Mg2Si is a small band-gap semiconductor.
When I go ahead with the HSE06 functional and Mg2Si, I get a different error: a segmentation fault during the hf run. As Mg2Si is a small system, I guess this is not a memory problem: the node has 128 GB.
Note that the same segmentation fault occurs for my real case, but the LAPACK problem does not.
The second archive contains some files related to the hybrid calculation.
Some hints would be welcome, as I am completely lost with these (unrelated) errors!
Thank you.
Best regards,
Pascal
Pascal Boulet
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/TC_Blaha
--------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html