[Wien] lapw2 QTL-B crash with MPI, but not with k-parallel

Laurence Marks L-marks at northwestern.edu
Fri Jun 13 15:09:54 CEST 2008


There are several possible issues here.

1) For mkl 10, make sure that you are using version 10.0.3; the earlier
versions of 10.x had some bugs.

2) Make sure that you do not have a problem in your network software.
I have a new cluster on which the "official" version of mvapich was
installed, and this had a scalapack bug. Their current version (via
their equivalent of cvs) works well. In your case, check the openmpi
webpage.

3) For mkl 10 there are some issues with the size of buffer arrays; in
essence, unless one uses buffer sizes at least as large as those the
Intel code "likes" (obtained via a workspace query call), problems can
occur. I think this is an Intel bug; they would probably call it a
"feature". While this is probably not a problem for real cases (because
of some code changes) or for non-iterative calculations, it may still be
present in the version currently on the web for complex iterative cases.
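
In case it helps, here is a minimal sketch of what such a workspace query
looks like with ZHEEV from LAPACK/MKL. This is only an illustration of the
idea, not the actual lapw1 code; the matrix and its size are made up:

   program wsquery
   ! Sketch: ask the library for its preferred workspace size first,
   ! then allocate at least that much before the real diagonalization.
   implicit none
   integer, parameter :: n = 100
   complex*16, allocatable :: a(:,:), work(:)
   double precision, allocatable :: w(:), rwork(:)
   complex*16 :: wq(1)
   integer :: lwork, info

   allocate(a(n,n), w(n), rwork(3*n-2))
   a = (0.d0, 0.d0)               ! a real Hamiltonian would be filled in here
   ! workspace query: lwork = -1 makes zheev return its preferred size in wq(1)
   call zheev('V', 'U', n, a, n, w, wq, -1, rwork, info)
   lwork = max(int(wq(1)), 2*n-1) ! never use less than what the library asks for
   allocate(work(lwork))
   call zheev('V', 'U', n, a, n, w, work, lwork, rwork, info)
   print *, 'info =', info, '  lwork used =', lwork
   end program wsquery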

4) In the iterative versions only a subset of the eigenvectors from a
previous iteration are used. If the space spanned by these old
eigenvectors does not include a good approximation to a new eigenvector,
you may get ghost bands (QTL-B errors). One workaround is to use more
old eigenvectors, i.e. increase nband at the bottom of case.in1 or
case.in1c.

5) If 4) does not work (it does not always), consider using LAPW for
some of the states. For instance, with relatively large RMTs (2.0) for
d-electron transition elements (e.g. Ni), switching to LAPW rather than
APW+lo for the d's stabilized the iterative mode for some calculations.
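
To make 5) concrete: in case.in1 the last integer of an l-exception line
selects the basis for that channel, 1 for APW+lo and 0 for LAPW (the energy
parameters below are just placeholders). Changing, for the atom in question,

   2    0.30      0.010 CONT 1
to
   2    0.30      0.010 CONT 0

switches the l=2 (d) channel from APW+lo to LAPW.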

On Fri, Jun 13, 2008 at 2:47 AM, Johan Eriksson <joher at ifm.liu.se> wrote:
> Dear Wien community,
> I'm running the latest Wien2k release on a linux cluster (IFORT 10.1,
> cmkl 9.1, openmpi 1.2.5).
> The cases are running fine with k-point parallelization + MPI lapw0.
> However, since there are many more cpus than k-points and infiniband
> interconnects I want to use full MPI parallelization. First I ran my
> case with k-point parallel for a few cycles, stopped, ran clean_lapw and
> then switched to MPI. After a few iterations I started getting QTL-B
> warnings and it crashed. If I switch back to k-point parallel it runs just
> fine again.
> What am I doing wrong here? Could it be that I'm using the iterative
> diagonalization scheme (-it switch)? Should I try some other mkl or MPI
> implementation?
>
> Also, why is it that the serial benchmark 'x lapw1 -c' is so unstable
> with mkl 10 when using OMP_NUM_THREADS>=4? With cmkl 9.1 it works fine
> with 1, 2, 4 and 8 threads. When mkl 10 works it is, however, faster than
> cmkl 9.1.
>
>
>
> /Johan Eriksson
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Commission on Electron Diffraction of IUCR
www.numis.northwestern.edu/IUCR_CED

