[Wien] lapw2 QTL-B crash with MPI, but not with k-parallel

Johan Eriksson joher at ifm.liu.se
Tue Jun 17 12:12:00 CEST 2008


Thank you for the tips.
I changed to MKL 10.0.3 and the threading problem disappeared. Running 
with 8 threads is no longer a problem.
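For reference, this is roughly how the threaded serial benchmark is run 
here (a minimal sketch; the bash 'export' syntax is an assumption, csh 
users would use 'setenv', and MKL_NUM_THREADS could be set instead of 
OMP_NUM_THREADS):

   # choose the number of MKL/OpenMP threads for the serial benchmark
   # (assumes bash; under csh: setenv OMP_NUM_THREADS 8)
   export OMP_NUM_THREADS=8
   # serial complex lapw1 benchmark
   x lapw1 -c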
Regarding the QTL-B errors: I tried running with MPI parallelization but 
without the iterative diagonalization, and this seems to work; no QTL-B 
errors appear. Another (smaller) case I tried ran without any problem 
using MPI + iterative diagonalization.
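In terms of commands, the only difference between the two tests above is 
the '-it' switch on the run script (a sketch; run_lapw itself is not 
mentioned above, and a .machines file already set up for MPI is assumed):

   # (assumes run_lapw and a .machines file prepared for MPI)
   # full MPI run without iterative diagonalization: no QTL-B errors
   run_lapw -p
   # the same run with iterative diagonalization: this is what gave
   # the QTL-B errors in the larger case
   run_lapw -p -it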
From a code design point of view, should there be any difference 
between lapwX and lapwX_mpi, due to different parameters such as NMATMAX, 
that could give rise to these QTL-B errors? I should mention that 
NMATMAX is not the limiting factor in my case.
You mention that the iterative scheme uses a subset of eigenvectors from 
a previous iteration. Is this subset of old eigenvectors smaller or 
different when using MPI compared to k-point parallelization?

To be on the safe side I will run using MPI but without the '-it' switch 
in the future.

/Johan

Laurence Marks wrote:
> This is one of many issues.
>
> 1) For mkl 10 make sure that you are using version 10.0.3, the earlier
> versions of 10.X had some bugs.
>
> 2) Make sure that you do not have a problem in your network software.
> I have a new cluster on which the "official" version of mvapich was
> installed, and this had a scalapack bug. Their current version (via
> their equivalent of cvs) works well. In your case, check the openmpi
> webpage.
>
> 3) For mkl 10 there are some issues with the size of buffer arrays; in
> essence, unless one uses sizes at least as large as those the Intel code
> "likes" (obtained via a workspace query call), problems can occur. I
> think this is an Intel bug; they probably call it a "feature". While this
> is probably not a problem for real cases (because of some code changes)
> and non-iterative calculations, it may still be present in the current
> version on the web for complex iterative cases.
>
> 4) In the iterative versions only a subset of the eigenvectors from a
> previous iteration is used. If the space of these old eigenvectors
> does not include a good approximation to a new eigenvalue you may get
> ghost bands (QTL-B errors). One workaround is to use more old
> eigenvectors, i.e. increase nband at the bottom of case.in1 or
> case.in1c.
>
> 5) If 4) does not work (it does not always help), consider using LAPW
> for some of the states. For instance, with relatively large RMTs (2.0)
> for d-electron transition elements (e.g. Ni), switching to LAPW rather
> than APW+lo for the d states stabilized the iterative mode in some
> calculations.
>
> On Fri, Jun 13, 2008 at 2:47 AM, Johan Eriksson <joher at ifm.liu.se> wrote:
>   
>> Dear Wien community,
>> I'm running the latest Wien2k release on a linux cluster (IFORT 10.1,
>> cmkl 9.1, openmpi 1.2.5).
>> The cases run fine with k-point parallelization + MPI lapw0.
>> However, since there are many more CPUs than k-points and the cluster
>> has infiniband interconnects, I want to use full MPI parallelization.
>> First I ran my case with k-point parallelization for a few cycles,
>> stopped, ran clean_lapw and then switched to MPI. After a few
>> iterations I started getting QTL-B warnings and it crashed. If I
>> switch back to k-point parallelization it runs just fine again.
>> What am I doing wrong here? Could it be that I'm using the iterative
>> diagonalization scheme (-it switch)? Should I try some other mkl or MPI
>> implementation?
>>
>> Also, why is it that the serial benchmark 'x lapw1 -c' is so unstable
>> with mkl 10 when using OMP_NUM_THREADS>=4? With cmkl 9.1 it works fine
>> with 1, 2, 4 and 8 threads. When mkl 10 works, however, it is faster
>> than cmkl 9.1.
>>
>>
>>
>> /Johan Eriksson
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>


