[Wien] Summary NFS bug (was: weird parallel problem)

Sun Dec 3 21:50:38 CET 2006

Dear wien2k colleagues,

A few weeks ago I asked for your help with a problem in parallel
execution (strange and inconsistent behaviour for k-point parallel
spin-orbit plus LDA+U with ifort 8 and 9). Thanks for all your
suggestions. It turned out that we were suffering from the NFS bug
(and had not noticed it before, for reasons you will see below).

Some of the lessons we learned may be generally useful:

1) The fixes that were introduced in lapw1para and lapw2para to
work around the NFS bug work perfectly well for normal LDA/GGA
calculations.

2) There are no such fixes for spin-orbit coupling and LDA+U. If your
computer suffers from the NFS bug, and if you compile with ifort 8 or
9, then k-point parallel runs involving spin-orbit will deviate
somewhat from the correct result, while spin-orbit plus LDA+U will
deviate severely. The calculations appear to proceed without problems
(no crashes or error messages), but the final result will not be what
you want.

3) Be aware that the problems can be very subtle. We had cases where
the total energy was identical in a correct and a bad run, but the
EFGs differed by a factor of 2.

4) The problem does not appear at all if you stick to ifc7.1 (combined  
with mkl8.1 this gives a speed that is not bad).

5) We first upgraded to the 2.6.11.4-21.9-smp kernel (Suse 9.3), and
the problem did not disappear. Then we tried the 2.6.16.13-4-smp
kernel (Suse 10.1), and there the problem was gone. Hence the bug was
fixed somewhere between these two versions.

6) How can you verify whether you have this NFS bug? (Older recipes
on the mailing list no longer apply, as the fixes mask the buggy
behaviour.)

* Try something with spin-orbit + LDA+U, to maximize the effect  
(hypothetical bcc-U, for instance)
* Do a k-point parallel run on 2 machines, and compare the history of
:DIS with an identical serial run. If there are differences that
exceed numerical noise (i.e. in more than the last two digits), then
you've got it.
* If you converge the parallel run on two machines, and restart it on
two different machines, then from the second iteration on the :DIS
value is *very* high (0.xx). There is no guarantee that this is the
behaviour in all possible environments, but that's what we saw.
* If you still have an ifc7.1 around, you could check differences  
between a serial run (any compiler), ifc7.1 (serial and parallel) and  
ifort 8 or 9 (parallel). If you have the bug, the latter will be  
different from any of the former.
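
The :DIS comparison from the second check above can be automated. What
follows is only a rough sketch in Python, not part of WIEN2k. It
assumes that every line in the scf file that starts with ":DIS" ends
with the charge distance as its last number (check this against your
own files), and the function names and the tolerance rel_tol are made
up for illustration.

```python
def read_dis_history(lines):
    """Collect the :DIS value of every iteration, in file order."""
    history = []
    for line in lines:
        if line.startswith(":DIS"):
            # Assumption: the charge distance is the last token on the line.
            history.append(float(line.split()[-1]))
    return history


def compare_histories(serial, parallel, rel_tol=1e-4):
    """Return (iteration, serial, parallel) wherever the two runs disagree.

    rel_tol is a hypothetical threshold standing in for "numerical noise";
    tune it to the number of digits your :DIS values carry.
    """
    mismatches = []
    for i, (s, p) in enumerate(zip(serial, parallel), start=1):
        scale = max(abs(s), abs(p))
        if scale > 0 and abs(s - p) / scale > rel_tol:
            mismatches.append((i, s, p))
    return mismatches
```

To use it, open the scf files of the serial and the parallel run, pass
each file object to read_dis_history, and give the two lists to
compare_histories. An empty result means the two histories agree within
the chosen tolerance. Note that only the iterations both runs have in
common are compared.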

7) For the AUTHORS: during the various tests, some minor things
came up that might be considered for future updates:

@@@ in lapw2para:
Consider 'touch $case.weigh${updn}_1 (...)' and 'rm
$case.weigh${updn}_* (...)' in the NFS bug fix lines (this was
mentioned on the ML some months ago, but has not yet been introduced
in the updates)

@@@ In SRC_aim/d1mach.c, an extra line

#include <stdlib.h>

could be useful (some C compilers fail without it).

@@@ Rather unrelated, but including vsp and vorb files in the  
save_lapw and restore_lapw commands would be useful (if not, there are  
circumstances in which an unnecessarily high :DIS appears in the first  
iterations after a restore).

That's all...

Stefaan
