[Wien] Summary NFS bug (was: weird parallel problem)
Stefaan Cottenier
Stefaan.Cottenier at fys.kuleuven.be
Sun Dec 3 21:50:38 CET 2006
Dear wien2k colleagues,
A few weeks ago I asked for your help with a problem in parallel
execution (strange and inconsistent behaviour for k-point parallel
spin-orbit plus LDA+U with ifort 8 and 9). Thanks for all your suggestions.
It turned out that we were suffering from the NFS bug (and had not
noticed it before, for reasons you will see below).
Some of the lessons we learned are of general interest:
1) The fixes that were introduced in lapw1para and lapw2para to
remediate the NFS bug work perfectly well for normal LDA/GGA
calculations.
2) There are no such fixes for spin-orbit coupling and LDA+U. If your
computer suffers from the NFS bug, and if you compile with ifort 8 or
9, then k-point parallel runs involving spin-orbit will deviate
somewhat from the correct result, while spin-orbit plus LDA+U will
deviate severely. The calculations appear to proceed without problems
(no crashes or error messages), but the final result will not be what
you want.
3) Be aware that the problems might be very subtle. We had cases where
the total energy was identical in a correct and a bad run, but the EFGs
differed by a factor of 2.
4) The problem does not appear at all if you stick to ifc7.1 (combined
with mkl8.1 this gives a speed that is not bad).
5) We first upgraded to the 2.6.11.4-21.9-smp kernel (Suse 9.3), and
the problem did not disappear. We then tried the 2.6.16.13-4-smp kernel
(Suse 10.1), and there the problem was gone. Hence the bug was fixed
somewhere between these two versions.
6) How can you verify whether you have this NFS bug? (Older recipes on
the mailing list no longer apply, as the fixes mask the buggy
behaviour.)
* Try something with spin-orbit + LDA+U, to maximize the effect
(hypothetical bcc-U, for instance)
* Do a k-point parallel run on 2 machines, and compare the history of
:DIS with an identical serial run. If the differences exceed numerical
noise (i.e. they affect more than the last two digits), then you've got
it.
* If you converge the parallel run on two machines, and restart it on
two different machines, then from the second iteration on the :DIS
value is *very* high (0.xx; there is no guarantee that this is the
behaviour in all possible environments, but that's what we saw).
* If you still have an ifc7.1 around, you could check differences
between a serial run (any compiler), ifc7.1 (serial and parallel) and
ifort 8 or 9 (parallel). If you have the bug, the latter will be
different from any of the former.
7) For the AUTHORS: during the various tests, some minor things
appeared that might be considered in further updates:
@@@ In lapw2para:
Consider 'touch $case.weigh${updn}_1 (...)' and 'rm
$case.weigh${updn}_* (...)' in the NFS bug fix lines (this was mentioned
on the ML some months ago, but has not yet been introduced in the updates).
@@@ In SRC_aim/d1mach.c, an extra line
#include <stdlib.h>
could be useful (some C compilers fail without it).
@@@ Rather unrelated, but including vsp and vorb files in the
save_lapw and restore_lapw commands would be useful (if not, there are
circumstances in which an unnecessarily high :DIS appears in the first
iterations after a restore).
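
For the :DIS comparison described under point 6, here is a minimal Python
sketch (not part of WIEN2k). It assumes that every iteration writes one
line starting with ":DIS" to case.scf and that the charge distance is the
last number on that line; check this against your own scf files. The
function names and the 1% tolerance are illustrative choices, meant as a
rough stand-in for "more than the last two digits".

```python
def dis_history(scf_text):
    """Collect the :DIS value of every iteration from the text of a case.scf file."""
    values = []
    for line in scf_text.splitlines():
        if line.startswith(":DIS"):
            # Assumption: the charge distance is the last token on the line.
            values.append(float(line.split()[-1]))
    return values


def first_divergence(serial, parallel, rel_tol=1e-2):
    """Return the 1-based iteration where the two histories first differ
    by more than rel_tol (relative), or None if they agree throughout."""
    for i, (a, b) in enumerate(zip(serial, parallel), start=1):
        if abs(a - b) > rel_tol * max(abs(a), abs(b)):
            return i
    if len(serial) != len(parallel):
        # One run needed more iterations than the other: count that as a difference.
        return min(len(serial), len(parallel)) + 1
    return None
```

To use it, read the case.scf of the serial run and of the parallel run
into strings, pass each through dis_history, and give both lists to
first_divergence. A result of None means the two histories agree within
the chosen tolerance; a number is the first iteration where they differ.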
That's all...
Stefaan