[Wien] weird parallel problem
Stefaan Cottenier
Stefaan.Cottenier at fys.kuleuven.be
Thu Nov 16 16:10:44 CET 2006
Dear wien2k colleagues,
We are plagued by a weird problem when running k-point parallel cases on
a pc cluster. Serial and parallel runs of the same case give different
results (5 to 40 mRy/atom in total energy, up to 1 Bohr magneton in
orbital moment, one quarter of a Bohr magneton difference in spin
moment, up to 1x10^21 difference in EFG -- this is far beyond numerical
noise). We could trace the problem back to the following test:
1) Run a simple test case (including -orb and -so) for 1 iteration, once
serial, once k-point 'parallel' over 2 pc's which are both the same
machine as the one from which the job got launched, and once k-point
parallel on 2 different pc's. After this single iteration, there are NO
differences in any output.
2) Continue with lapw0 and lapw1 of the second iteration. Still NO
differences.
3) From the command line, give these commands:
serial case:
    x lapwso
parallel on a single machine:
    ssh machine1 "cd /mydir; lapwso lapwso_1.def"
    ssh machine1 "cd /mydir; lapwso lapwso_2.def" <===
parallel on 2 machines:
    ssh machine1 "cd /mydir; lapwso lapwso_1.def"
    ssh machine2 "cd /mydir; lapwso lapwso_2.def" <===
Now, the files delivered by lapwso are identical in content for the
serial and dummy parallel case, but different for the truly parallel
case. In particular, the files from the command that has run on machine2
differ from the corresponding files that have run on machine1 (marked by
<=== above). For instance, the eigenvalues printed in case.scfso_2 are
somewhat different. The case.energysoup_2 files also differ, etc.
If we delete the energysoup/dn files and continue the iteration, then
the two cases again yield nearly the same results.
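
A comparison like the one in step 3 can be scripted. The following is
only a minimal sketch, assuming the output of each variant has been
copied into its own directory; the function name compare_runs and the
directory names are hypothetical and not part of wien2k. It reports
every file that is missing or not byte-identical.

```shell
#!/bin/sh
# Hypothetical helper (not part of wien2k): list files that differ
# between two copies of a case directory.
# Usage: compare_runs DIR_A DIR_B
compare_runs() {
    dir_a=$1
    dir_b=$2
    status=0
    for f in "$dir_a"/*; do
        [ -f "$f" ] || continue            # skip subdirectories
        name=$(basename "$f")
        if [ ! -f "$dir_b/$name" ]; then
            echo "MISSING $name"           # present in A, absent in B
            status=1
        elif ! cmp -s "$f" "$dir_b/$name"; then
            echo "DIFFERS $name"           # byte-wise comparison failed
            status=1
        fi
    done
    return $status
}
```

For example, compare_runs run_machine1 run_machine2 would print one
line per file such as case.scfso_2 that differs. A byte-wise
comparison also flags harmless differences (time stamps, host names
written into a file), so each hit still needs to be inspected by hand.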
Of course we thought immediately about run time library problems. But
the problem remains if we compile everything as static. We have verified
that the path and $WIENROOT are the same on all machines, and 'env'
also gives identical output on both machines (tested with: ssh machine1
"env > outputfile"). Some other observations:
* The problem disappears if we compile with ifc7.1. All higher versions up
to ifort9.1 produce the problem. It does not depend on the mkl version
(7.2 up to 8.1).
* If we replace lapwso, lapw2c and sumpara of a 'wrong' version by the
corresponding executables made by ifc7.1, then the problem disappears
as well.
* The problem is less severe if -orb is not used. If spin-orbit coupling
is dropped, the differences become even smaller, but they do not vanish
(sumpara is the only culprit here).
* All these tests were done with the latest wien2k version (6.4).
* We carefully followed the compilation instructions by Gerhard Fecher.
* The pc's of the cluster are common Pentiums, OS is Suse 9.2
* This reminded us of the 'sleepy NFS bug', but the test as reported in
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2006-March/006864.html
gave the correct result in our case.
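
The environment check mentioned before this list (ssh machine1
"env > outputfile") can be made order-independent. This is only a
sketch under assumptions: the function name env_diff and the dump file
names are hypothetical, and variables with multi-line values would
need extra care.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved 'env' dumps, ignoring the
# order of the lines. Prints the lines that differ and returns 0 only
# if both dumps contain exactly the same lines.
env_diff() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"
    diff "$a_sorted" "$b_sorted"
    rc=$?
    rm -f "$a_sorted" "$b_sorted"
    return $rc
}

# Example use (machine names as in the commands of step 3):
#   ssh machine1 "env" > env.machine1
#   ssh machine2 "env" > env.machine2
#   env_diff env.machine1 env.machine2
```

Variables set by ssh itself (for instance SSH_CONNECTION) are expected
to differ between hosts and can be ignored when reading the output.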
This is highly puzzling for us. We cannot understand how a static (!)
executable behaves differently depending on whether you run it by
ssh to yourself, or by ssh to a different machine, both machines having
the same environment... Perhaps another variant of the NFS bug? The
problem is hard to notice, as no crashes, warnings or illegal output
appear -- we are afraid it has affected all our parallel calculations
of the last few months.
Does anyone have a suggestion for further tests or for a solution other
than going back to ifc7.1? We are out of ideas...
Thanks,
Stefaan