[Wien] weird parallel problem

Stefaan Cottenier Stefaan.Cottenier at fys.kuleuven.be
Thu Nov 16 16:10:44 CET 2006


Dear wien2k colleagues,

We are plagued by a weird problem when running k-point parallel cases on 
a pc cluster. Serial and parallel runs of the same case give different 
results (5 to 40 mRy/atom in total energy, up to 1 Bohr magneton in 
orbital moment, a quarter of a Bohr magneton in spin moment, up to 1e21 
in EFG -- this is far beyond numerical noise). We have traced the 
problem back to the following test:

1) Run a simple test case (including -orb and -so) for 1 iteration: once 
serial, once k-point 'parallel' over 2 pc's that are both the very 
machine from which the job was launched, and once k-point parallel on 2 
different pc's. After this single iteration, there are NO differences in 
any output.

2) Continue with lapw0 and lapw1 of the second iteration. Still NO 
differences.

3) From the command line, give these commands:

serial case: x lapwso
parallel on a single machine:
      ssh machine1 "cd /mydir; lapwso lapwso_1.def"
      ssh machine1 "cd /mydir; lapwso lapwso_2.def"  <===
parallel on 2 machines:
      ssh machine1 "cd /mydir; lapwso lapwso_1.def"
      ssh machine2 "cd /mydir; lapwso lapwso_2.def"  <===

Now, the files delivered by lapwso are identical in content for the 
serial and dummy parallel case, but different for the truly parallel 
case. In particular, the files from the command that ran on machine2 
differ from the corresponding files produced on machine1 (marked by 
<=== above). For instance, the eigenvalues printed in case.scfso_2 are 
somewhat different. The case.energysoup_2 files are different as well, 
etc. If we delete the energysoup/dn files and continue the iteration, 
then the two cases again yield nearly the same results.
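
A comparison like the one in step 3 can be automated. The following is 
only a minimal sketch, not part of wien2k: it pulls every number out of 
two text output files and reports the largest absolute difference. The 
function names are hypothetical, it assumes both files have the same 
layout, and it does not handle Fortran 'D' exponents.

```python
import re
from pathlib import Path

# Plain decimals and E-notation numbers; Fortran 'D' exponents are NOT handled.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def numbers_in(text):
    """Return every number found in the text, in order of appearance."""
    return [float(token) for token in _NUMBER.findall(text)]


def max_abs_diff(text_a, text_b):
    """Largest absolute difference between the numbers of two texts.

    Assumes both texts list the same quantities in the same order;
    raises ValueError if the number counts differ.
    """
    a = numbers_in(text_a)
    b = numbers_in(text_b)
    if len(a) != len(b):
        raise ValueError(f"different layout: {len(a)} vs {len(b)} numbers")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def compare_files(path_a, path_b):
    """Convenience wrapper that reads two files and compares them."""
    return max_abs_diff(Path(path_a).read_text(), Path(path_b).read_text())
```

With copies of the two runs kept in two directories (say dummy/ and 
truly/, both hypothetical names), compare_files("dummy/case.scfso_2", 
"truly/case.scfso_2") returns 0.0 only if every printed number agrees.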

Of course we immediately thought of run time library problems. But the 
problem remains if we compile everything statically. We have verified 
that the path and $WIENROOT are the same on all machines, and 'env' also 
gives identical results on both machines (tested with: ssh machine1 
"env > outputfile"). Some other observations:
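
The environment comparison can also be done variable by variable rather 
than by eye. The sketch below assumes two dumps produced as in the ssh 
command above; the function names and the list of variables that are 
allowed to differ are our own assumptions, not something wien2k provides.

```python
# Variables that legitimately differ between hosts or ssh sessions.
# This list is an assumption; adjust it for the cluster at hand.
EXPECTED_TO_DIFFER = {"HOSTNAME", "SSH_CLIENT", "SSH_CONNECTION", "SSH_TTY"}


def parse_env(dump):
    """Turn the text printed by 'env' into a dict.

    Lines without '=' (e.g. continuation lines of multi-line values)
    are skipped, so such values are only compared on their first line.
    """
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            env[name] = value
    return env


def env_differences(dump_a, dump_b, ignore=EXPECTED_TO_DIFFER):
    """Map each differing variable to its (value_a, value_b) pair.

    A variable missing from one dump is reported with None on that side.
    """
    a = parse_env(dump_a)
    b = parse_env(dump_b)
    diffs = {}
    for name in sorted(set(a) | set(b)):
        if name in ignore:
            continue
        if a.get(name) != b.get(name):
            diffs[name] = (a.get(name), b.get(name))
    return diffs
```

An empty dict returned for the two dumps means that, apart from the 
ignored variables, both machines present the same environment to the 
job.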

* The problem disappears if we compile by ifc7.1. All higher versions up 
to ifort9.1 produce the problem. It does not depend on the mkl version 
(7.2 up to 8.1).
* If we replace lapwso, lapw2c and sumpara of a 'wrong' version by the 
corresponding executables made by ifc7.1, then the problem disappears 
as well.
* The problem is less severe if -orb is not used. If spin-orbit coupling 
is dropped, differences become even smaller, but are still not absent 
(sumpara is the only culprit here).
* All these tests were done with the latest wien2k version (6.4).
* We carefully followed the compilation instructions by Gerhard Fecher.
* The pc's of the cluster are common Pentiums, OS is Suse 9.2
* This reminded us of the 'sleepy NFS bug', but the test as reported in 
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2006-March/006864.html 
gave the correct result in our case.
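
A related check, which is NOT the test from the link above but an 
assumption on our side, is to confirm that each machine sees 
byte-identical executables and input files over the shared file system. 
A minimal sketch with hypothetical function names: run digest_table on 
each machine for the same list of files, then compare the two tables.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_table(paths):
    """Map each path to its digest, as seen from the machine running this."""
    return {str(path): sha256_of(path) for path in paths}


def mismatches(table_a, table_b):
    """Sorted names whose digests differ or that exist in only one table."""
    names = set(table_a) | set(table_b)
    return sorted(n for n in names if table_a.get(n) != table_b.get(n))
```

An empty list from mismatches means both machines read the same bytes 
for every listed file; it says nothing about timing problems that occur 
while files are still being written.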

This is highly puzzling for us. We cannot understand how a static (!) 
executable behaves differently depending on whether you run it by ssh 
to yourself, or by ssh to a different machine, both machines having the 
same environment... Perhaps another variant of the NFS bug? The problem 
is hard to notice, as no crashes, warnings or illegal output appear -- 
we are afraid it has affected all our parallel calculations of the last 
few months.

Does anyone have a suggestion for further tests or for a solution other 
than going back to ifc7.1? We are out of ideas...

Thanks,
Stefaan



Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm


