[Wien] weird parallel problem

Laurence Marks L-marks at northwestern.edu
Thu Nov 16 17:43:06 CET 2006


If I remember right, there were some changes in how I/O is buffered
between versions 7.1 and 8.0 of the Intel compiler. Depending upon how
your NFS is set up, this could create problems. Try increasing the sleep
time and/or adjusting the NFS options that control caching versus
writing to disc.
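
One way to probe whether NFS caching plays a role is to poll until a file
written on one node shows the expected checksum on another node, rather
than relying on a fixed sleep. The following is only a minimal sketch and
not part of WIEN2k: the function name wait_for_checksum and its arguments
are hypothetical, and it assumes a POSIX shell with cksum available.

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k): poll until FILE yields the
# expected cksum output, or give up after MAX_TRIES attempts.
# Usage: wait_for_checksum FILE EXPECTED MAX_TRIES DELAY_SECONDS
wait_for_checksum() {
    file=$1; expected=$2; max_tries=$3; delay=$4
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if [ -f "$file" ]; then
            # cksum reads stdin here, so no file name appears in its output
            actual=$(cksum < "$file")
            if [ "$actual" = "$expected" ]; then
                echo "visible on try $try"
                return 0
            fi
        fi
        sleep "$delay"
        try=$((try + 1))
    done
    echo "not visible after $max_tries tries" >&2
    return 1
}
```

The idea would be to record `cksum < file` on the node that wrote the file
and then run the function on the other node with that value. If more than
one try is needed, delayed visibility over NFS becomes a plausible suspect.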

On 11/16/06, Stefaan Cottenier <Stefaan.Cottenier at fys.kuleuven.be> wrote:
> Dear wien2k colleagues,
>
> We are plagued by a weird problem when running k-point parallel cases on
> a pc cluster. Serial and parallel runs of the same case give different
> results (5 to 40 mRy/atom in total energy, up to 1 Bohr magneton in
> orbital moment, one quarter of a Bohr magneton difference in spin
> moment, up to 1e21 difference in EFG -- this is far beyond numerical
> noise). We could trace the problem back to the following test:
>
> 1) Run a simple test case (including -orb and -so) for 1 iteration, once
> serial, once k-point 'parallel' over 2 pc's which are both the same
> machine as the one from which the job got launched, and once k-point
> parallel on 2 different pc's. After this single iteration, there are NO
> differences in any output.
>
> 2) Continue with lapw0 and lapw1 of the second iteration. Still NO
> differences.
>
> 3) From the command line, give these commands:
>
> serial case: x lapwso
> parallel on a single machine:
>       ssh machine1 "cd /mydir; lapwso lapwso_1.def"
>       ssh machine1 "cd /mydir; lapwso lapwso_2.def"  <===
> parallel on 2 machines:
>       ssh machine1 "cd /mydir; lapwso lapwso_1.def"
>       ssh machine2 "cd /mydir; lapwso lapwso_2.def"  <===
>
> Now, the files delivered by lapwso are identical in content for the
> serial and dummy parallel case, but different for the truly parallel
> case. In particular, the files from the command that has run on machine2
> differ from the corresponding files that have run on machine1 (marked by
> <=== above). For instance, the eigenvalues printed in case.scfso_2 are
> somewhat different. Also, the case.energysoup_2 files are different, etc.
> If we delete the energysoup/dn files and continue the iteration, then
> the two cases yield again nearly the same results.
>
> Of course we thought immediately about run time library problems. But
> the problem remains if we compile everything as static. We have verified
> that the path and $WIENROOT are the same on all machines, and also the
> 'env' gives identical results on both machines (tested by: ssh machine1
> "env > outputfile"). Some other observations:
>
> * The problem disappears if we compile by ifc7.1. All higher versions up
> to ifort9.1 produce the problem. It does not depend on the mkl version
> (7.2 up to 8.1).
> * If we replace lapwso, lapw2c and sumpara of a 'wrong' version by the
> corresponding executables made by ifc7.1, then the problem disappears
> as well.
> * The problem is less severe if -orb is not used. If spin-orbit coupling
> is dropped, differences become even smaller, but are still not absent
> (sumpara is the only culprit here).
> * All these tests were done with the latest wien2k version (6.4).
> * We carefully followed the compilation instructions by Gerhard Fecher.
> * The pc's of the cluster are common Pentiums, OS is Suse 9.2
> * This reminded us of the 'sleepy NFS bug', but the test as reported in
> http://zeus.theochem.tuwien.ac.at/pipermail/wien/2006-March/006864.html
> gave the correct result in our case.
>
> This is highly puzzling for us. We cannot understand how a static (!)
> executable gives different behaviour depending on whether you run it by
> ssh to yourself, or by ssh to a different machine, both machines having
> the same environment... Perhaps another variant of the NFS bug? The
> problem is hard to notice, as no crashes, warnings or illegal output
> appear -- we are afraid it has affected all our parallel calculations
> of the last few months.
>
> Does anyone have a suggestion for further tests or for a solution other
> than going back to ifc7.1? We are out of ideas...
>
> Thanks,
> Stefaan
>
>
>
> Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
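
The file comparison described in step 3 of the quoted message could be
automated along the following lines. This is a minimal sketch, not part of
WIEN2k: the function name list_differing_files is hypothetical, and it
assumes the outputs of the two runs have been copied into two separate
directories. It compares byte by byte, so any file that embeds a timestamp
or host name would be reported as different even when the numbers agree.

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k): list files that differ between
# two copies of a case directory, e.g. one from the single-machine run and
# one from the two-machine run.
# Usage: list_differing_files DIR_A DIR_B
list_differing_files() {
    dir_a=$1; dir_b=$2
    for path_a in "$dir_a"/*; do
        # Skip subdirectories and an unmatched glob
        [ -f "$path_a" ] || continue
        name=$(basename "$path_a")
        if [ ! -f "$dir_b/$name" ]; then
            echo "missing in $dir_b: $name"
        elif ! cmp -s "$path_a" "$dir_b/$name"; then
            echo "differs: $name"
        fi
    done
}
```

Only files present in the first directory are checked, so running it a
second time with the arguments swapped would also catch files that exist
only in the second directory.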


-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
http://www.numis.northwestern.edu

