[Wien] Parallel job error
Laurence Marks
L-marks at northwestern.edu
Tue Mar 6 19:22:06 CET 2007
If you need a delay of more than 3 seconds and you are running on
separate computers, you almost certainly (99.99%) have an nfs problem:
either the kernel bug or an incorrect setup. Be aware that this is a
very nasty problem. Search the mailing list for nfs, and also see
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2006-November/008332.html
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2006-June/007263.html
The reaction of most system administrators will be that you don't know
what you are talking about and that there is nothing wrong with their
nfs. They are probably wrong!
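
One rough way to see whether nfs is slow to propagate files between
nodes is to create a file on one machine and time how long it takes
before another machine can see it. The sketch below is only an
illustration and is not part of Wien2k: the function names and the host
name in the usage note are made up, and it assumes password-less ssh to
the other node and a working directory that is nfs-mounted on both.

```python
import os
import shlex
import subprocess
import tempfile
import time


def remote_exists(host, path):
    """Return True if `path` exists as seen from `host`.

    Hypothetical helper: runs `test -e` on the other node through ssh.
    BatchMode=yes makes ssh fail instead of prompting for a password.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "test -e " + shlex.quote(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def visibility_delay(directory, exists, timeout=30.0, poll=0.1):
    """Create a marker file in `directory` and time until `exists(path)` is True.

    Returns the delay in seconds, or None if the file was not seen
    within `timeout` seconds. The marker file is always removed.
    """
    fd, path = tempfile.mkstemp(prefix=".nfs_probe_", dir=directory)
    os.close(fd)
    start = time.monotonic()
    try:
        while time.monotonic() - start < timeout:
            if exists(path):
                return time.monotonic() - start
            time.sleep(poll)
        return None
    finally:
        os.remove(path)
```

For example, from the case directory,
visibility_delay(os.getcwd(), lambda p: remote_exists("node2", p))
returns the delay in seconds (this includes the ssh connection time
itself), or None if the file never shows up. A delay of several seconds
would be consistent with an nfs problem rather than a Wien2k one.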
On 3/6/07, jadhikari at clarku.edu <jadhikari at clarku.edu> wrote:
> Prof. L Marks,
>
> Thank you very much for the suggestions. I tried all the combinations and
> finally found that the problem is with sleep and delay parameters which I
> do not know how to set for my current calculations.
>
> Subin
>
> > It is 99.99% certain that this has nothing to do with Wien2k or
> > compilation options. It sounds like the jobs are not being
> > appropriately dispatched to the different processors. If they are on
> > different machines, that means that something is failing in terms of
> > your nfs/ssh communications, and it might be as simple as an incorrect
> > nfs setup or the nfs bug (search the mailing list). In this case, ssh
> > to the other node and use top to see whether the job is running. Also
> > look in :log and case.dayfile, and/or turn on debugging in lapw1para so
> > you can find out what is going on. Look also in the case.output1_?? files.
> >
> > If it is running many points on a single, multiprocessor machine it
> > might be something similar -- is your ssh/rsh or whatever working
> > right? Use top and ps.
> >
> > On 3/2/07, jadhikari at clarku.edu <jadhikari at clarku.edu> wrote:
> >> Dear Wien users,
> >>
> >> The calculation (with the input file below) runs to convergence with 1 k
> >> point and 1 processor. It takes about 18 hours for 18 cycles.
> >>
> >> With a higher number of k points and multiple processors, it always
> >> fails. It gets stuck after LAPW1 END in the first cycle. The following
> >> is part of the dayfile:
> >>
> >> [1] - Done ( cd $PWD; $t $exe ${def}_$loop.def; rm -f .lock_$lockfile[$p] ) >> ...
> >>
> >> Then it seems to be static and not moving forward.
> >>
> >> This calculation involving NaNbO3 has never converged in parallel mode,
> >> but I could manage with other systems like TiO2 and NbO2. Regarding
> >> static and floating, as previously mentioned on the board for compiler
> >> options/flags, it seems this is not an issue in our case.
> >>
> >> Is there any other parameter that has to be set after switching to a
> >> different space group or a system with a different number of atoms? I
> >> guess this is not the cause. Why is this system different from NbO2,
> >> which runs fine in parallel mode? There is something about the parallel
> >> option in the present case that we are missing.
> >>
> >> Any help regarding fixing of this error will be highly appreciated.
> >>
> >> Have a happy lunar eclipse.
> >> Subin
> >>
> >>
> >> case.in1
> >> __________________________________________________________
> >> WFFIL (WFPRI, SUPWF)
> >> 7.00 10 4 (R-MT*K-MAX; MAX L IN WF, V-NMT
> >> .05320 5 0 global e-param with N other choices, napw
> >> 0 0.140 0.000 CONT 1
> >> 0 -3.248 0.002 CONT 1
> >> 1 0.238 0.000 CONT 1
> >> 1 -1.189 0.000 CONT 1
> >> 2 0.215 0.000 CONT 1
> >> .05320 5 0 global e-param with N other choices, napw
> >> 0 0.111 0.000 CONT 1
> >> 0 -3.265 0.002 CONT 1
> >> 1 0.204 0.000 CONT 1
> >> 1 -1.206 0.000 CONT 1
> >> 2 0.195 0.000 CONT 1
> >> .05320 6 0 global e-param with N other choices, napw
> >> 0 0.054 0.000 CONT 1
> >> 0 -3.611 0.002 CONT 1
> >> 1 0.225 0.000 CONT 1
> >> 1 -1.858 0.000 CONT 1
> >> 2 0.096 0.000 CONT 1
> >> 2 -0.867 0.000 CONT 1
> >> .05320 3 0 global e-param with N other choices, napw
> >> 0 0.184 0.000 CONT 1
> >> 0 -0.851 0.000 CONT 1
> >> 1 0.201 0.000 CONT 1
> >> .05320 3 0 global e-param with N other choices, napw
> >> 0 0.187 0.000 CONT 1
> >> 0 -0.870 0.000 CONT 1
> >> 1 0.184 0.000 CONT 1
> >> .05320 3 0 global e-param with N other choices, napw
> >> 0 0.201 0.000 CONT 1
> >> 0 -0.884 0.000 CONT 1
> >> 1 0.167 0.000 CONT 1
> >> .05320 3 0 global e-param with N other choices, napw
> >> 0 0.204 0.000 CONT 1
> >> 0 -0.815 0.000 CONT 1
> >> 1 0.234 0.000 CONT 1
> >> K-VECTORS FROM UNIT:4 -10.0 2.0 emin/emax window
> >> _______________________________________________________________
> >> Case.struct
> >>
> >> Sodium Niobate
> >> P LATTICE,NONEQUIV.ATOMS: 7 57_Pbcm
> >> MODE OF CALC=RELA unit=bohr
> >> 10.404800 10.518200 29.328600 90.000000 90.000000 90.000000
> >> ATOM -1: X=0.24300000 Y=0.75000000 Z=0.00000000
> >> MULT= 4 ISPLIT= 8
> >> -1: X=0.75700000 Y=0.25000000 Z=0.00000000
> >> -1: X=0.24300000 Y=0.75000000 Z=0.50000000
> >> -1: X=0.75700000 Y=0.25000000 Z=0.50000000
> >> Na1 NPT= 781 R0=0.00010000 RMT= 2.5000 Z: 11.0
> >> LOCAL ROT MATRIX: 0.0000000 0.0000000 1.0000000
> >> 1.0000000 0.0000000 0.0000000
> >> 0.0000000 1.0000000 0.0000000
> >> ATOM -2: X=0.23900000 Y=0.78200000 Z=0.25000000
> >> MULT= 4 ISPLIT= 8
> >> -2: X=0.76100000 Y=0.21800000 Z=0.75000000
> >> -2: X=0.76100000 Y=0.28200000 Z=0.25000000
> >> -2: X=0.23900000 Y=0.71800000 Z=0.75000000
> >> Na2 NPT= 781 R0=0.00010000 RMT= 2.5000 Z: 11.0
> >> LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
> >> 0.0000000 1.0000000 0.0000000
> >> 0.0000000 0.0000000 1.0000000
> >> ATOM -3: X=0.25660000 Y=0.27220000 Z=0.12620000
> >> MULT= 8 ISPLIT= 8
> >> -3: X=0.74340000 Y=0.72780000 Z=0.87380000
> >> -3: X=0.25660000 Y=0.27220000 Z=0.37380000
> >> -3: X=0.74340000 Y=0.72780000 Z=0.62620000
> >> -3: X=0.74340000 Y=0.77220000 Z=0.12620000
> >> -3: X=0.25660000 Y=0.22780000 Z=0.87380000
> >> -3: X=0.74340000 Y=0.77220000 Z=0.37380000
> >> -3: X=0.25660000 Y=0.22780000 Z=0.62620000
> >> Nb NPT= 781 R0=0.00010000 RMT= 1.8000 Z: 41.0
> >> LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
> >> 0.0000000 1.0000000 0.0000000
> >> 0.0000000 0.0000000 1.0000000
> >> ATOM -4: X=0.30400000 Y=0.25000000 Z=0.00000000
> >> MULT= 4 ISPLIT= 8
> >> -4: X=0.69600000 Y=0.75000000 Z=0.00000000
> >> -4: X=0.30400000 Y=0.25000000 Z=0.50000000
> >> -4: X=0.69600000 Y=0.75000000 Z=0.50000000
> >> O 1 NPT= 781 R0=0.00010000 RMT= 1.4000 Z: 8.0
> >> LOCAL ROT MATRIX: 0.0000000 0.0000000 1.0000000
> >> 1.0000000 0.0000000 0.0000000
> >> 0.0000000 1.0000000 0.0000000
> >> ATOM -5: X=0.19100000 Y=0.23300000 Z=0.25000000
> >> MULT= 4 ISPLIT= 8
> >> -5: X=0.80900000 Y=0.76700000 Z=0.75000000
> >> -5: X=0.80900000 Y=0.73300000 Z=0.25000000
> >> -5: X=0.19100000 Y=0.26700000 Z=0.75000000
> >> O 2 NPT= 781 R0=0.00010000 RMT= 1.4000 Z: 8.0
> >> LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
> >> 0.0000000 1.0000000 0.0000000
> >> 0.0000000 0.0000000 1.0000000
> >> ATOM -6: X=0.53600000 Y=0.03200000 Z=0.14000000
> >> MULT= 8 ISPLIT= 8
> >> -6: X=0.46400000 Y=0.96800000 Z=0.86000000
> >> -6: X=0.53600000 Y=0.03200000 Z=0.36000000
> >> -6: X=0.46400000 Y=0.96800000 Z=0.64000000
> >> -6: X=0.46400000 Y=0.53200000 Z=0.14000000
> >> -6: X=0.53600000 Y=0.46800000 Z=0.86000000
> >> -6: X=0.46400000 Y=0.53200000 Z=0.36000000
> >> -6: X=0.53600000 Y=0.46800000 Z=0.64000000
> >> O 3 NPT= 781 R0=0.00010000 RMT= 1.4000 Z: 8.0
> >> LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
> >> 0.0000000 1.0000000 0.0000000
> >> 0.0000000 0.0000000 1.0000000
> >> ATOM -7: X=0.96600000 Y=0.46700000 Z=0.11000000
> >> MULT= 8 ISPLIT= 8
> >> -7: X=0.03400000 Y=0.53300000 Z=0.89000000
> >> -7: X=0.96600000 Y=0.46700000 Z=0.39000000
> >> -7: X=0.03400000 Y=0.53300000 Z=0.61000000
> >> -7: X=0.03400000 Y=0.96700000 Z=0.11000000
> >> -7: X=0.96600000 Y=0.03300000 Z=0.89000000
> >> -7: X=0.03400000 Y=0.96700000 Z=0.39000000
> >> -7: X=0.96600000 Y=0.03300000 Z=0.61000000
> >> O 4 NPT= 781 R0=0.00010000 RMT= 1.4000 Z: 8.0
> >> LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
> >> 0.0000000 1.0000000 0.0000000
> >> 0.0000000 0.0000000 1.0000000
> >> 8 NUMBER OF SYMMETRY OPERATIONS
> >> -1 0 0 0.00000000
> >> 0-1 0 0.00000000
> >> 0 0-1 0.00000000
> >> 1
> >> 1 0 0 0.00000000
> >> 0 1 0 0.00000000
> >> 0 0 1 0.00000000
> >> 2
> >> -1 0 0 0.00000000
> >> 0-1 0 0.00000000
> >> 0 0 1 0.50000000
> >> 3
> >> -1 0 0 0.00000000
> >> 0 1 0 0.50000000
> >> 0 0-1 0.50000000
> >> 4
> >> -1 0 0 0.00000000
> >> 0 1 0 0.50000000
> >> 0 0 1 0.00000000
> >> 5
> >> 1 0 0 0.00000000
> >> 0-1 0 0.50000000
> >> 0 0-1 0.00000000
> >> 6
> >> 1 0 0 0.00000000
> >> 0-1 0 0.50000000
> >> 0 0 1 0.50000000
> >> 7
> >> 1 0 0 0.00000000
> >> 0 1 0 0.00000000
> >> 0 0-1 0.50000000
> >> 8
> >> ___________________________________________________________
> >>
> >> _______________________________________________
> >> Wien mailing list
> >> Wien at zeus.theochem.tuwien.ac.at
> >> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> >>
> >
> >
> > --
> > Laurence Marks
> > Department of Materials Science and Engineering
> > MSE Rm 2036 Cook Hall
> > 2220 N Campus Drive
> > Northwestern University
> > Evanston, IL 60208, USA
> > Tel: (847) 491-3996 Fax: (847) 491-7820
> > email: L-marks at northwestern dot edu
> > Web: www.numis.northwestern.edu
> > EMM2007 http://ns.crys.ras.ru/EMMM07/