[Wien] Parallel job error

Laurence Marks L-marks at northwestern.edu
Tue Mar 6 19:22:06 CET 2007


If you have to set sleep or delay to more than 3 seconds, and you are
running on separate computers, you almost certainly (99.99%) have an NFS
problem: either the kernel bug or an incorrect setup. Be aware that this
is a very nasty problem. Search the mailing list for "nfs", and also see

http://zeus.theochem.tuwien.ac.at/pipermail/wien/2006-November/008332.html
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2006-June/007263.html

The reaction of most system administrators will be that you don't know
what you are talking about and that there is nothing wrong with their
NFS. They are probably wrong!
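
One way to gather evidence is to measure how long a file written on one
machine takes to become visible on another. The following is only a
minimal sketch, not part of Wien2k: it assumes passwordless ssh, that the
case directory is mounted at the same absolute path on both machines, and
that Python 3 is available on the machine you start from. The function
names and the node name in the example are hypothetical.

```python
import shlex
import subprocess
import time
import uuid
from pathlib import Path


def wait_until(check, timeout=30.0, interval=0.2):
    """Poll check() until it is true; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while True:
        if check():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)


def remote_sees(host, path):
    """True if `path` exists when looked up from `host` over ssh."""
    result = subprocess.run(
        ["ssh", host, "test -e " + shlex.quote(str(path))],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def nfs_lag(host, shared_dir, timeout=30.0):
    """Create a marker file locally and time how long `host` needs to see it."""
    marker = Path(shared_dir).resolve() / f".nfs_probe_{uuid.uuid4().hex}"
    marker.write_text("probe\n")
    try:
        return wait_until(lambda: remote_sees(host, marker), timeout=timeout)
    finally:
        marker.unlink()  # always clean up the probe file


# Example (hypothetical node name, run from inside the case directory):
#   lag = nfs_lag("node2", ".")
#   print("not visible within timeout" if lag is None else f"{lag:.2f} s")
```

The measured time includes the ssh round trip, so compare it with a plain
"ssh node2 true". A lag of several seconds beyond that, or one that varies
a lot between runs, is the kind of result worth showing to the
administrator.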

On 3/6/07, jadhikari at clarku.edu <jadhikari at clarku.edu> wrote:
> Prof. L Marks,
>
> Thank you very much for the suggestions. I tried all the combinations and
> finally found that the problem is with sleep and delay parameters which I
> do not know how to set for my current calculations.
>
> Subin
>
> > It is 99.99% certain that this has nothing to do with Wien2k or
> > compilation options. It sounds like the jobs are not being
> > appropriately dispatched to the different processors. If they are on
> > different machines, that means that something is failing in terms of
> > your nfs/ssh communications and it might be as simple as an incorrect
> > nfs setup or the nfs bug (search the mailing list). In this case, ssh
> > to the other node and use top to see if the job is running there. Also
> > look in :log and case.dayfile, and/or turn on debugging in lapw1para so
> > you can find out what is going on. Look also in the case.output1_??
> > files.
> >
> > If it is running many points on a single, multiprocessor machine it
> > might be something similar -- is your ssh/rsh or whatever working
> > right? Use top and ps.
> >
> > On 3/2/07, jadhikari at clarku.edu <jadhikari at clarku.edu> wrote:
> >> Dear Wien users,
> >>
> >> The calculation (with the input file below) runs to convergence with 1 k
> >> point and 1 processor. It takes about 18 hours for 18 cycles.
> >>
> >> With a higher number of k points and multiple processors it always
> >> fails. It gets stuck after LAPW1 END in the first cycle. The following
> >> is part of the dayfile:
> >>
> >> [1]  - Done                      ( cd $PWD; $t $exe ${def}_$loop.def; rm
> >> -f .lock_$lockfile[$p] ) >>  ...
> >>
> >> Then it seems to be static and not moving forward.
> >>
> >> This calculation involving NaNbO3 has never converged in parallel mode,
> >> but I could manage with other systems like TiO2 and NbO2. Regarding
> >> static and floating, as previously mentioned on the board for compiler
> >> options/flags, it seems this is not an issue in our case.
> >>
> >> Is there any other parameter that has to be set after switching to
> >> a different space group or a system with a different number of atoms?
> >> I guess this is not the cause. Why is this system different from NbO2,
> >> which runs fine in parallel mode? There is something about the parallel
> >> option in the present case that we are missing.
> >>
> >> Any help regarding fixing of this error will be highly appreciated.
> >>
> >> Have a happy lunar eclipse.
> >> Subin
> >>
> >>
> >> case.in1
> >> __________________________________________________________
> >> WFFIL        (WFPRI, SUPWF)
> >>   7.00       10    4 (R-MT*K-MAX; MAX L IN WF, V-NMT
> >>  .05320   5   0      global e-param with N other choices, napw
> >>  0    0.140     0.000 CONT 1
> >>  0   -3.248     0.002 CONT 1
> >>  1    0.238     0.000 CONT 1
> >>  1   -1.189     0.000 CONT 1
> >>  2    0.215     0.000 CONT 1
> >>  .05320   5   0      global e-param with N other choices, napw
> >>  0    0.111     0.000 CONT 1
> >>  0   -3.265     0.002 CONT 1
> >>  1    0.204     0.000 CONT 1
> >>  1   -1.206     0.000 CONT 1
> >>  2    0.195     0.000 CONT 1
> >>  .05320   6   0      global e-param with N other choices, napw
> >>  0    0.054     0.000 CONT 1
> >>  0   -3.611     0.002 CONT 1
> >>  1    0.225     0.000 CONT 1
> >>  1   -1.858     0.000 CONT 1
> >>  2    0.096     0.000 CONT 1
> >>  2   -0.867     0.000 CONT 1
> >>  .05320   3   0      global e-param with N other choices, napw
> >>  0    0.184     0.000 CONT 1
> >>  0   -0.851     0.000 CONT 1
> >>  1    0.201     0.000 CONT 1
> >>  .05320   3   0      global e-param with N other choices, napw
> >>  0    0.187     0.000 CONT 1
> >>  0   -0.870     0.000 CONT 1
> >>  1    0.184     0.000 CONT 1
> >>  .05320   3   0      global e-param with N other choices, napw
> >>  0    0.201     0.000 CONT 1
> >>  0   -0.884     0.000 CONT 1
> >>  1    0.167     0.000 CONT 1
> >>  .05320   3   0      global e-param with N other choices, napw
> >>  0    0.204     0.000 CONT 1
> >>  0   -0.815     0.000 CONT 1
> >>  1    0.234     0.000 CONT 1
> >> K-VECTORS FROM UNIT:4   -10.0       2.0      emin/emax window
> >> _______________________________________________________________
> >> case.struct
> >>
> >> Sodium Niobate
> >> P   LATTICE,NONEQUIV.ATOMS:  757_Pbcm
> >> MODE OF CALC=RELA unit=bohr
> >>  10.404800 10.518200 29.328600 90.000000 90.000000 90.000000
> >> ATOM  -1: X=0.24300000 Y=0.75000000 Z=0.00000000
> >>           MULT= 4          ISPLIT= 8
> >>       -1: X=0.75700000 Y=0.25000000 Z=0.00000000
> >>       -1: X=0.24300000 Y=0.75000000 Z=0.50000000
> >>       -1: X=0.75700000 Y=0.25000000 Z=0.50000000
> >> Na1        NPT=  781  R0=0.00010000 RMT=    2.5000   Z: 11.0
> >> LOCAL ROT MATRIX:    0.0000000 0.0000000 1.0000000
> >>                      1.0000000 0.0000000 0.0000000
> >>                      0.0000000 1.0000000 0.0000000
> >> ATOM  -2: X=0.23900000 Y=0.78200000 Z=0.25000000
> >>           MULT= 4          ISPLIT= 8
> >>       -2: X=0.76100000 Y=0.21800000 Z=0.75000000
> >>       -2: X=0.76100000 Y=0.28200000 Z=0.25000000
> >>       -2: X=0.23900000 Y=0.71800000 Z=0.75000000
> >> Na2        NPT=  781  R0=0.00010000 RMT=    2.5000   Z: 11.0
> >> LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
> >>                      0.0000000 1.0000000 0.0000000
> >>                      0.0000000 0.0000000 1.0000000
> >> ATOM  -3: X=0.25660000 Y=0.27220000 Z=0.12620000
> >>           MULT= 8          ISPLIT= 8
> >>       -3: X=0.74340000 Y=0.72780000 Z=0.87380000
> >>       -3: X=0.25660000 Y=0.27220000 Z=0.37380000
> >>       -3: X=0.74340000 Y=0.72780000 Z=0.62620000
> >>       -3: X=0.74340000 Y=0.77220000 Z=0.12620000
> >>       -3: X=0.25660000 Y=0.22780000 Z=0.87380000
> >>       -3: X=0.74340000 Y=0.77220000 Z=0.37380000
> >>       -3: X=0.25660000 Y=0.22780000 Z=0.62620000
> >> Nb         NPT=  781  R0=0.00010000 RMT=    1.8000   Z: 41.0
> >> LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
> >>                      0.0000000 1.0000000 0.0000000
> >>                      0.0000000 0.0000000 1.0000000
> >> ATOM  -4: X=0.30400000 Y=0.25000000 Z=0.00000000
> >>           MULT= 4          ISPLIT= 8
> >>       -4: X=0.69600000 Y=0.75000000 Z=0.00000000
> >>       -4: X=0.30400000 Y=0.25000000 Z=0.50000000
> >>       -4: X=0.69600000 Y=0.75000000 Z=0.50000000
> >> O 1        NPT=  781  R0=0.00010000 RMT=    1.4000   Z:  8.0
> >> LOCAL ROT MATRIX:    0.0000000 0.0000000 1.0000000
> >>                      1.0000000 0.0000000 0.0000000
> >>                      0.0000000 1.0000000 0.0000000
> >> ATOM  -5: X=0.19100000 Y=0.23300000 Z=0.25000000
> >>           MULT= 4          ISPLIT= 8
> >>       -5: X=0.80900000 Y=0.76700000 Z=0.75000000
> >>       -5: X=0.80900000 Y=0.73300000 Z=0.25000000
> >>       -5: X=0.19100000 Y=0.26700000 Z=0.75000000
> >> O 2        NPT=  781  R0=0.00010000 RMT=    1.4000   Z:  8.0
> >> LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
> >>                      0.0000000 1.0000000 0.0000000
> >>                      0.0000000 0.0000000 1.0000000
> >> ATOM  -6: X=0.53600000 Y=0.03200000 Z=0.14000000
> >>           MULT= 8          ISPLIT= 8
> >>       -6: X=0.46400000 Y=0.96800000 Z=0.86000000
> >>       -6: X=0.53600000 Y=0.03200000 Z=0.36000000
> >>       -6: X=0.46400000 Y=0.96800000 Z=0.64000000
> >>       -6: X=0.46400000 Y=0.53200000 Z=0.14000000
> >>       -6: X=0.53600000 Y=0.46800000 Z=0.86000000
> >>       -6: X=0.46400000 Y=0.53200000 Z=0.36000000
> >>       -6: X=0.53600000 Y=0.46800000 Z=0.64000000
> >> O 3        NPT=  781  R0=0.00010000 RMT=    1.4000   Z:  8.0
> >> LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
> >>                      0.0000000 1.0000000 0.0000000
> >>                      0.0000000 0.0000000 1.0000000
> >> ATOM  -7: X=0.96600000 Y=0.46700000 Z=0.11000000
> >>           MULT= 8          ISPLIT= 8
> >>       -7: X=0.03400000 Y=0.53300000 Z=0.89000000
> >>       -7: X=0.96600000 Y=0.46700000 Z=0.39000000
> >>       -7: X=0.03400000 Y=0.53300000 Z=0.61000000
> >>       -7: X=0.03400000 Y=0.96700000 Z=0.11000000
> >>       -7: X=0.96600000 Y=0.03300000 Z=0.89000000
> >>       -7: X=0.03400000 Y=0.96700000 Z=0.39000000
> >>       -7: X=0.96600000 Y=0.03300000 Z=0.61000000
> >> O 4        NPT=  781  R0=0.00010000 RMT=    1.4000   Z:  8.0
> >> LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
> >>                      0.0000000 1.0000000 0.0000000
> >>                      0.0000000 0.0000000 1.0000000
> >>    8      NUMBER OF SYMMETRY OPERATIONS
> >> -1 0 0 0.00000000
> >>  0-1 0 0.00000000
> >>  0 0-1 0.00000000
> >>        1
> >>  1 0 0 0.00000000
> >>  0 1 0 0.00000000
> >>  0 0 1 0.00000000
> >>        2
> >> -1 0 0 0.00000000
> >>  0-1 0 0.00000000
> >>  0 0 1 0.50000000
> >>        3
> >> -1 0 0 0.00000000
> >>  0 1 0 0.50000000
> >>  0 0-1 0.50000000
> >>        4
> >> -1 0 0 0.00000000
> >>  0 1 0 0.50000000
> >>  0 0 1 0.00000000
> >>        5
> >>  1 0 0 0.00000000
> >>  0-1 0 0.50000000
> >>  0 0-1 0.00000000
> >>        6
> >>  1 0 0 0.00000000
> >>  0-1 0 0.50000000
> >>  0 0 1 0.50000000
> >>        7
> >>  1 0 0 0.00000000
> >>  0 1 0 0.00000000
> >>  0 0-1 0.50000000
> >>        8
> >> ___________________________________________________________
> >>
> >> _______________________________________________
> >> Wien mailing list
> >> Wien at zeus.theochem.tuwien.ac.at
> >> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> >>
> >
> >
> > --
> > Laurence Marks
> > Department of Materials Science and Engineering
> > MSE Rm 2036 Cook Hall
> > 2220 N Campus Drive
> > Northwestern University
> > Evanston, IL 60208, USA
> > Tel: (847) 491-3996 Fax: (847) 491-7820
> > email: L-marks at northwestern dot edu
> > Web: www.numis.northwestern.edu
> > EMM2007 http://ns.crys.ras.ru/EMMM07/
>
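
The check suggested in the quoted message above (ssh to the other node
and see whether the job is actually running) can be scripted, which helps
when there are many nodes to look at. This is only a sketch under
assumptions: passwordless ssh, a ps on the remote side that accepts the
POSIX "-o" format options, and executables named lapw1 or lapw1c. The
function names are hypothetical and not part of Wien2k.

```python
import os
import subprocess


def parse_ps(ps_output, names=("lapw1", "lapw1c")):
    """Return (pid, elapsed, command) tuples for the programs of interest.

    Expects headerless lines of the form "<pid> <etime> <command>".
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        command = os.path.basename(fields[2].strip())
        if command in names:
            found.append((int(fields[0]), fields[1], command))
    return found


def running_on(host, names=("lapw1", "lapw1c")):
    """Ask `host` over ssh which of the named programs are running."""
    out = subprocess.run(
        ["ssh", host, "ps -e -o pid= -o etime= -o comm="],
        capture_output=True,
        text=True,
        check=True,  # raise if ssh or ps fails instead of reporting "nothing"
    ).stdout
    return parse_ps(out, names)


# Example (hypothetical node names):
#   for node in ("node1", "node2"):
#       print(node, running_on(node))
```

An empty list for a node that the dayfile says should be busy means the
job was never started there or has already died. A process whose elapsed
time keeps growing while top shows no CPU use points to one that is
waiting on something, for example a file that has not yet appeared over
NFS.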


-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
EMM2007 http://ns.crys.ras.ru/EMMM07/

