[Wien] Wien2k stopped working

Luis Ogando lcodacal at gmail.com
Thu May 29 18:48:49 CEST 2014


Dear Prof. Marks,

   Thank you very much for your comments.
   I suspect that "sys_adm" changed something they "assumed" was
harmless for users.
   I will follow your suggestions.
   All the best,
                 Luis


2014-05-29 10:19 GMT-03:00 Laurence Marks <L-marks at northwestern.edu>:

> Problems such as this are hard to help with remotely. It looks like
> something has gone wrong at the system level, and my guess is that it
> has one of two sources:
>
> a. Something has gone wrong with your account/directories. It could be
> as simple as your time allocation having expired, your password having
> been hacked, or your .bashrc file having become corrupted. Check the
> basics, e.g. that you can create a file, compile a simple program, etc.
> While this is unlikely, you never know.
>
> b. There have been OS changes "of some sort". Many sys_admins assume
> that users only employ the software that is provided, often via
> modules, and this is not compatible with how Wien2k runs. It may be
> that they have removed some of the libraries that you linked Wien2k
> against, or changed how the node list is provided to you (which may
> break Machines2W). For instance, the overwriting of OMP_NUM_THREADS
> implies to me that someone has decided that "of course" you want to
> run with OpenMP threading, which at least at the moment is not useful
> to you. (I know PB wants to change this, so this statement may change
> at some point.)
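>
> A quick check of the library side (a sketch; it assumes WIENROOT is set
> in your job environment and that your cluster uses the "module"
> command):
>
> ldd $WIENROOT/lapw1_mpi | grep "not found"   # any output means a library you linked against is gone
> module list                                  # compare against what you compiled with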
>
> Run some diagnostics to work out what has happened, for instance (a
> sketch of a diagnostic job follows this list):
> * Compile something like "hello world" in both MPI and non-MPI
> versions, then run it in a simple job.
> * Write a small script to interrogate the environment when you start a
> job, e.g. with commands such as "ldd $WIENROOT/lapw1_mpi" and "env |
> grep -e MPI -e MKL", as well as obvious ones such as ulimit, "echo
> $PATH", etc.
> * Check the cluster web page; maybe they have announced some changes.
> * Run "ifort --version" and similar, as well as "which mpirun" --
> maybe the versions or paths have changed.
> * If you know a friendly sys_admin, ask them for general info. It is
> good to nurture someone.
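>
> As a concrete starting point, something like this diagnostic job (a
> sketch only -- scheduler directives omitted, and mpif90/mpirun may have
> different names or paths on your system):
>
> #!/bin/bash
> # interrogate the environment the batch system actually gives you
> echo "HOST: $(hostname)"
> echo "PATH: $PATH"
> ulimit -a
> env | grep -e MPI -e MKL -e OMP
> which ifort mpirun
> ifort --version
> ldd $WIENROOT/lapw1_mpi
> # a trivial non-MPI test
> cat > hello.f90 <<'EOF'
> program hello
>   print *, 'hello from serial'
> end program hello
> EOF
> ifort hello.f90 -o hello_serial && ./hello_serial
> # a trivial MPI test
> cat > hello_mpi.f90 <<'EOF'
> program hello_mpi
>   use mpi
>   implicit none
>   integer :: ierr, rank
>   call MPI_Init(ierr)
>   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>   print *, 'hello from rank', rank
>   call MPI_Finalize(ierr)
> end program hello_mpi
> EOF
> mpif90 hello_mpi.f90 -o hello_mpi && mpirun -np 2 ./hello_mpi
>
> If the serial test passes but the MPI one fails, that already narrows
> things down considerably.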
>
> Of course, all of this may be totally wrong and you may have already
> sorted things out.
>
>
> On Wed, May 28, 2014 at 8:23 AM, Luis Ogando <lcodacal at gmail.com> wrote:
> > Dear Wien2k community,
> >
> >    I have Wien2k 13.1 installed on an SGI cluster using ifort, icc
> > and Open MPI. The installation was hard work (I would like to thank
> > Prof. Laurence Marks again for his help), but since then I have used
> > Wien2k without problems for several months.
> >    I performed the first step of a long calculation and saved it in
> > a different directory. When I tried the next step in the original
> > directory, Wien2k crashed. After some tests, I decided to restart
> > the calculation from the beginning (in other words, to repeat the
> > first step). To my surprise, I did not succeed even in this case,
> > and I would like to know if someone has faced such an unexpected
> > problem.
> >    Please find below the output files that I consider most relevant.
> >    Finally, I would like to stress some points:
> >
> > 1) lapw0 stops after roughly 7 minutes, whereas it took about 2
> > hours in the successful calculation.
> >
> > 2) lapw1 stops after 5 seconds without generating the case.energy_*
> > files, and case.dayfile does not contain the timing statistics for
> > each processor.
> >
> > 3) OMP_NUM_THREADS=12 is overwritten by the system (in my .bashrc I
> > have OMP_NUM_THREADS=1), but even when I export this variable equal
> > to 1 in the submission script, I get the same crash (see the sketch
> > below).
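> >
> >    What I have in the submission script is essentially this (a
> > sketch; scheduler lines omitted, and the echo is only there to
> > record which value the job really sees):
> >
> > export OMP_NUM_THREADS=1
> > echo "OMP_NUM_THREADS = $OMP_NUM_THREADS"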
> >
> >    Thank you very much for your attention,
> >               Luis
> > ===========================================================
> > :log file
> >
> >>   (init_lapw) options:
> > Wed Apr  2 14:07:30 BRT 2014> (x_lapw) nn -f InPzb15InPwurt3-V2
> > Wed Apr  2 14:07:46 BRT 2014> (x) nn
> > Wed Apr  2 14:08:03 BRT 2014> (x) sgroup
> > Wed Apr  2 14:08:23 BRT 2014> (x) symmetry
> > Wed Apr  2 14:08:48 BRT 2014> (x) lstart
> > Wed Apr  2 14:09:38 BRT 2014> (x) kgen
> > Wed Apr  2 14:09:58 BRT 2014> (x) dstart -c -p
> >>   (initso_lapw) options:
> > Tue May 27 16:07:00 BRT 2014> (x) Machines2W
> >>   (run_lapw) options: -p -NI -ec 0.0001 -cc 0.0001 -i 150 -it
> > Tue May 27 16:07:00 BRT 2014> (x) lapw0 -p
> > Tue May 27 16:14:10 BRT 2014> (x) lapw1 -it -p -c
> > Tue May 27 16:14:15 BRT 2014> (x) lapw2 -p -c
> >
> > ===========================================================
> > case.dayfile
> >
> > Calculating InPzb15InPwurt3-V2 in
> > /home/ice/proj/proj546/ogando/Wien/Calculos/InP/InPzbInPwurt/15camadasZB+3WZ/InPzb15InPwurt3-V2
> > on r1i0n15 with PID 6538
> > using WIEN2k_13.1 (Release 17/6/2013) in
> > /home/ice/proj/proj546/ogando/Wien/Executaveis-13-OpenMPI
> >
> >
> >     start (Tue May 27 16:07:00 BRT 2014) with lapw0 (150/99 to go)
> >
> >     cycle 1 (Tue May 27 16:07:00 BRT 2014) (150/99 to go)
> >
> >>   lapw0 -p (16:07:00) starting parallel lapw0 at Tue May 27 16:07:00 BRT
> >> 2014
> > -------- .machine0 : 12 processors
> > 2540.314u 12.204s 7:09.36 594.4% 0+0k 180672+52736io 5pf+0w
> >>   lapw1 -it -p   -c (16:14:10) starting parallel lapw1 at Tue May 27
> >> 16:14:10 BRT 2014
> > ->  starting parallel LAPW1 jobs at Tue May 27 16:14:10 BRT 2014
> > running LAPW1 in parallel mode (using .machines)
> > 12 number_of_parallel_jobs
> >      r1i0n15(1)      r1i0n15(1)      r1i0n15(1)      r1i0n15(1)
> > r1i0n15(1)      r1i0n15(1)      r1i0n15(1)      r1i0n15(1)
> > r1i0n15(1)
> > r1i0n15(1)      r1i0n15(1)      r1i0n15(1)    Summary of lapw1para:
> >    r1i0n15 k=1 user=0 wallclock=1
> > 0.132u 0.136s 0:04.75 5.4% 0+0k 4104+1688io 5pf+0w
> >>   lapw2 -p   -c   (16:14:15) running LAPW2 in parallel mode
> > **  LAPW2 crashed!
> > 0.396u 0.016s 0:00.66 60.6% 0+0k 6424+11472io 1pf+0w
> > error: command
> > /home/ice/proj/proj546/ogando/Wien/Executaveis-13-OpenMPI/lapw2cpara -c
> > lapw2.def   failed
> >
> >>   stop error
> >
> > ===========================================================
> > lapw2.error (the only non empty case.error)
> >
> > Error in LAPW2
> >  'LAPW2' - can't open unit: 30
> >  'LAPW2' -        filename: InPzb15InPwurt3-V2.energy_1
> > **  testerror: Error in Parallel LAPW2
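> >
> >    A quick check in the case directory (a sketch; as far as I
> > understand, .processes is the bookkeeping file that lapw1para
> > writes):
> >
> > ls -l InPzb15InPwurt3-V2.energy_* InPzb15InPwurt3-V2.scf1_*
> > cat .processes
> >
> >    All of these are missing here, consistent with the "grep:
> > .processes" line in the standard output below.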
> >
> > ===========================================================
> > The standard output file
> >
> >
> > OMP_NUM_THREADS =  12
> >
> > -----------------------------------------
> > Job start: Tue May 27 16:07:00 BRT 2014
> > Hostname:  r1i0n15
> > PWD:
> > /home/ice/proj/proj546/ogando/Wien/Calculos/InP/InPzbInPwurt/15camadasZB+3WZ/InPzb15InPwurt3-V2
> > 0.000u 0.000s 0:00.05 0.0% 0+0k 8216+24io 1pf+0w
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> >  LAPW0 END
> > grep: .processes: No such file or directory
> > InPzb15InPwurt3-V2.scf1_1: No such file or directory.
> > grep: No match.
> > FERMI - Error
> > cp: cannot stat `.in.tmp': No such file or directory
> >
> >>   stop error
> > Job end: Tue May 27 16:14:15 BRT 2014
> > -----------------------------------------
> >
> > OMP_NUM_THREADS =  12
> >
> > =======================================
> > My parallel_options file
> >
> > setenv TASKSET "no"
> > setenv USE_REMOTE 1
> > setenv MPI_REMOTE 0
> > setenv WIEN_GRANULARITY 1
> > setenv WIEN_MPIRUN "/home/ice/proj/proj546/ogando/OpenMPIexec/bin/mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
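> >
> >    A standalone test of this mpirun outside Wien2k (a sketch;
> > .machine1 is assumed to be one of the machine files the parallel
> > scripts write in the case directory):
> >
> > /home/ice/proj/proj546/ogando/OpenMPIexec/bin/mpirun -np 2 -machinefile .machine1 hostname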
> >
> >
>
>
>
> --
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu 1-847-491-3996
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Györgyi