[Wien] Wien2k stopped working

Laurence Marks L-marks at northwestern.edu
Thu May 29 15:19:44 CEST 2014


Problems such as this are hard to help with remotely. It looks like
something has gone wrong at the system level, and my guess is that it
has one of two sources:

a. Something has gone wrong with your account/directories. It could be
as simple as your time allocation having expired, your password having
been hacked, or your .bashrc file having become corrupted. Check the
basics, e.g. that you can create a file, compile a simple program,
etc. While this is unlikely, you never know.

b. There have been OS changes "of some sort". Many sys_admins assume
that users just employ the software that is provided, often via
modules, and this is not compatible with how Wien2k runs. It may be
that they have removed some of the libraries that you linked Wien2k
against, or changed how the node list is provided to you (which may
break Machines2W). For instance, the overwriting of OMP_NUM_THREADS
suggests to me that someone has decided that "of course" you want to
run using openmpi, which at least at the moment is not useful to you.
(I know PB wants to change this, so this statement may change at some
point.)
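As a quick check, you can make the overwriting visible and force the
value inside the submission script itself (a sketch; exactly where the
export has to go depends on your site's module/profile setup):

```shell
# Sketch: pin OMP_NUM_THREADS inside the submission script.  Put the
# export *after* any "module load" lines, since a module or system
# profile may silently re-export the variable.
echo "OMP_NUM_THREADS before: ${OMP_NUM_THREADS:-unset}"
export OMP_NUM_THREADS=1
echo "OMP_NUM_THREADS after:  $OMP_NUM_THREADS"
```

If the "after" line still shows 12 in the job's standard output, the
system is overriding the variable later than your script runs.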

Run some diagnostics to work out what has happened, for instance:
* Compile something like "hello world" in both MPI and non-MPI
versions, then run each in a simple job.
* Write a small script to interrogate the environment when you start a
job, e.g. using commands such as "ldd $WIENROOT/lapw1_mpi" and "env |
grep -e MPI -e MKL", as well as obvious ones such as ulimit, "echo
$PATH", etc.
* Check the cluster web page; maybe they announced some changes.
* Try "ifort --version" and "which mpirun" (and similar commands) --
the compiler or MPI installation may be new.
* If you know a friendly sys_admin, ask them for general info. It is
good to nurture someone.
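The environment checks above can be collected into one small probe run
at the top of a batch job (a sketch; WIENROOT is assumed to be set by
your login files, and the log file name is arbitrary):

```shell
#!/bin/sh
# Sketch: probe the job environment and write everything to a log file
# that survives the job, so good and bad runs can be compared.
log="probe_$$.log"
{
  echo "host:  $(hostname)"
  echo "date:  $(date)"
  echo "PATH:  $PATH"
  env | grep -e MPI -e MKL -e OMP
  ulimit -a
  command -v mpirun && mpirun --version 2>&1 | head -1
  command -v ifort && ifort --version 2>&1 | head -1
  # Only meaningful if WIENROOT is set and lapw1_mpi exists:
  [ -n "$WIENROOT" ] && ldd "$WIENROOT/lapw1_mpi"
  echo "probe done"
} > "$log" 2>&1
echo "wrote $log"
```

Diffing the log from a failing job against one produced when things
still worked will often pinpoint exactly what the sys_admins changed.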

Of course, all of this may be totally wrong and you may have already
sorted things out.


On Wed, May 28, 2014 at 8:23 AM, Luis Ogando <lcodacal at gmail.com> wrote:
> Dear Wien2k community,
>
>    I have Wien2k 13.1 installed on an SGI cluster using ifort, icc and
> Open MPI. The installation was hard work (I would like to thank Prof.
> Laurence Marks again for his help), but since then I have used Wien2k
> without problems for several months.
>    I performed the first step of a long calculation and saved it in a
> different directory. When I tried the next step in the original directory,
> Wien2k crashed. After some tests, I decided to reinitialize the calculation
> from the beginning (in other words, to repeat the first step). To my
> surprise, I did not succeed even in this case, and I would like to know if
> someone has faced such an unexpected problem.
>    Please, find below some of the output files that I consider the most
> relevant ones.
>    Finally, I would like to stress some points:
>
> 1) lapw0 stops after roughly 7 minutes, whereas it took about 2 hours in
> the successful calculation.
>
> 2) lapw1 stops after 5 seconds without generating the case.energy_* files
> and case.dayfile does not contain the timing statistics for each processor.
>
> 3) OMP_NUM_THREADS=12 is overwritten by the system (in my .bashrc I have
> OMP_NUM_THREADS=1), but even when I export this variable equal to 1 in the
> submission script, I get the same crash.
>
>    Thank you very much for your attention,
>               Luis
> ===========================================================
> :log file
>
>>   (init_lapw) options:
> Wed Apr  2 14:07:30 BRT 2014> (x_lapw) nn -f InPzb15InPwurt3-V2
> Wed Apr  2 14:07:46 BRT 2014> (x) nn
> Wed Apr  2 14:08:03 BRT 2014> (x) sgroup
> Wed Apr  2 14:08:23 BRT 2014> (x) symmetry
> Wed Apr  2 14:08:48 BRT 2014> (x) lstart
> Wed Apr  2 14:09:38 BRT 2014> (x) kgen
> Wed Apr  2 14:09:58 BRT 2014> (x) dstart -c -p
>>   (initso_lapw) options:
> Tue May 27 16:07:00 BRT 2014> (x) Machines2W
>>   (run_lapw) options: -p -NI -ec 0.0001 -cc 0.0001 -i 150 -it
> Tue May 27 16:07:00 BRT 2014> (x) lapw0 -p
> Tue May 27 16:14:10 BRT 2014> (x) lapw1 -it -p -c
> Tue May 27 16:14:15 BRT 2014> (x) lapw2 -p -c
>
> ===========================================================
> case.dayfile
>
> Calculating InPzb15InPwurt3-V2 in
> /home/ice/proj/proj546/ogando/Wien/Calculos/InP/InPzbInPwurt/15camadasZB+3WZ/InPzb15InPwurt3-V2
> on r1i0n15 with PID 6538
> using WIEN2k_13.1 (Release 17/6/2013) in
> /home/ice/proj/proj546/ogando/Wien/Executaveis-13-OpenMPI
>
>
>     start (Tue May 27 16:07:00 BRT 2014) with lapw0 (150/99 to go)
>
>     cycle 1 (Tue May 27 16:07:00 BRT 2014) (150/99 to go)
>
>>   lapw0 -p (16:07:00) starting parallel lapw0 at Tue May 27 16:07:00 BRT
>> 2014
> -------- .machine0 : 12 processors
> 2540.314u 12.204s 7:09.36 594.4% 0+0k 180672+52736io 5pf+0w
>>   lapw1 -it -p   -c (16:14:10) starting parallel lapw1 at Tue May 27
>> 16:14:10 BRT 2014
> ->  starting parallel LAPW1 jobs at Tue May 27 16:14:10 BRT 2014
> running LAPW1 in parallel mode (using .machines)
> 12 number_of_parallel_jobs
>      r1i0n15(1)      r1i0n15(1)      r1i0n15(1)      r1i0n15(1)
> r1i0n15(1)      r1i0n15(1)      r1i0n15(1)      r1i0n15(1)      r1i0n15(1)
> r1i0n15(1)      r1i0n15(1)      r1i0n15(1)    Summary of lapw1para:
>    r1i0n15 k=1 user=0 wallclock=1
> 0.132u 0.136s 0:04.75 5.4% 0+0k 4104+1688io 5pf+0w
>>   lapw2 -p   -c   (16:14:15) running LAPW2 in parallel mode
> **  LAPW2 crashed!
> 0.396u 0.016s 0:00.66 60.6% 0+0k 6424+11472io 1pf+0w
> error: command
> /home/ice/proj/proj546/ogando/Wien/Executaveis-13-OpenMPI/lapw2cpara -c
> lapw2.def   failed
>
>>   stop error
>
> ===========================================================
> lapw2.error (the only non empty case.error)
>
> Error in LAPW2
>  'LAPW2' - can't open unit: 30
>  'LAPW2' -        filename: InPzb15InPwurt3-V2.energy_1
> **  testerror: Error in Parallel LAPW2
>
> ===========================================================
> The standard output file
>
>
> OMP_NUM_THREADS =  12
>
> -----------------------------------------
> Inicio do job: Tue May 27 16:07:00 BRT 2014
> Hostname:  r1i0n15
> PWD:
> /home/ice/proj/proj546/ogando/Wien/Calculos/InP/InPzbInPwurt/15camadasZB+3WZ/InPzb15InPwurt3-V2
> 0.000u 0.000s 0:00.05 0.0% 0+0k 8216+24io 1pf+0w
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
>  LAPW0 END
> grep: .processes: No such file or directory
> InPzb15InPwurt3-V2.scf1_1: No such file or directory.
> grep: No match.
> FERMI - Error
> cp: cannot stat `.in.tmp': No such file or directory
>
>>   stop error
> Final do job: Tue May 27 16:14:15 BRT 2014
> -----------------------------------------
>
> OMP_NUM_THREADS =  12
>
> =======================================
> My parallel_options file
>
> setenv TASKSET "no"
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "/home/ice/proj/proj546/ogando/OpenMPIexec/bin/mpirun -np
> _NP_ -machinefile _HOSTS_ _EXEC_"
>
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Györgyi

