<div dir="ltr">Dear Prof. Marks,<div><br></div><div style> Thank you very much for your comments.</div><div style> I suspect that "sys_adm" changed something they "assumed" harmless for users.</div>
<div style> I will follow your suggestions.</div><div style> All the best,</div><div style> Luis</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2014-05-29 10:19 GMT-03:00 Laurence Marks <span dir="ltr"><<a href="mailto:L-marks@northwestern.edu" target="_blank">L-marks@northwestern.edu</a>></span>:<br>
Problems such as this are hard to help with remotely. It looks like
something has gone wrong at the system level, and my guess is that it
has one of two sources:

a. Something has gone wrong with your account/directories. It could be
as simple as your time allocation having expired, your password having
been hacked, or your .bashrc file having become corrupted. Check the
basics, e.g. that you can create a file, compile a simple program, etc.
While this is unlikely, you never know.

b. There have been OS changes "of some sort". Many sys_admins assume
that users just employ the software that is provided, often using
modules, and this is not compatible with how Wien2k runs. It may be
that they have removed some of the libraries that you linked Wien2k
against, or changed how the node list is provided to you (which may
break Machines2W). For instance, the overwriting of OMP_NUM_THREADS
suggests to me that someone has decided that "of course" you want to
run using OpenMP, which at least at the moment is not useful to you.
(I know PB wants to change this, so at some point this statement may
change.)
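
As a concrete illustration only -- a minimal sketch, assuming a bash-style
submission script (your actual script, scheduler directives and module
loads may differ) -- one way to pin the variable and record what the job
really sees is:

  #!/bin/bash
  # Set the thread count *after* any module loads or system profiles have
  # run, so a site-wide default cannot silently override it later on.
  export OMP_NUM_THREADS=1
  # Record what the job environment actually contains at this point.
  echo "OMP_NUM_THREADS = $OMP_NUM_THREADS"
  env | grep -e OMP -e MPI -e MKL
  # ... then start the Wien2k run as usual, e.g. run_lapw -p ...

If the later output still reports 12 threads, whatever resets the variable
runs after this point, for instance inside the mpirun wrapper or a profile
sourced on the compute nodes.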

Run some diagnostics to work out what has happened, for instance:
* Compile something like "hello world" in both mpi and non-mpi versions,
then run it in a simple job.
* Write a small script to interrogate the environment when you start a
job, e.g. using commands such as "ldd $WIENROOT/lapw1_mpi" and "env |
grep -e MPI -e MKL", as well as obvious ones such as ulimit, "echo
$PATH", etc. (a sketch of such a script follows this list).
* Check the cluster web-page; maybe they announced some changes.
* Use "ifort --version" and similar, as well as "which mpirun" and
similar -- maybe they are new.
* If you know a friendly sys_admin, ask them for general info. It is
good to nurture someone.
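
To illustrate the second bullet, a minimal sketch of such a diagnostic
script (assuming bash and that WIENROOT is set in the job environment;
everything else is generic):

  #!/bin/bash
  # Dump the toolchain and environment that the batch job actually sees,
  # so it can be compared with what Wien2k was compiled and linked against.
  echo "== node and limits =="; hostname; ulimit -a
  echo "== PATH =="; echo $PATH
  echo "== MPI/MKL/OMP variables =="; env | grep -e MPI -e MKL -e OMP
  echo "== compiler and mpirun =="
  ifort --version 2>&1 | head -1
  which mpirun
  echo "== libraries lapw1_mpi resolves =="
  ldd $WIENROOT/lapw1_mpi

Run it once inside a batch job and once on the login node; a library
reported as "not found", or a different mpirun/ifort between the two,
points directly at what was changed on the system.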

Of course, all of this may be totally wrong and you may have already
sorted things out.


On Wed, May 28, 2014 at 8:23 AM, Luis Ogando <lcodacal@gmail.com> wrote:
> Dear Wien2k community,
>
> I have Wien2k 13.1 installed on an SGI cluster using ifort, icc and Open
> MPI. The installation was hard work (I would like to thank again the help
> from Prof. Laurence Marks), but since then I have used Wien2k without
> problems for several months.
> I performed the first step of a long calculation and saved it in a
> different directory. When I tried the next step in the original directory,
> Wien2k crashed. After some tests, I decided to reinitialize the calculation
> from the beginning (in other words, to repeat the first step). To my
> surprise, I did not succeed even in this case, and I would like to know if
> someone has faced such an unexpected problem.
> Please find below some of the output files that I consider the most
> relevant.
> Finally, I would like to stress some points:
>
> 1) lapw0 stops after roughly 7 minutes, but it took about 2 hours in
> the successful calculation.
>
> 2) lapw1 stops after 5 seconds without generating the case.energy_* files,
> and case.dayfile does not contain the time statistics for each processor.
>
> 3) OMP_NUM_THREADS=12 is overwritten by the system (in my .bashrc I have
> OMP_NUM_THREADS=1), but even when I export this variable set to 1 in the
> submission script, I get the same crash.
>
> Thank you very much for your attention,
> Luis
> ===========================================================
> :log file
>
>> (init_lapw) options:
> Wed Apr 2 14:07:30 BRT 2014> (x_lapw) nn -f InPzb15InPwurt3-V2
> Wed Apr 2 14:07:46 BRT 2014> (x) nn
> Wed Apr 2 14:08:03 BRT 2014> (x) sgroup
> Wed Apr 2 14:08:23 BRT 2014> (x) symmetry
> Wed Apr 2 14:08:48 BRT 2014> (x) lstart
> Wed Apr 2 14:09:38 BRT 2014> (x) kgen
> Wed Apr 2 14:09:58 BRT 2014> (x) dstart -c -p
>> (initso_lapw) options:
> Tue May 27 16:07:00 BRT 2014> (x) Machines2W
>> (run_lapw) options: -p -NI -ec 0.0001 -cc 0.0001 -i 150 -it
> Tue May 27 16:07:00 BRT 2014> (x) lapw0 -p
> Tue May 27 16:14:10 BRT 2014> (x) lapw1 -it -p -c
> Tue May 27 16:14:15 BRT 2014> (x) lapw2 -p -c
>
> ===========================================================
> case.dayfile
>
> Calculating InPzb15InPwurt3-V2 in
> /home/ice/proj/proj546/ogando/Wien/Calculos/InP/InPzbInPwurt/15camadasZB+3WZ/InPzb15InPwurt3-V2
> on r1i0n15 with PID 6538
> using WIEN2k_13.1 (Release 17/6/2013) in
> /home/ice/proj/proj546/ogando/Wien/Executaveis-13-OpenMPI
>
>
> start (Tue May 27 16:07:00 BRT 2014) with lapw0 (150/99 to go)
>
> cycle 1 (Tue May 27 16:07:00 BRT 2014) (150/99 to go)
>
>> lapw0 -p (16:07:00) starting parallel lapw0 at Tue May 27 16:07:00 BRT
>> 2014
> -------- .machine0 : 12 processors
> 2540.314u 12.204s 7:09.36 594.4% 0+0k 180672+52736io 5pf+0w
>> lapw1 -it -p -c (16:14:10) starting parallel lapw1 at Tue May 27
>> 16:14:10 BRT 2014
> -> starting parallel LAPW1 jobs at Tue May 27 16:14:10 BRT 2014
> running LAPW1 in parallel mode (using .machines)
> 12 number_of_parallel_jobs
> r1i0n15(1) r1i0n15(1) r1i0n15(1) r1i0n15(1)
> r1i0n15(1) r1i0n15(1) r1i0n15(1) r1i0n15(1) r1i0n15(1)
> r1i0n15(1) r1i0n15(1) r1i0n15(1) Summary of lapw1para:
> r1i0n15 k=1 user=0 wallclock=1
> 0.132u 0.136s 0:04.75 5.4% 0+0k 4104+1688io 5pf+0w
>> lapw2 -p -c (16:14:15) running LAPW2 in parallel mode
> ** LAPW2 crashed!
> 0.396u 0.016s 0:00.66 60.6% 0+0k 6424+11472io 1pf+0w
> error: command
> /home/ice/proj/proj546/ogando/Wien/Executaveis-13-OpenMPI/lapw2cpara -c
> lapw2.def failed
>
>> stop error
>
> ===========================================================
> lapw2.error (the only non-empty case.error)
>
> Error in LAPW2
> 'LAPW2' - can't open unit: 30
> 'LAPW2' - filename: InPzb15InPwurt3-V2.energy_1
> ** testerror: Error in Parallel LAPW2
>
> ===========================================================
> The standard output file
>
>
> OMP_NUM_THREADS = 12
>
> -----------------------------------------
> Inicio do job: Tue May 27 16:07:00 BRT 2014
> Hostname: r1i0n15
> PWD:
> /home/ice/proj/proj546/ogando/Wien/Calculos/InP/InPzbInPwurt/15camadasZB+3WZ/InPzb15InPwurt3-V2
> 0.000u 0.000s 0:00.05 0.0% 0+0k 8216+24io 1pf+0w
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> grep: .processes: No such file or directory
> InPzb15InPwurt3-V2.scf1_1: No such file or directory.
> grep: No match.
> FERMI - Error
> cp: cannot stat `.in.tmp': No such file or directory
>
>> stop error
> Final do job: Tue May 27 16:14:15 BRT 2014
> -----------------------------------------
>
> OMP_NUM_THREADS = 12
>
> =======================================
> My parallel_options file
>
> setenv TASKSET "no"
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "/home/ice/proj/proj546/ogando/OpenMPIexec/bin/mpirun -np
> _NP_ -machinefile _HOSTS_ _EXEC_"
>
>

--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgyi
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html