[Wien] Yes, the sys_admins changed some things
Laurence Marks
L-marks at northwestern.edu
Thu May 29 22:00:19 CEST 2014
The wonders of sys_admins! A very, very strange decision; I assume
that someone was abusing the system which led to the change.
You are going to have to change many things, not just your options.
The first thing to do is to find out what will work to launch a task on
another node without doing a login. It may be that you can use rsh or
something else, since (I assume) ssh is forbidden. Hopefully there is
something you can use, and then you only need to change the command that the
"remote" variable uses in lapwXpara, i.e. something like the following in
parallel_options:
setenv remote "Some Command"
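For instance, if rsh turned out to still be permitted (purely an assumption;
substitute whatever launcher the administrators do allow), this could be as
simple as
# hypothetical parallel_options fragment (csh syntax), assuming rsh works
setenv remote "rsh"
setenv USE_REMOTE 1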
If there is nothing, or they are less than helpful, you are going to
have to use something similar to pbsh in SRC_mpiutil/mpi_examples as a
replacement for ssh. This uses mpirun (any flavor should work, although the
options may differ) to create a csh on a remote node, in effect a
login-free ssh. (It can in fact be better than ssh, since ssh has some
bugs when things go wrong.) I assume that they have not broken the
conventional mpi launching of tasks on remote nodes.
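If it comes to that, a minimal sketch of such a wrapper (an illustration of
the idea only, not the actual pbsh script; the mpirun options differ between
MPI flavors) might be:
#!/bin/bash
# called as "pbsh <host> <command...>": run the command in a csh as a single
# mpi task on the named host, i.e. an ssh substitute that needs no login
host=$1
shift
mpirun -np 1 -host "$host" csh -c "$*"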
You may also need to do some more convolutions. For instance, on one
cluster I use, ssh can be badly broken, so in parallel_options I have
(impi version)
setenv WIEN_MPIRUN "mpirun -bootstrap-exec $WIENROOT/hopen -n _NP_
-machinefile _HOSTS_ _EXEC_ "
where hopen is the following file:
#!/bin/bash
# strip the "-x -q" flags (meant for ssh) that impi passes to its bootstrap
# command; what remains in $a is the target host followed by the command
a=`echo $@ | sed -e 's/-x -q//'`
# run that command as a single openmpi task on the named host
/software/mpi/openmpi-1.7.2-intel2013.2/bin/mpirun -np 1 --host $a
This uses openmpi to launch an mpi task for impi (Intel's version)
instead of ssh. Note that it edits the input string to change some
things from impi to openmpi options, which you may need to do as well.
On Thu, May 29, 2014 at 2:24 PM, Luis Ogando <lcodacal at gmail.com> wrote:
> Hi Gavin and Prof. Marks,
>
> The system administrators answered me and, yes, they changed some things:
>
> 1) Users are no longer allowed to log in to the nodes where jobs are run.
>
> 2) Running interactive remote jobs (from other nodes onto the nodes where
> jobs are run) is now forbidden.
>
> In other words, interactive access (login) and remote execution of
> commands are now forbidden.
> Do I have to change my parallel_options file because of this?
> All the best,
> Luis
>
>
>
>
> 2014-05-29 14:25 GMT-03:00 Gavin Abo <gsabo at crimson.ua.edu>:
>>
>> Dear Luis,
>>
>> A few comments off the mailing list.
>>
>> I agree with Prof. Marks that this problem is hard to help with remotely.
>>
>> I suppose it is okay, but in your :log file, I find it a little strange
>> that you have "initso_lapw", but it is not a "-so" (spin orbit) calculation
>> that you are running.
>>
>> I haven't used "-it" for any calculations recently. I doubt it causes the
>> problem, but you might try without it to see if it makes any difference.
>>
>> Is it just this calculation that does not work, or do other calculations,
>> with and without "-p", also fail to run?
>>
>> The first error message that I see is "grep: .processes: No such file or
>> directory". This likely means that there is no .processes file when there
>> is expected to be one. I believe the .processes file is created by
>> lapw1para_lapw. It looks like the .processes file is created from an 'awk'
>> of the .machines file.
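>>
>> Just to illustrate the kind of thing that does (this is not the exact
>> command used in lapw1para_lapw), an awk one-liner could pull the host field
>> out of the k-point lines of the form "1:host" in .machines like this:
>>
>> awk -F: '/^[0-9]/ {print $2}' .machines
>>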
>>
>> I see no error message for awk, so awk should be installed on the system.
>> However, it would be best to check that it is installed and working in a
>> terminal with the command:
>>
>> awk -W version
>>
>> I'm not familiar with Machines2W, but if I understand correctly, this is
>> creating the .machines file for your system.
>>
>> If I had the same problem, I would first try to open the .machines file in
>> the directory InPzb15InPwurt3-V2 of the current calculation with a text
>> editor to check if it looks okay or not. Then, I would try to compare it to
>> the .machines file of an earlier calculation that worked.
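>>
>> For reference, a healthy .machines file for a k-point parallel run plus an
>> mpi lapw0 on a single 12-core node might look something like the following
>> (the host name is purely illustrative, and the exact content depends on
>> what Machines2W writes):
>>
>> lapw0: r1i0n15:12
>> 1:r1i0n15
>> 1:r1i0n15
>> (... one "1:host" line per parallel lapw1/lapw2 job ...)
>> granularity:1
>> extrafine:1
>>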
>>
>> It might be a little helpful (but not a lot) for your current problem, but
>> note that in $WIENROOT/lapw1para_lapw you could change line 61 from:
>>
>> set debug = 0
>>
>> to
>>
>> set debug = 1
>>
>> to get some debug information for lapw1para_lapw.
>>
>> Kind regards,
>>
>> Gavin
>>
>>
>> On 5/29/2014 7:19 AM, Laurence Marks wrote:
>>>
>>> Problems such as this are hard to help with remotely. It looks like
>>> something has gone wrong at the system level, and my guess is that it
>>> has one of two sources:
>>>
>>> a. Something has gone wrong with your account/directories. It could be
>>> as simple as your time allocation having expired, your password having
>>> been hacked, or your .bashrc file having become corrupted. Check the
>>> basics, e.g. that you can create a file, compile a simple program, etc.
>>> While this is unlikely, you never know.
>>>
>>> b. There have been OS changes "of some sort". Many sys_admins assume
>>> that users just employ the software that is provided, often using
>>> modules, and this is not compatible with how Wien2k runs. It may be
>>> that they have removed some of the libraries that you linked Wien2k
>>> against, or changed how the node list is provided to you (which may
>>> break Machines2W). For instance, the overwriting of OMP_NUM_THREADS
>>> implies to me that someone has decided that "of course" you want to
>>> run multithreaded with OpenMP, which at least at the moment is not
>>> useful to you. (I know PB wants to change this, so sometime this
>>> statement may change.)
>>>
>>> Try some diagnostics to work out what has happened, for instance:
>>> * Compile something like "hello world" both mpi and non-mpi versions,
>>> then run it in a simple job.
>>> * Write a small script to interrogate the environment when you start a
>>> job, e.g. using commands such as "ldd $WIENROOT/lapw1_mpi", "env |
>>> grep -e MPI -e MKL", as well as obvious ones such as ulimit and "echo
>>> $PATH" (a sketch is given after this list).
>>> * Check the cluster web-page, maybe they announced some changes.
>>> * Use "ifort --version" and similar, as well as "which mpirun" and
>>> similar -- maybe new.
>>> * If you know a friendly sys_admin ask them for general info. It is
>>> good to nurture someone.
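>>>
>>> A minimal sketch of such an environment-interrogation script (plain bash,
>>> with $WIENROOT assumed to be set as usual; adapt the commands as needed):
>>>
>>> #!/bin/bash
>>> # record which node the job landed on and what it can actually see
>>> hostname
>>> echo "PATH = $PATH"
>>> ulimit -a
>>> env | grep -e MPI -e MKL
>>> which mpirun ifort
>>> # check that lapw1_mpi still resolves all of its shared libraries
>>> ldd $WIENROOT/lapw1_mpi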
>>>
>>> Of course, all of this may be totally wrong and you may have already
>>> sorted things out.
>>>
>>>
>>> On Wed, May 28, 2014 at 8:23 AM, Luis Ogando <lcodacal at gmail.com> wrote:
>>>>
>>>> Dear Wien2k community,
>>>>
>>>> I have Wien2k 13.1 installed on an SGI cluster using ifort, icc and
>>>> Open MPI. The installation was hard work (I would like to thank again
>>>> the help from Prof. Laurence Marks), but since then I have used Wien2k
>>>> without problems for several months.
>>>> I performed the first step of a long calculation and saved it in a
>>>> different directory. When I tried the next step in the original
>>>> directory, Wien2k crashed. After some tests, I decided to reinitialize
>>>> the calculation from the beginning (in other words, to repeat the first
>>>> step). To my surprise, I did not succeed even in this case, and I would
>>>> like to know if anyone has faced such an unexpected problem.
>>>> Please find below some of the output files that I consider most
>>>> relevant.
>>>> Finally, I would like to stress some points:
>>>>
>>>> 1) lapw0 stops after more or less 7 minutes, but it took about 2 hours
>>>> in the successful calculation.
>>>>
>>>> 2) lapw1 stops after 5 seconds without generating the case.energy_*
>>>> files, and case.dayfile does not contain the time statistics for each
>>>> processor.
>>>>
>>>> 3) OMP_NUM_THREADS=12 is overwritten by the system (in my .bashrc I have
>>>> OMP_NUM_THREADS=1), but even when I export this variable as 1 in the
>>>> submission script, I get the same crash.
>>>>
>>>> Thank you very much for your attention,
>>>> Luis
>>>> ===========================================================
>>>> :log file
>>>>
>>>>> (init_lapw) options:
>>>>
>>>> Wed Apr 2 14:07:30 BRT 2014> (x_lapw) nn -f InPzb15InPwurt3-V2
>>>> Wed Apr 2 14:07:46 BRT 2014> (x) nn
>>>> Wed Apr 2 14:08:03 BRT 2014> (x) sgroup
>>>> Wed Apr 2 14:08:23 BRT 2014> (x) symmetry
>>>> Wed Apr 2 14:08:48 BRT 2014> (x) lstart
>>>> Wed Apr 2 14:09:38 BRT 2014> (x) kgen
>>>> Wed Apr 2 14:09:58 BRT 2014> (x) dstart -c -p
>>>>>
>>>>> (initso_lapw) options:
>>>>
>>>> Tue May 27 16:07:00 BRT 2014> (x) Machines2W
>>>>>
>>>>> (run_lapw) options: -p -NI -ec 0.0001 -cc 0.0001 -i 150 -it
>>>>
>>>> Tue May 27 16:07:00 BRT 2014> (x) lapw0 -p
>>>> Tue May 27 16:14:10 BRT 2014> (x) lapw1 -it -p -c
>>>> Tue May 27 16:14:15 BRT 2014> (x) lapw2 -p -c
>>>>
>>>> ===========================================================
>>>> case.dayfile
>>>>
>>>> Calculating InPzb15InPwurt3-V2 in
>>>>
>>>> /home/ice/proj/proj546/ogando/Wien/Calculos/InP/InPzbInPwurt/15camadasZB+3WZ/InPzb15InPwurt3-V2
>>>> on r1i0n15 with PID 6538
>>>> using WIEN2k_13.1 (Release 17/6/2013) in
>>>> /home/ice/proj/proj546/ogando/Wien/Executaveis-13-OpenMPI
>>>>
>>>>
>>>> start (Tue May 27 16:07:00 BRT 2014) with lapw0 (150/99 to go)
>>>>
>>>> cycle 1 (Tue May 27 16:07:00 BRT 2014) (150/99 to go)
>>>>
>>>>> lapw0 -p (16:07:00) starting parallel lapw0 at Tue May 27 16:07:00
>>>>> BRT
>>>>> 2014
>>>>
>>>> -------- .machine0 : 12 processors
>>>> 2540.314u 12.204s 7:09.36 594.4% 0+0k 180672+52736io 5pf+0w
>>>>>
>>>>> lapw1 -it -p -c (16:14:10) starting parallel lapw1 at Tue May 27
>>>>> 16:14:10 BRT 2014
>>>>
>>>> -> starting parallel LAPW1 jobs at Tue May 27 16:14:10 BRT 2014
>>>> running LAPW1 in parallel mode (using .machines)
>>>> 12 number_of_parallel_jobs
>>>> r1i0n15(1) r1i0n15(1) r1i0n15(1) r1i0n15(1)
>>>> r1i0n15(1) r1i0n15(1) r1i0n15(1) r1i0n15(1)
>>>> r1i0n15(1)
>>>> r1i0n15(1) r1i0n15(1) r1i0n15(1) Summary of lapw1para:
>>>> r1i0n15 k=1 user=0 wallclock=1
>>>> 0.132u 0.136s 0:04.75 5.4% 0+0k 4104+1688io 5pf+0w
>>>>>
>>>>> lapw2 -p -c (16:14:15) running LAPW2 in parallel mode
>>>>
>>>> ** LAPW2 crashed!
>>>> 0.396u 0.016s 0:00.66 60.6% 0+0k 6424+11472io 1pf+0w
>>>> error: command
>>>> /home/ice/proj/proj546/ogando/Wien/Executaveis-13-OpenMPI/lapw2cpara -c
>>>> lapw2.def failed
>>>>
>>>>> stop error
>>>>
>>>> ===========================================================
>>>> lapw2.error (the only non empty case.error)
>>>>
>>>> Error in LAPW2
>>>> 'LAPW2' - can't open unit: 30
>>>> 'LAPW2' - filename: InPzb15InPwurt3-V2.energy_1
>>>> ** testerror: Error in Parallel LAPW2
>>>>
>>>> ===========================================================
>>>> The standard output file
>>>>
>>>>
>>>> OMP_NUM_THREADS = 12
>>>>
>>>> -----------------------------------------
>>>> Inicio do job: Tue May 27 16:07:00 BRT 2014
>>>> Hostname: r1i0n15
>>>> PWD:
>>>>
>>>> /home/ice/proj/proj546/ogando/Wien/Calculos/InP/InPzbInPwurt/15camadasZB+3WZ/InPzb15InPwurt3-V2
>>>> 0.000u 0.000s 0:00.05 0.0% 0+0k 8216+24io 1pf+0w
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> LAPW0 END
>>>> grep: .processes: No such file or directory
>>>> InPzb15InPwurt3-V2.scf1_1: No such file or directory.
>>>> grep: No match.
>>>> FERMI - Error
>>>> cp: cannot stat `.in.tmp': No such file or directory
>>>>
>>>>> stop error
>>>>
>>>> Final do job: Tue May 27 16:14:15 BRT 2014
>>>> -----------------------------------------
>>>>
>>>> OMP_NUM_THREADS = 12
>>>>
>>>> =======================================
>>>> My parallel_options file
>>>>
>>>> setenv TASKSET "no"
>>>> setenv USE_REMOTE 1
>>>> setenv MPI_REMOTE 0
>>>> setenv WIEN_GRANULARITY 1
>>>> setenv WIEN_MPIRUN "/home/ice/proj/proj546/ogando/OpenMPIexec/bin/mpirun
>>>> -np _NP_ -machinefile _HOSTS_ _EXEC_"
>>>>
>>>>
>>>
>>>
>>
>
--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgi