[Wien] "remotemachine: Undefined variable." coming from lapw1para_lapw and lapw2para_lapw

Laurence Marks L-marks at northwestern.edu
Sun Sep 23 18:36:27 CEST 2007


Three things:

1) Please diff the versions of lapw1para_lapw & lapw2para_lapw in a
freshly loaded $WIENROOT/SRC against the ones you have in $WIENROOT to
ensure that they are the same. The current version has the definition

                 set remotemachine = `head -1 .machine[$p]`

At least on my system, even if the file .machine1 (for instance) does
not exist, remotemachine is set to nothing, so it is hard to understand
why you are finding it to be an undefined variable.

2) Change what is in $WIENROOT/parallel_options to whatever is correct
for your system -- what is in this file is only supposed to be a
general framework, not something that works for all systems.

3) Look at the FAQ, http://www.wien2k.at/reg_user/faq/pbs.html, for
hints on how to configure for pbs or a similar system.
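
As a rough sketch of the check in point 1 (this is not part of the
WIEN2k distribution), the comparison could be scripted along the
following lines. The function name compare_para_scripts and its two
directory arguments are illustrative assumptions; the file names are
the two scripts named above, and the paths in the usage comment assume
SRC was unpacked under $WIENROOT.

```shell
#!/bin/sh
# Compare the installed parallel scripts with the copies in a freshly
# unpacked SRC directory. Hypothetical helper, not a WIEN2k tool.
#   $1 = directory holding the fresh copies (assumed: $WIENROOT/SRC)
#   $2 = directory holding the installed copies (assumed: $WIENROOT)
compare_para_scripts() {
    src_dir=$1
    inst_dir=$2
    status=0
    for f in lapw1para_lapw lapw2para_lapw; do
        # cmp -s is silent and returns non-zero if the files differ
        # or if either file cannot be read.
        if cmp -s "$src_dir/$f" "$inst_dir/$f"; then
            echo "$f: identical"
        else
            echo "$f: differs or missing"
            status=1
        fi
    done
    return $status
}

# Usage (paths are assumptions; adjust to your installation):
#   compare_para_scripts "$WIENROOT/SRC" "$WIENROOT"
```

If a file is reported as differing, running diff on that pair by hand
shows which lines changed.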

On 9/23/07, Steven Hahn <shahn at iastate.edu> wrote:
> Yes, I replaced SRC.tar.gz with the file from the website before
> running expand_lapw. Reconfiguring and compiling in a new directory
> (and changing my .bashrc settings) was simply to keep things as clean
> as possible. For the time being let's assume I have the latest files.
>
> The machine I am using is a 64 node opteron cluster with 2 dual-core
> processors per node. It uses openmpi 1.2.3 and openpbs (not sure of
> the version) for parallel execution and scheduling. I have also
> verified that passwordless ssh works between nodes. In "configure
> parallel execution" I said no to "Shared memory architecture", and ssh
> to "Remote shell". The "mpirun command" is
> mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_.
>
> I have already carefully analyzed the lapw1para file, and in my first
> message included the line numbers of the problem as well as a
> potential workaround. The problem is that remotemachine is never
> defined in lapw1para_lapw. My concern is both the cause of the
> original error and the correctness of this "fix."
>
> In the interest of testing, I tried reconfiguring WIEN2k for shared
> memory architecture, and ran on only one node. I also tried
> configuring and compiling the older 7.2 version with the same
> parallel setting that I give above for a distributed memory machine.
> In both cases I received the same error message from openmpi:
>
> [shahn@node050 test_case]$ x lapw1 -p -c
> starting parallel lapw1 at Sun Sep 23 00:50:02 CDT 2007
> ->  starting parallel LAPW1 jobs at Sun Sep 23 00:50:02 CDT 2007
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 27569
> [node050:27571] pls:tm: failed to poll for a spawned proc, return status = 17002
> [node050:27571] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
> [node050:27571] mpirun: spawn failed with errno=-11
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>       node050 node050 node050 node050(2) 0.035u 0.021s 0:00.13 38.4%     0+0k 0+0io 0pf+0w
> **  LAPW1 crashed!
> cat: No match.
> 0.070u 0.252s 0:03.66 8.7%      0+0k 0+0io 0pf+0w
> error: command   /home/shahn/software/WIEN2k_073/lapw1cpara -c lapw1.def   failed
>
> If I bypass the scheduler and ssh directly into the node, this same
> command completes without errors. Investigating this error message, I
> found that openmpi currently does not support the -machinefile option
> in our environment. While we may call it a bug, openmpi considers it
> to be a feature. There is a lengthy discussion at
> http://www.open-mpi.org/community/lists/users/2007/05/3184.php.
> Unfortunately, I don't have access to a machine with either a
> different flavor of mpi or a different scheduler. The
> simple workaround to this problem appears to be using version 7.3,
> which I assume calls mpirun as a remote command. However, that setup
> gives the error "remotemachine: Undefined variable" that this thread
> is all about.
>
> Steve
>
> On Sep 22, 2007, at 2:58 PM, Laurence Marks wrote:
>
> > My email got trapped by Wien2k listserver's size limit, so I am
> > resending.
> >
> > This is not a compilation issue, it is whether:
> > a) You have the correct lapw1para_lapw & lapw2para_lapw
> > b) You have correctly set up remote execution for your system
> >
> > There was a bug with an incorrect version in SRC which has been
> > corrected; the ones on the web work fine.
> >
> > If these do not work for you, please check that the "remote" variable
> > is set correctly in the "configure Parallel execution" part of
> > siteconfig.
> >
> > If you still are not getting anywhere, please add some debug lines to
> > whichever of lapw1para_lapw or lapw2para_lapw is giving problems so
> > you can trace where the issue is.
> >
> > N.B., to be completely clear, lapw1para_lapw and lapw2para_lapw are
> > copied from SRC during the installation, so if you replace SRC you
> > have of course to do this yourself or use the Wien2k install scripts
> > to do it for you.
> >
> > On 9/22/07, Steven Hahn <shahn at iastate.edu> wrote:
> >> I tried twice to compile in an empty directory and replacing
> >> SRC.tar.gz with the file from the web, but each time ran into the
> >> same problem I describe below. The latest WIEN2k_07.tar.gz and
> >> SRC.tar.gz from the website are dated August 17, 2007. Is this still
> >> the erroneous version? Would someone be willing to doublecheck that
> >> the latest source on the website runs correctly on their system?
> >>
> >> Steve
> >>
> >> On Sep 21, 2007, at 7:23 PM, Laurence Marks wrote:
> >>
> >>> I said SRC, i.e. SRC.tar.gz -- this is where lapw1para is originally.
> >>>
> >>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
> >>>> Thank you for your prompt reply and suggestion. I just downloaded
> >>>> SRC_lapw1.tar.gz from the website and diff shows it to be identical
> >>>> to the same file in WIEN2k_07.tar. I tried recompiling anyway, but
> >>>> received the same error message.
> >>>>
> >>>> On Sep 21, 2007, at 5:02 PM, Laurence Marks wrote:
> >>>>
> >>>>> Check lapw1para in SRC of what is currently on the web, i.e.
> >>>>> download just that directory -- there was an erroneous version
> >>>>> which I believe was corrected about a week ago.
> >>>>>
> >>>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
> >>>>>> Dear all,
> >>>>>>
> >>>>>> I am trying to set up the fine-grain parallel (mpi) version of
> >>>>>> WIEN2k 7.3 on our cluster. I successfully compiled the code, but
> >>>>>> received the error "remotemachine: Undefined variable." when
> >>>>>> executing "x lapw1 -p -c" on the test_case benchmark.
> >>>>>> Investigating this problem, I found that adding
> >>>>>> "set remotemachine = $machine[$p]" before line 497 of
> >>>>>> lapw1para_lapw allows the benchmark to complete. Testing the full
> >>>>>> iteration with run_lapw on a different case, I had to add the same
> >>>>>> line before line 315 of lapw2para_lapw for WIEN2k to finish
> >>>>>> without errors.
> >>>>>>
> >>>>>> I've tried recompiling everything a second time from the tar file,
> >>>>>> but the problem persists. I did notice that line 496 of
> >>>>>> lapw1para_lapw and line 314 of lapw2para_lapw
> >>>>>> (set remotemachine = `head -1 .machine[$p]`) are missing after the
> >>>>>> compilation. Adding this line by hand gives me a new error message
> >>>>>> (set: Variable name must begin with a letter). This problem is
> >>>>>> quickly spiraling beyond my familiarity with the program. Have
> >>>>>> others had problems with the mpi parallel code? Is this a bug in
> >>>>>> the code, and if not, what setting do I need to change? Is the
> >>>>>> workaround described above correct, or are there other files I
> >>>>>> need to change for proper operation?
> >>>>>>
> >>>>>> Steven
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Wien mailing list
> >>>>>> Wien at zeus.theochem.tuwien.ac.at
> >>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Laurence Marks
> >>>>> Department of Materials Science and Engineering
> >>>>> MSE Rm 2036 Cook Hall
> >>>>> 2220 N Campus Drive
> >>>>> Northwestern University
> >>>>> Evanston, IL 60208, USA
> >>>>> Tel: (847) 491-3996 Fax: (847) 491-7820
> >>>>> email: L-marks at northwestern dot edu
> >>>>> Web: www.numis.northwestern.edu
> >>>>> Commission on Electron Diffraction of IUCR
> >>>>> www.numis.northwestern.edu/IUCR_CED


-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Commission on Electron Diffraction of IUCR
www.numis.northwestern.edu/IUCR_CED

