[Wien] "remotemachine: Undefined variable." coming from lapw1para_lapw and lapw2para_lapw

Laurence Marks L-marks at northwestern.edu
Sun Sep 23 19:57:34 CEST 2007


Peter is probably the right person to respond about siteconfig_lapw --
I've never looked at it.

On 9/23/07, Steven Hahn <shahn at iastate.edu> wrote:
> 1) The problem is in siteconfig_lapw. If I manually copy
> lapw1para_lapw and lapw2para_lapw from $WIENROOT/SRC to $WIENROOT,
> "x lapw1 -p -c" and "run_lapw -p -i 1" complete normally. If I run
> ./siteconfig_lapw again and choose "Configure parallel execution", the
> error message about remotemachine returns. This ONLY happens with
> USE_REMOTE 1. If I set USE_REMOTE 0, I get the mpi error message
> discussed in my previous message.
>
> I believe the problem is lines 868 and 871 of siteconfig_lapw. If I
> type this line at the command line and compare the files, I get the
> result below. Note that this command is removing remotemachine!!!
>
> [shahn at cmp-cluster WIEN2k_073]$ sed -e "s/set remote.*"'$'"/set remote = $input/" <lapw1para_lapw >tmp
> [shahn at cmp-cluster WIEN2k_073]$ diff lapw1para_lapw tmp
> 31c31
> < set remote = ssh
> ---
>  > set remote =
> 496c496
> <                  set remotemachine = `head -1 .machine[$p]`
> ---
>  >                  set remote =
> [shahn at cmp-cluster WIEN2k_073]$
>
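
For what it is worth, the diff above is consistent with the pattern
being too broad rather than with a damaged source file: "set remote.*$"
matches any line containing "set remote", and the line
"set remotemachine = ..." is such a line. The sketch below reproduces
this on a two-line stand-in file and tries a narrower pattern that
requires whitespace and "=" directly after "remote". The stand-in file,
the temporary directory, the value ssh for $input, and the narrower
pattern itself are assumptions for illustration only; I have not checked
what siteconfig_lapw actually contains or how it sets $input.

```shell
#!/bin/sh
# Hypothetical two-line stand-in for lapw1para_lapw
# (the two lines that show up in the diff quoted above).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample" <<'EOF'
set remote = ssh
                 set remotemachine = `head -1 .machine[$p]`
EOF

# Assumed value; in the transcript above $input was evidently empty.
input=ssh

# Pattern as quoted above: "set remote.*$" also matches "set remotemachine ...".
sed -e "s/set remote.*"'$'"/set remote = $input/" \
    < "$tmpdir/sample" > "$tmpdir/broad"

# Narrower pattern (an assumption, not the shipped fix):
# only optional whitespace may sit between "remote" and "=".
sed -e "s/set remote[[:space:]]*=.*"'$'"/set remote = $input/" \
    < "$tmpdir/sample" > "$tmpdir/narrow"

# Count how many "set remotemachine" lines survive each substitution.
broad_count=$(grep -c 'set remotemachine' "$tmpdir/broad")
narrow_count=$(grep -c 'set remotemachine' "$tmpdir/narrow")
echo "remotemachine lines kept: broad=$broad_count narrow=$narrow_count"
```

If the same thing happens inside siteconfig_lapw, it would explain why
only the remotemachine definition disappears after "Configure parallel
execution", but that needs to be confirmed against the real script.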
>
> 2) Here's my parallel_options file:
> [shahn at cmp-cluster WIEN2k_073]$ cat parallel_options
> setenv USE_REMOTE 1
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
>
>
> 3) Right now I'm running everything as an interactive job to avoid
> the need to create a .machines file on the fly. Once WIEN2k is
> running properly, I'll work on the necessary script.
>
>
> On Sep 23, 2007, at 11:36 AM, Laurence Marks wrote:
>
> > Three things:
> >
> > 1) Please diff the version of lapw1para_lapw & lapw2para_lapw that are
> > in a freshly loaded $WIENROOT/SRC and the one you have in $WIENROOT to
> > ensure that they are the same. The current version has the definition
> >
> >                  set remotemachine = `head -1 .machine[$p]`
> >
> > At least on my system, even if the file .machine1 (for instance) does
> > not exist, remotemachine is set to an empty value, so it is hard to
> > understand why you find it to be an undefined variable.
> >
> > 2) Change what is in $WIENROOT/parallel_options to whatever is correct
> > for your system -- what is in this file is only supposed to be a
> > general framework, not something that works for all systems.
> >
> > 3) Look at the FAQ, http://www.wien2k.at/reg_user/faq/pbs.html for
> > hints to see how to configure for pbs or a similar system.
> >
> > On 9/23/07, Steven Hahn <shahn at iastate.edu> wrote:
> >> Yes, I replaced SRC.tar.gz with the file from the website before
> >> running expand_lapw. Reconfiguring and compiling in a new directory
> >> (and changing my .bashrc settings) was simply to keep things as clean
> >> as possible. For the time being let's assume I have the latest files.
> >>
> >> The machine I am using is a 64-node Opteron cluster with 2 dual-core
> >> processors per node. It uses openmpi 1.2.3 and openpbs (not sure of
> >> the version) for parallel execution and scheduling. I have also
> >> verified that passwordless ssh works between nodes. In "configure
> >> parallel execution" I said no to "Shared memory architecture" and
> >> ssh to "Remote shell". The "mpirun command" is
> >> mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_.
> >>
> >> I have already carefully analyzed the lapw1para file, and in my first
> >> message included the line numbers of the problem as well as a
> >> potential workaround. The problem is that remotemachine is never
> >> defined in lapw1para_lapw. My concern is both the cause of the
> >> original error and the correctness of this "fix."
> >>
> >> In the interest of testing, I tried reconfiguring WIEN2k for shared
> >> memory architecture, and ran on only one node. I also tried
> >> configuring and compiling the older 7.2 version with the same
> >> parallel settings that I gave above for a distributed memory machine.
> >> In both cases I received the same error message from openmpi:
> >>
> >> [shahn at node050 test_case]$ x lapw1 -p -c
> >> starting parallel lapw1 at Sun Sep 23 00:50:02 CDT 2007
> >> ->  starting parallel LAPW1 jobs at Sun Sep 23 00:50:02 CDT 2007
> >> running LAPW1 in parallel mode (using .machines)
> >> 1 number_of_parallel_jobs
> >> [1] 27569
> >> [node050:27571] pls:tm: failed to poll for a spawned proc, return status = 17002
> >> [node050:27571] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
> >> [node050:27571] mpirun: spawn failed with errno=-11
> >> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> >>       node050 node050 node050 node050(2) 0.035u 0.021s 0:00.13 38.4%     0+0k 0+0io 0pf+0w
> >> **  LAPW1 crashed!
> >> cat: No match.
> >> 0.070u 0.252s 0:03.66 8.7%      0+0k 0+0io 0pf+0w
> >> error: command   /home/shahn/software/WIEN2k_073/lapw1cpara -c lapw1.def   failed
> >>
> >> If I bypass the scheduler and ssh directly into the node, this same
> >> command completes without errors. Investigating this error message, I
> >> found that openmpi currently does not support the -machinefile option
> >> in our environment. While we may call it a bug, openmpi considers it
> >> to be a feature. There is a lengthy discussion at the following
> >> website:
> >> http://www.open-mpi.org/community/lists/users/2007/05/3184.php
> >> Unfortunately, I don't have access to a machine with either a
> >> different flavor of mpi or a different scheduler. The simple
> >> workaround to this problem appears to be using version 7.3, which I
> >> assume calls mpirun as a remote command. However, that setup gives
> >> the error "remotemachine: Undefined variable" that this thread is
> >> all about.
> >>
> >> Steve
> >>
> >> On Sep 22, 2007, at 2:58 PM, Laurence Marks wrote:
> >>
> >>> My email got trapped by Wien2k listserver's size limit, so I am
> >>> resending.
> >>>
> >>> This is not a compilation issue, it is whether:
> >>> a) You have the correct lapw1para_lapw & lapw2para_lapw
> >>> b) You have correctly set up remote execution for your system
> >>>
> >>> There was a bug with an incorrect version in SRC which has been
> >>> corrected; the ones on the web work fine.
> >>>
> >>> If these do not work for you, please check that the "remote"
> >>> variable is set correctly in the "configure Parallel execution"
> >>> part of siteconfig.
> >>>
> >>> If you still are not getting anywhere, please add some debug lines
> >>> to whichever of lapw1para_lapw or lapw2para_lapw is giving problems
> >>> so you can trace where the issue is.
> >>>
> >>> N.B., to be completely clear lapw1para_lapw and lapw2para_lapw are
> >>> copied from SRC during the installation, so if you replace SRC you
> >>> have of course to do this yourself or use the Wien2k install scripts
> >>> to do it for you.
> >>>
> >>> On 9/22/07, Steven Hahn <shahn at iastate.edu> wrote:
> >>>> I tried twice to compile in an empty directory, replacing
> >>>> SRC.tar.gz with the file from the web, but each time ran into the
> >>>> same problem I describe below. The latest WIEN2k_07.tar.gz and
> >>>> SRC.tar.gz from the website are dated August 17, 2007. Is this
> >>>> still the erroneous version? Would someone be willing to
> >>>> double-check that the latest source on the website runs correctly
> >>>> on their system?
> >>>>
> >>>> Steve
> >>>>
> >>>> On Sep 21, 2007, at 7:23 PM, Laurence Marks wrote:
> >>>>
> >>>>> I said SRC, i.e. SRC.tar.gz -- this is where lapw1para is
> >>>>> originally.
> >>>>>
> >>>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
> >>>>>> Thank you for your prompt reply and suggestion. I just downloaded
> >>>>>> SRC_lapw1.tar.gz from the website and diff shows it to be
> >>>>>> identical to the same file in WIEN2k_07.tar. I tried recompiling
> >>>>>> anyway, but received the same error message.
> >>>>>> On Sep 21, 2007, at 5:02 PM, Laurence Marks wrote:
> >>>>>>
> >>>>>>> Check lapw1para in SRC of what is currently on the web, i.e.
> >>>>>>> download just that directory -- there was an erroneous version
> >>>>>>> which I believe was corrected about a week ago.
> >>>>>>>
> >>>>>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
> >>>>>>>> Dear all,
> >>>>>>>>
> >>>>>>>> I am trying to set up the fine-grain parallel (mpi) version of
> >>>>>>>> WIEN2k 7.3 on our cluster. I successfully compiled the code, but
> >>>>>>>> received an error "remotemachine: Undefined variable." when
> >>>>>>>> executing "x lapw1 -p -c" on the test_case benchmark.
> >>>>>>>> Investigating this problem, I found that adding
> >>>>>>>> "set remotemachine = $machine[$p]" before line 497 of
> >>>>>>>> lapw1para_lapw allows the benchmark to complete. Testing the
> >>>>>>>> full iteration with run_lapw on a different case, I had to add
> >>>>>>>> the same line before line 315 of lapw2para_lapw for WIEN2k to
> >>>>>>>> finish without errors.
> >>>>>>>>
> >>>>>>>> I've tried recompiling everything a second time from the tar
> >>>>>>>> file, but the problem persists. I did notice that line 496 of
> >>>>>>>> lapw1para_lapw and line 314 of lapw2para_lapw
> >>>>>>>> (set remotemachine = `head -1 .machine[$p]`) are missing after
> >>>>>>>> the compilation. Adding this line by hand gives me a new error
> >>>>>>>> message (set: Variable name must begin with a letter). This
> >>>>>>>> problem is quickly spiraling beyond my familiarity with the
> >>>>>>>> program. Have others had problems with the mpi parallel code? Is
> >>>>>>>> this a bug in the code, and if not, what setting do I need to
> >>>>>>>> change? Is the workaround described above correct, or are there
> >>>>>>>> other files I need to change for proper operation?
> >>>>>>>>
> >>>>>>>> Steven
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Wien mailing list
> >>>>>>>> Wien at zeus.theochem.tuwien.ac.at
> >>>>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Laurence Marks
> >>>>>>> Department of Materials Science and Engineering
> >>>>>>> MSE Rm 2036 Cook Hall
> >>>>>>> 2220 N Campus Drive
> >>>>>>> Northwestern University
> >>>>>>> Evanston, IL 60208, USA
> >>>>>>> Tel: (847) 491-3996 Fax: (847) 491-7820
> >>>>>>> email: L-marks at northwestern dot edu
> >>>>>>> Web: www.numis.northwestern.edu
> >>>>>>> Commission on Electron Diffraction of IUCR
> >>>>>>> www.numis.northwestern.edu/IUCR_CED


-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Commission on Electron Diffraction of IUCR
www.numis.northwestern.edu/IUCR_CED

