[Wien] "remotemachine: Undefined variable." coming from lapw1para_lapw and lapw2para_lapw
Peter Blaha
pblaha at theochem.tuwien.ac.at
Sun Sep 23 20:52:51 CEST 2007
Hi,
You are right, siteconfig removes the remotemachine line. I never tried
changing "remote" since the new variable remotemachine was introduced.
The fix for siteconfig is simple: add a blank before the ".*", i.e.
sed -e "s/set remote .*"'$'"/set remote = $input/" <lapw1para_lapw >tmp
Thanks for the report.
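To illustrate why the blank matters, here is a minimal sketch. It uses a two-line stand-in file rather than the real lapw1para_lapw, and the file contents and the value of $input are assumptions made only for the demonstration:

```shell
#!/bin/sh
# Illustrative stand-in for lapw1para_lapw (NOT the real script).
input=ssh
sample=$(mktemp)
cat > "$sample" <<'EOF'
set remote = rsh
set remotemachine = `head -1 .machine[$p]`
EOF

# Old pattern: "set remote.*" also matches the "set remotemachine" line.
old=$(sed -e "s/set remote.*"'$'"/set remote = $input/" < "$sample")

# Fixed pattern: with the blank, only "set remote ..." lines match.
new=$(sed -e "s/set remote .*"'$'"/set remote = $input/" < "$sample")

echo "--- old pattern ---"
echo "$old"
echo "--- fixed pattern ---"
echo "$new"
rm -f "$sample"
```

With the old pattern both lines are rewritten to "set remote = ssh"; with the blank added, the remotemachine line is left untouched.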
You should set WIEN_MPIRUN to whatever your MPI version requires.
If it does not support "-machinefile file", don't use it, i.e. remove
this option.
On the other hand, openmpi must still somehow be told which nodes it
should run on, and you have to provide this information (either with
another switch to mpirun or by creating a default file, ...).
setenv WIEN_MPIRUN "mpirun -np _NP_ _EXEC_"
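As a rough illustration of what such a template turns into once the placeholders are filled in, consider the sketch below. The values used for _NP_ and _EXEC_ are made up, and the sed substitution only mimics what the parallel scripts are assumed to do; it is not taken from them:

```shell
#!/bin/sh
# Template as in WIEN_MPIRUN, without the -machinefile option.
template='mpirun -np _NP_ _EXEC_'

# Hypothetical values, chosen only for this example.
np=4
exec_cmd='lapw1c_mpi lapw1_1.def'

# Replace each placeholder with its value ("|" as delimiter keeps
# the substitution safe if the executable contains a "/").
cmd=$(printf '%s\n' "$template" | sed -e "s/_NP_/$np/" -e "s|_EXEC_|$exec_cmd|")
echo "$cmd"
```

The resulting command line carries no node list at all, which is why the node information then has to reach mpirun by some other route.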
Steven Hahn wrote:
> 1) The problem is in siteconfig_lapw. If I manually copy
> lapw1para_lapw and lapw2para_lapw from $WIENROOT/SRC to $WIENROOT,
> "x lapw1 -p -c" and "run_lapw -p -i 1" complete normally. If I run
> ./siteconfig_lapw again and choose "Configure parallel execution", the
> error message about remotemachine returns. This ONLY happens with
> USE_REMOTE 1. If I set USE_REMOTE 0, I get the mpi error message
> discussed in my previous message.
>
> I believe the problem is lines 868 and 871 of siteconfig_lapw. If I
> type this line on the command line and compare the files, I get the
> result below. Note that this command removes remotemachine!!!
>
> [shahn at cmp-cluster WIEN2k_073]$ sed -e "s/set remote.*"'$'"/set remote = $input/" <lapw1para_lapw >tmp
> [shahn at cmp-cluster WIEN2k_073]$ diff lapw1para_lapw tmp
> 31c31
> < set remote = ssh
> ---
> > set remote =
> 496c496
> < set remotemachine = `head -1 .machine[$p]`
> ---
> > set remote =
> [shahn at cmp-cluster WIEN2k_073]$
>
>
> 2) Here's my parallel_options file:
> [shahn at cmp-cluster WIEN2k_073]$ cat parallel_options
> setenv USE_REMOTE 1
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
>
>
> 3) Right now I'm running everything as an interactive job to avoid
> the need to create a .machines file on the fly. Once WIEN2k is
> running properly, I'll work on the necessary script.
>
>
> On Sep 23, 2007, at 11:36 AM, Laurence Marks wrote:
>
>> Three things:
>>
>> 1) Please diff the version of lapw1para_lapw & lapw2para_lapw that are
>> in a freshly loaded $WIENROOT/SRC and the one you have in $WIENROOT to
>> ensure that they are the same. The current version has the definition
>>
>> set remotemachine = `head -1 .machine[$p]`
>>
>> At least on my system even if the file .machine1 (for instance) does
>> not exist remotemachine is set to nothing so it is hard to understand
>> your finding it as an undefined variable.
>>
>> 2) Change what is in $WIENROOT/parallel_options to whatever is correct
>> for your system -- what is in this file is only supposed to be a
>> general framework, not something that works for all systems.
>>
>> 3) Look at the FAQ, http://www.wien2k.at/reg_user/faq/pbs.html for
>> hints to see how to configure for pbs or a similar system.
>>
>> On 9/23/07, Steven Hahn <shahn at iastate.edu> wrote:
>>> Yes, I replaced SRC.tar.gz with the file from the website before
>>> running expand_lapw. Reconfiguring and compiling in a new directory
>>> (and changing my .bashrc settings) was simply to keep things as clean
>>> as possible. For the time being let's assume I have the latest files.
>>>
>>> The machine I am using is a 64 node opteron cluster with 2 dual-core
>>> processors per node. It uses openmpi 1.2.3 and openpbs (not sure of
>>> the version) for parallel execution and scheduling. I have also
>>> verified that passwordless ssh works between nodes. In "configure
>>> parallel execution" I said no to "Shared memory architecture", and ssh
>>> to "Remote shell". The "mpirun command" is
>>> mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_.
>>>
>>> I have already carefully analyzed the lapw1para file, and in my first
>>> message included the line numbers of the problem as well as a
>>> potential workaround. The problem is that remotemachine is never
>>> defined in lapw1para_lapw. My concern is both the cause of the
>>> original error and the correctness of this "fix."
>>>
>>> In the interest of testing, I tried reconfiguring WIEN2k for shared
>>> memory architecture, and ran on only one node. I also tried
>>> configuring and compiling the older 7.2 version with the same
>>> parallel setting that I give above for a distributed memory machine.
>>> In both cases I received the same error message from openmpi:
>>>
>>> [shahn at node050 test_case]$ x lapw1 -p -c
>>> starting parallel lapw1 at Sun Sep 23 00:50:02 CDT 2007
>>> -> starting parallel LAPW1 jobs at Sun Sep 23 00:50:02 CDT 2007
>>> running LAPW1 in parallel mode (using .machines)
>>> 1 number_of_parallel_jobs
>>> [1] 27569
>>> [node050:27571] pls:tm: failed to poll for a spawned proc, return
>>> status = 17002
>>> [node050:27571] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c
>>> at line 462
>>> [node050:27571] mpirun: spawn failed with errno=-11
>>> [1] + Done ( cd $PWD; $t $ttt; rm -f .lock_
>>> $lockfile[$p] ) >> .time1_$loop
>>> node050 node050 node050 node050(2) 0.035u 0.021s 0:00.13
>>> 38.4% 0+0k 0+0io 0pf+0w
>>> ** LAPW1 crashed!
>>> cat: No match.
>>> 0.070u 0.252s 0:03.66 8.7% 0+0k 0+0io 0pf+0w
>>> error: command /home/shahn/software/WIEN2k_073/lapw1cpara -c
>>> lapw1.def failed
>>>
>>> If I bypass the scheduler and ssh directly into the node, this same
>>> command completes without errors. Investigating this error message, I
>>> found that openmpi currently does not support the -machinefile option
>>> in our environment. While we may call it a bug, openmpi considers it
>>> to be a feature. There is a lengthy discussion at the following
>>> website:
>>> http://www.open-mpi.org/community/lists/users/2007/05/3184.php
>>> Unfortunately, I don't have access to a machine
>>> with either a different flavor of mpi or a different scheduler. The
>>> simple workaround to this problem appears to be using version 7.3,
>>> which I assume calls mpirun as a remote command. However, that setup
>>> gives the error "remotemachine: Undefined variable" that this thread
>>> is all about.
>>>
>>> Steve
>>>
>>> On Sep 22, 2007, at 2:58 PM, Laurence Marks wrote:
>>>
>>>> My email got trapped by Wien2k listserver's size limit, so I am
>>>> resending.
>>>>
>>>> This is not a compilation issue, it is whether:
>>>> a) You have the correct lapw1para_lapw & lapw2para_lapw
>>>> b) You have correctly set up remote execution for your system
>>>>
>>>> There was a bug with an incorrect version in SRC which has been
>>>> corrected; the ones on the web work fine.
>>>>
>>>> If these do not work for you, please check that the "remote" variable
>>>> is set correctly in the "configure Parallel execution" part of
>>>> siteconfig.
>>>>
>>>> If you still are not getting anywhere, please add some debug lines to
>>>> whichever of lapw1para_lapw or lapw2para_lapw is giving problems so
>>>> you can trace where the issue is.
>>>>
>>>> N.B., to be completely clear lapw1para_lapw and lapw2para_lapw are
>>>> copied from SRC during the installation, so if you replace SRC you
>>>> have of course to do this yourself or use the Wien2k install scripts
>>>> to do it for you.
>>>>
>>>> On 9/22/07, Steven Hahn <shahn at iastate.edu> wrote:
>>>>> I tried twice to compile in an empty directory and replacing
>>>>> SRC.tar.gz with the file from the web, but each time ran into the
>>>>> same problem I describe below. The latest WIEN2k_07.tar.gz and
>>>>> SRC.tar.gz from the website are dated August 17, 2007. Is this still
>>>>> the erroneous version? Would someone be willing to double-check that
>>>>> the latest source on the website runs correctly on their system?
>>>>>
>>>>> Steve
>>>>>
>>>>> On Sep 21, 2007, at 7:23 PM, Laurence Marks wrote:
>>>>>
>>>>>> I said SRC, i.e. SRC.tar.gz -- this is where lapw1para is originally.
>>>>>>
>>>>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
>>>>>>> Thank you for your prompt reply and suggestion. I just downloaded
>>>>>>> SRC_lapw1.tar.gz from the website and diff shows it to be identical
>>>>>>> to the same file in WIEN2k_07.tar. I tried recompiling anyway, but
>>>>>>> received the same error message.
>>>>>>> On Sep 21, 2007, at 5:02 PM, Laurence Marks wrote:
>>>>>>>
>>>>>>>> Check lapw1para in SRC of what is currently on the web, i.e. download
>>>>>>>> just that directory -- there was an erroneous version which I believe
>>>>>>>> was corrected about a week ago.
>>>>>>>>
>>>>>>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> I am trying to set up the fine-grain parallel (mpi) version of WIEN2k
>>>>>>>>> 7.3 on our cluster. I successfully compiled the code, but received an
>>>>>>>>> error "remotemachine: Undefined variable." when executing
>>>>>>>>> "x lapw1 -p -c" on the test_case benchmark. Investigating this
>>>>>>>>> problem, I found that adding "set remotemachine = $machine[$p]"
>>>>>>>>> before line 497 of lapw1para_lapw allows the benchmark to complete.
>>>>>>>>> Testing the full iteration with run_lapw on a different case, I had
>>>>>>>>> to add the same line before line 315 of lapw2para_lapw for WIEN2k to
>>>>>>>>> finish without errors.
>>>>>>>>>
>>>>>>>>> I've tried recompiling everything a second time from the tar file,
>>>>>>>>> but the problem persists. I did notice that line 496 of
>>>>>>>>> lapw1para_lapw and line 314 of lapw2para_lapw
>>>>>>>>> (set remotemachine = `head -1 .machine[$p]`) are missing after the
>>>>>>>>> compilation. Adding this line by hand gives me a new error message
>>>>>>>>> (set: Variable name must begin with a letter). This problem is
>>>>>>>>> quickly spiraling beyond my familiarity with the program. Have others
>>>>>>>>> had problems with the mpi parallel code? Is this a bug in the code,
>>>>>>>>> and if not, what setting do I need to change? Is the workaround
>>>>>>>>> described above correct, or are there other files I need to change
>>>>>>>>> for proper operation?
>>>>>>>>>
>>>>>>>>> Steven
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Wien mailing list
>>>>>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Laurence Marks
>>>>>>>> Department of Materials Science and Engineering
>>>>>>>> MSE Rm 2036 Cook Hall
>>>>>>>> 2220 N Campus Drive
>>>>>>>> Northwestern University
>>>>>>>> Evanston, IL 60208, USA
>>>>>>>> Tel: (847) 491-3996 Fax: (847) 491-7820
>>>>>>>> email: L-marks at northwestern dot edu
>>>>>>>> Web: www.numis.northwestern.edu
>>>>>>>> Commission on Electron Diffraction of IUCR
>>>>>>>> www.numis.northwestern.edu/IUCR_CED