[Wien] "remotemachine: Undefined variable." coming from lapw1para_lapw and lapw2para_lapw

Steven Hahn shahn at iastate.edu
Sun Sep 23 19:48:06 CEST 2007


1) The problem is in siteconfig_lapw. If I manually copy
lapw1para_lapw and lapw2para_lapw from $WIENROOT/SRC to $WIENROOT,
"x lapw1 -p -c" and "run_lapw -p -i 1" complete normally. If I run
./siteconfig_lapw again and choose "Configure parallel execution", the
error message about remotemachine returns. This happens ONLY with
USE_REMOTE 1. If I set USE_REMOTE 0, I get the mpi error message
discussed in my previous message.

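I believe the problem is lines 868 and 871 of siteconfig_lapw. If I
type the sed command from those lines at the command line and diff the
output against the original file, I get the result below. Note that
this command is removing remotemachine!!! The pattern "set remote.*"
also matches the line "set remotemachine = ...", so that line is
overwritten as well.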

[shahn@cmp-cluster WIEN2k_073]$ sed -e "s/set remote.*"'$'"/set remote = $input/" <lapw1para_lapw >tmp
[shahn@cmp-cluster WIEN2k_073]$ diff lapw1para_lapw tmp
31c31
< set remote = ssh
---
 > set remote =
496c496
<                  set remotemachine = `head -1 .machine[$p]`
---
 >                  set remote =
[shahn@cmp-cluster WIEN2k_073]$
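
The over-match can be reproduced outside WIEN2k. What follows is only a minimal sketch, not the actual siteconfig_lapw code: the two sample lines are taken from the diff above, while the variable names (sample, broad, tight) and the tighter pattern are assumptions made for illustration. The idea is to require optional whitespace and an "=" directly after "set remote", so that "set remotemachine" no longer matches.

```shell
#!/bin/sh
# Two lines standing in for lapw1para_lapw (copied from the diff above).
sample='set remote = ssh
    set remotemachine = `head -1 .machine[$p]`'

# Stand-in for the value siteconfig_lapw would substitute (assumed here).
input=ssh

# Pattern as quoted in this message: it also matches "set remotemachine".
broad=$(printf '%s\n' "$sample" |
    sed -e "s/set remote.*"'$'"/set remote = $input/")

# Hypothetical tighter pattern: only "set remote" followed by "=" matches.
tight=$(printf '%s\n' "$sample" |
    sed -e "s/set remote[[:space:]]*=.*"'$'"/set remote = $input/")

printf '%s\n' "--- broad pattern ---" "$broad"
printf '%s\n' "--- tighter pattern ---" "$tight"
```

With the broad pattern both lines end up as "set remote = ssh"; with the tighter one the remotemachine line is left alone. Whether this is how the script should be fixed is for the WIEN2k developers to decide.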


2) Here's my parallel_options file:
[shahn@cmp-cluster WIEN2k_073]$ cat parallel_options
setenv USE_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
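
To make the placeholders in WIEN_MPIRUN concrete, here is a rough sketch of how such a template could be expanded into an actual command line. This is an illustration only: the values for np, hosts, and exec_cmd are hypothetical, and the way the WIEN2k parallel scripts really perform the substitution may differ.

```shell
#!/bin/sh
# Template as it appears in parallel_options.
template='mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_'

# Hypothetical values, chosen only for this example.
np=4
hosts=.machine1
exec_cmd='lapw1c_mpi lapw1_1.def'

# Replace each placeholder; "|" is the sed delimiter so that paths
# containing "/" would not need escaping.
command=$(printf '%s\n' "$template" |
    sed -e "s|_NP_|$np|" -e "s|_HOSTS_|$hosts|" -e "s|_EXEC_|$exec_cmd|")

printf '%s\n' "$command"
```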


3) Right now I'm running everything as an interactive job to avoid  
the need to create a .machines file on the fly. Once WIEN2k is  
running properly, I'll work on the necessary script.
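
For reference, the kind of script I have in mind might look roughly like the sketch below. It is untested on a real queue and rests on assumptions: that the scheduler provides a node list with one host name per slot (as $PBS_NODEFILE does under PBS), and that a single mpi job line of the form "1:host host ..." followed by granularity and extrafine lines is a valid .machines layout. The function name make_machines and the sample host names are made up for the example; please check the format against the user's guide and the FAQ.

```shell
#!/bin/sh
# make_machines NODEFILE
# Print a .machines file that runs one mpi job across every slot
# listed in NODEFILE (one host name per line, repeated per core).
make_machines() {
    # Join the host names into a single space-separated list.
    hosts=$(tr '\n' ' ' < "$1" | sed 's/ *$//')
    echo '#'
    echo "1:$hosts"
    echo 'granularity:1'
    echo 'extrafine:1'
}

# Demonstration with a hypothetical node list instead of a real job.
nodefile=$(mktemp)
printf '%s\n' node050 node050 node051 node051 > "$nodefile"
machines_text=$(make_machines "$nodefile")
rm -f "$nodefile"

printf '%s\n' "$machines_text"
```

Inside a batch job the call would presumably be make_machines "$PBS_NODEFILE" > .machines, run in the case directory before starting run_lapw -p.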


On Sep 23, 2007, at 11:36 AM, Laurence Marks wrote:

> Three things:
>
> 1) Please diff the version of lapw1para_lapw & lapw2para_lapw that are
> in a freshly loaded $WIENROOT/SRC and the one you have in $WIENROOT to
> ensure that they are the same. The current version has the definition
>
>                  set remotemachine = `head -1 .machine[$p]`
>
> At least on my system, even if the file .machine1 (for instance) does
> not exist, remotemachine is set to nothing, so it is hard to understand
> why you find it to be an undefined variable.
>
> 2) Change what is in $WIENROOT/parallel_options to whatever is correct
> for your system -- what is in this file is only supposed to be a
> general framework, not something that works for all systems.
>
> 3) Look at the FAQ, http://www.wien2k.at/reg_user/faq/pbs.html for
> hints to see how to configure for pbs or a similar system.
>
> On 9/23/07, Steven Hahn <shahn at iastate.edu> wrote:
>> Yes, I replaced SRC.tar.gz with the file from the website before
>> running expand_lapw. Reconfiguring and compiling in a new directory
>> (and changing my .bashrc settings) was simply to keep things as clean
>> as possible. For the time being, let's assume I have the latest files.
>>
>> The machine I am using is a 64-node Opteron cluster with 2 dual-core
>> processors per node. It uses openmpi 1.2.3 and openpbs (not sure of
>> the version) for parallel execution and scheduling. I have also
>> verified that passwordless ssh works between nodes. In "configure
>> parallel execution" I said no to "Shared memory architecture" and ssh
>> to "Remote shell". The "mpirun command" is
>> mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_.
>>
>> I have already carefully analyzed the lapw1para file, and in my first
>> message included the line numbers of the problem as well as a
>> potential workaround. The problem is that remotemachine is never
>> defined in lapw1para_lapw. My concern is both the cause of the
>> original error and the correctness of this "fix."
>>
>> In the interest of testing, I tried reconfiguring WIEN2k for shared
>> memory architecture, and ran on only one node. I also tried
>> configuring and compiling the older 7.2 version with the same
>> parallel settings that I give above for a distributed memory machine.
>> In both cases I received the same error message from openmpi:
>>
>> [shahn@node050 test_case]$ x lapw1 -p -c
>> starting parallel lapw1 at Sun Sep 23 00:50:02 CDT 2007
>> ->  starting parallel LAPW1 jobs at Sun Sep 23 00:50:02 CDT 2007
>> running LAPW1 in parallel mode (using .machines)
>> 1 number_of_parallel_jobs
>> [1] 27569
>> [node050:27571] pls:tm: failed to poll for a spawned proc, return status = 17002
>> [node050:27571] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
>> [node050:27571] mpirun: spawn failed with errno=-11
>> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>>       node050 node050 node050 node050(2) 0.035u 0.021s 0:00.13 38.4%     0+0k 0+0io 0pf+0w
>> **  LAPW1 crashed!
>> cat: No match.
>> 0.070u 0.252s 0:03.66 8.7%      0+0k 0+0io 0pf+0w
>> error: command   /home/shahn/software/WIEN2k_073/lapw1cpara -c lapw1.def   failed
>>
>> If I bypass the scheduler and ssh directly into the node, this same
>> command completes without errors. Investigating this error message, I
>> found that openmpi currently does not support the -machinefile option
>> in our environment. While we may call it a bug, openmpi considers it
>> to be a feature. There is a lengthy discussion at the following
>> website:
>> http://www.open-mpi.org/community/lists/users/2007/05/3184.php
>> Unfortunately, I don't have access to a machine with either a
>> different flavor of mpi or a different scheduler. The simple
>> workaround to this problem appears to be using version 7.3, which I
>> assume calls mpirun as a remote command. However, that setup gives
>> the error "remotemachine: Undefined variable" that this thread is all
>> about.
>>
>> Steve
>>
>> On Sep 22, 2007, at 2:58 PM, Laurence Marks wrote:
>>
>>> My email got trapped by Wien2k listserver's size limit, so I am
>>> resending.
>>>
>>> This is not a compilation issue, it is whether:
>>> a) You have the correct lapw1para_lapw & lapw2para_lapw
>>> b) You have correctly set up remote execution for your system
>>>
>>> There was a bug with an incorrect version in SRC which has been
>>> corrected; the ones on the web work fine.
>>>
>>> If these do not work for you, please check that the "remote" variable
>>> is set correctly in the "Configure parallel execution" part of
>>> siteconfig.
>>>
>>> If you still are not getting anywhere, please add some debug lines to
>>> whichever of lapw1para_lapw or lapw2para_lapw is giving problems so
>>> you can trace where the issue is.
>>>
>>> N.B., to be completely clear lapw1para_lapw and lapw2para_lapw are
>>> copied from SRC during the installation, so if you replace SRC you
>>> have of course to do this yourself or use the Wien2k install scripts
>>> to do it for you.
>>>
>>> On 9/22/07, Steven Hahn <shahn at iastate.edu> wrote:
>>>> I tried twice to compile in an empty directory, replacing
>>>> SRC.tar.gz with the file from the web, but each time ran into the
>>>> same problem I describe below. The latest WIEN2k_07.tar.gz and
>>>> SRC.tar.gz from the website are dated August 17, 2007. Is this still
>>>> the erroneous version? Would someone be willing to double-check that
>>>> the latest source on the website runs correctly on their system?
>>>>
>>>> Steve
>>>>
>>>> On Sep 21, 2007, at 7:23 PM, Laurence Marks wrote:
>>>>
>>>>> I said SRC, i.e. SRC.tar.gz -- this is where lapw1para originally
>>>>> resides.
>>>>>
>>>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
>>>>>> Thank you for your prompt reply and suggestion. I just downloaded
>>>>>> SRC_lapw1.tar.gz from the website and diff shows it to be identical
>>>>>> to the same file in WIEN2k_07.tar. I tried recompiling anyway, but
>>>>>> received the same error message.
>>>>>>
>>>>>> On Sep 21, 2007, at 5:02 PM, Laurence Marks wrote:
>>>>>>
>>>>>>> Check lapw1para in SRC of what is currently on the web, i.e.
>>>>>>> download just that directory -- there was an erroneous version
>>>>>>> which I believe was corrected about a week ago.
>>>>>>>
>>>>>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> I am trying to set up the fine-grain parallel (mpi) version of
>>>>>>>> WIEN2k 7.3 on our cluster. I successfully compiled the code, but
>>>>>>>> received an error "remotemachine: Undefined variable." when
>>>>>>>> executing "x lapw1 -p -c" on the test_case benchmark.
>>>>>>>> Investigating this problem, I found that adding
>>>>>>>> "set remotemachine = $machine[$p]" before line 497 of
>>>>>>>> lapw1para_lapw allows the benchmark to complete. Testing the full
>>>>>>>> iteration with run_lapw on a different case, I had to add the
>>>>>>>> same line before line 315 of lapw2para_lapw for WIEN2k to finish
>>>>>>>> without errors.
>>>>>>>>
>>>>>>>> I've tried recompiling everything a second time from the tar
>>>>>>>> file, but the problem persists. I did notice that line 496 of
>>>>>>>> lapw1para_lapw and line 314 of lapw2para_lapw
>>>>>>>> (set remotemachine = `head -1 .machine[$p]`) are missing after the
>>>>>>>> compilation. Adding this line by hand gives me a new error
>>>>>>>> message (set: Variable name must begin with a letter). This
>>>>>>>> problem is quickly spiraling beyond my familiarity with the
>>>>>>>> program. Have others had problems with the mpi parallel code? Is
>>>>>>>> this a bug in the code, and if not, what setting do I need to
>>>>>>>> change? Is the workaround described above correct, or are there
>>>>>>>> other files I need to change for proper operation?
>>>>>>>>
>>>>>>>> Steven
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Wien mailing list
>>>>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Laurence Marks
>>>>>>> Department of Materials Science and Engineering
>>>>>>> MSE Rm 2036 Cook Hall
>>>>>>> 2220 N Campus Drive
>>>>>>> Northwestern University
>>>>>>> Evanston, IL 60208, USA
>>>>>>> Tel: (847) 491-3996 Fax: (847) 491-7820
>>>>>>> email: L-marks at northwestern dot edu
>>>>>>> Web: www.numis.northwestern.edu
>>>>>>> Commission on Electron Diffraction of IUCR
>>>>>>> www.numis.northwestern.edu/IUCR_CED


