[Wien] "remotemachine: Undefined variable." coming from lapw1para_lapw and lapw2para_lapw
Steven Hahn
shahn at iastate.edu
Sun Sep 23 16:35:30 CEST 2007
Yes, I replaced SRC.tar.gz with the file from the website before
running expand_lapw. Reconfiguring and compiling in a new directory
(and changing my .bashrc settings) was simply to keep things as clean
as possible. For the time being, let's assume I have the latest files.
The machine I am using is a 64-node Opteron cluster with 2 dual-core
processors per node. It uses openmpi 1.2.3 and openpbs (not sure of
the version) for parallel execution and scheduling. I have also
verified that passwordless ssh works between nodes. In "configure
parallel execution" I said no to "Shared memory architecture" and ssh
to "Remote shell". The "mpirun command" is
mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_.
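To make the template concrete, here is a minimal sketch of how the _NP_,
_HOSTS_ and _EXEC_ placeholders might be filled in before the command is
run. It is written in POSIX sh, not the csh of the WIEN2k scripts, and the
values 4, .machine1 and "lapw1c_mpi lapw1_1.def" are hypothetical, chosen
only for illustration:

```sh
#!/bin/sh
# Template as entered in siteconfig ("mpirun command").
template='mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_'

# Hypothetical values, for illustration only.
np=4
hosts='.machine1'
exec_cmd='lapw1c_mpi lapw1_1.def'

# Substitute each placeholder; "|" is the sed delimiter so that
# values containing "/" would not break the expression.
cmd=$(printf '%s\n' "$template" |
    sed -e "s|_NP_|$np|" -e "s|_HOSTS_|$hosts|" -e "s|_EXEC_|$exec_cmd|")

echo "$cmd"
```

The expanded line is what actually reaches mpirun, so it is the one to
compare against what works when typed by hand on a node.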
I have already carefully analyzed the lapw1para file, and in my first
message included the line numbers of the problem as well as a
potential workaround. The problem is that remotemachine is never
defined in lapw1para_lapw. My concern is both the cause of the
original error and the correctness of this "fix."
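For anyone who wants to check their own copy, one possible way is to
count the lines that assign the variable against the lines that read it.
This is only a sketch in POSIX sh; it runs on a small stand-in file whose
three lines are hypothetical and are not taken from the real
lapw1para_lapw:

```sh
#!/bin/sh
# Stand-in for the real script; these three lines are hypothetical.
script=$(mktemp)
cat > "$script" <<'EOF'
set p = 1
set machine = (node050)
echo "running on $remotemachine"
EOF

# Count csh assignments of the form "set remotemachine = ...".
# grep -c exits non-zero when the count is 0, hence "|| true".
assigned=$(grep -c '^[[:space:]]*set[[:space:]][[:space:]]*remotemachine' "$script" || true)
# Count lines that read the variable.
used=$(grep -c '\$remotemachine' "$script" || true)

echo "assigned: $assigned, used: $used"
rm -f "$script"
```

A count of zero assignments with at least one use is the pattern that
would produce an "Undefined variable" error in csh. To check an installed
script, point the two grep commands at that file instead of the stand-in.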
In the interest of testing, I tried reconfiguring WIEN2k for shared
memory architecture, and ran on only one node. I also tried
configuring and compiling the older 7.2 version with the same
parallel settings that I gave above for a distributed memory machine.
In both cases I received the same error message from openmpi:
[shahn@node050 test_case]$ x lapw1 -p -c
starting parallel lapw1 at Sun Sep 23 00:50:02 CDT 2007
-> starting parallel LAPW1 jobs at Sun Sep 23 00:50:02 CDT 2007
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
[1] 27569
[node050:27571] pls:tm: failed to poll for a spawned proc, return status = 17002
[node050:27571] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
[node050:27571] mpirun: spawn failed with errno=-11
[1] + Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
node050 node050 node050 node050(2) 0.035u 0.021s 0:00.13 38.4% 0+0k 0+0io 0pf+0w
** LAPW1 crashed!
cat: No match.
0.070u 0.252s 0:03.66 8.7% 0+0k 0+0io 0pf+0w
error: command /home/shahn/software/WIEN2k_073/lapw1cpara -c lapw1.def failed
If I bypass the scheduler and ssh directly into the node, this same
command completes without errors. Investigating this error message, I
found that openmpi currently does not support the -machinefile option
in our environment. While we may call it a bug, openmpi considers it
to be a feature. There is a lengthy discussion at the following
website: http://www.open-mpi.org/community/lists/users/2007/05/3184.php
Unfortunately, I don't have access to a machine with either a
different flavor of mpi or a different scheduler. The simple
workaround to this problem appears to be using version 7.3, which I
assume calls mpirun as a remote command. However, that setup gives the
error "remotemachine: Undefined variable" that this thread is all
about.
Steve
On Sep 22, 2007, at 2:58 PM, Laurence Marks wrote:
> My email got trapped by Wien2k listserver's size limit, so I am
> resending.
>
> This is not a compilation issue, it is whether:
> a) You have the correct lapw1para_lapw & lapw2para_lapw
> b) You have correctly setup remote execution for your system
>
> There was a bug with an incorrect version in SRC which has been
> corrected; the ones on the web work fine.
>
> If these do not work for you, please check that the "remote" variable
> is set correctly, the "configure Parallel execution" part of
> siteconfig.
>
> If you still are not getting anywhere, please add some debug lines to
> whichever of lapw1para_lapw or lapw2para_lapw is giving problems so
> you can trace where the issue is.
>
> N.B., to be completely clear lapw1para_lapw and lapw2para_lapw are
> copied from SRC during the installation, so if you replace SRC you
> have of course to do this yourself or use the Wien2k install scripts
> to do it for you.
>
> On 9/22/07, Steven Hahn <shahn at iastate.edu> wrote:
>> I tried twice to compile in an empty directory and replacing
>> SRC.tar.gz with the file from the web, but each time ran into the
>> same problem I describe below. The latest WIEN2k_07.tar.gz and
>> SRC.tar.gz from the website are dated August 17, 2007. Is this still
>> the erroneous version? Would someone be willing to doublecheck that
>> the latest source on the website runs correctly on their system?
>>
>> Steve
>>
>> On Sep 21, 2007, at 7:23 PM, Laurence Marks wrote:
>>
>>> I said SRC, i.e. SRC.tar.gz -- this is where lapw1para is
>>> originally.
>>>
>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
>>>> Thank you for your prompt reply and suggestion. I just downloaded
>>>> SRC_lapw1.tar.gz from the website and diff shows it to be identical
>>>> to the same file in WIEN2k_07.tar. I tried recompiling anyway, but
>>>> received the same error message.
>>>> On Sep 21, 2007, at 5:02 PM, Laurence Marks wrote:
>>>>
>>>>> Check lapw1para in SRC of what is currently on the web, i.e.
>>>>> download just that directory -- there was an erroneous version
>>>>> which I believe was corrected about a week ago
>>>>>
>>>>> On 9/21/07, Steven Hahn <shahn at iastate.edu> wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I am trying to setup the fine-grain parallel (mpi) version of
>>>>>> WIEN2k 7.3 on our cluster. I successfully compiled the code, but
>>>>>> received an error "remotemachine: Undefined variable." when
>>>>>> executing "x lapw1 -p -c" on the test_case benchmark.
>>>>>> Investigating this problem I found that adding
>>>>>> "set remotemachine = $machine[$p]" before line 497 of
>>>>>> lapw1para_lapw allows the benchmark to complete. Testing the full
>>>>>> iteration with run_lapw on a different case, I had to add the
>>>>>> same line before line 315 of lapw2para_lapw for WIEN2k to finish
>>>>>> without errors.
>>>>>>
>>>>>> I've tried recompiling everything a second time from the tar
>>>>>> file, but the problem persists. I did notice that line 496 of
>>>>>> lapw1para_lapw and line 314 of lapw2para_lapw
>>>>>> (set remotemachine = `head -1 .machine[$p]`) is missing after the
>>>>>> compilation. Adding this line by hand gives me a new error
>>>>>> message (set: Variable name must begin with a letter). This
>>>>>> problem is quickly spiraling beyond my familiarity with the
>>>>>> program. Have others had problems with the mpi parallel code? Is
>>>>>> this a bug in the code, and if not what setting do I need to
>>>>>> change? Is the workaround described above correct, or are there
>>>>>> other files I need to change for proper operation?
>>>>>>
>>>>>> Steven
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Wien mailing list
>>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Laurence Marks
>>>>> Department of Materials Science and Engineering
>>>>> MSE Rm 2036 Cook Hall
>>>>> 2220 N Campus Drive
>>>>> Northwestern University
>>>>> Evanston, IL 60208, USA
>>>>> Tel: (847) 491-3996 Fax: (847) 491-7820
>>>>> email: L-marks at northwestern dot edu
>>>>> Web: www.numis.northwestern.edu
>>>>> Commission on Electron Diffraction of IUCR
>>>>> www.numis.northwestern.edu/IUCR_CED