[Wien] MIXER runtime error + solution on Mac OS X

Kevin Jorissen kevinjorissenpdx at gmail.com
Wed Sep 3 02:04:09 CEST 2014


Another update to this thread.

First, Jianxin Zhu did some more tests and here are the findings:


###############################################################
I have a few items to share ---

A) ---
I compared the total energy from runs with version 13.1 and version 12.1
on my two Mac OS X boxes (under the same conditions). The values of the
total energy are, respectively,

:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69826119
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69826094
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69826095
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69826079
with version 13.1,
and
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69576023
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69576000
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69576000
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69576001
with version 12.1

I have also compared the values of total energy from running with the same
version 13.1 on my linux cluster (there without the option -heap-arrays in
the compilation).
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69827763
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69827704
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69827626
:ENE  : ********** TOTAL ENERGY IN Ry =      -244362.69827588

[I am using the following convergence criterion for benchmark ---
 run_lapw -ec 0.000001 -i 40 -NI -p]
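
The size of these differences relative to the convergence criterion can be worked out with a few lines of C. This is only an illustrative sketch: the function names are hypothetical, and taking the last listed :ENE value of each run as that run's final energy is an assumption.

```c
#include <stdio.h>

/* Hypothetical helper: difference between two total energies, in Ry. */
double energy_diff(double e_a, double e_b)
{
    return e_a - e_b;
}

/* Print the differences between the last :ENE values quoted above,
 * next to the -ec 0.000001 convergence criterion. */
void print_energy_comparison(void)
{
    const double e_mac_13   = -244362.69826079; /* Mac, version 13.1   */
    const double e_mac_12   = -244362.69576001; /* Mac, version 12.1   */
    const double e_linux_13 = -244362.69827588; /* Linux, version 13.1 */
    const double ec         = 0.000001;         /* -ec value, in Ry    */

    printf("13.1 - 12.1 on Mac : %+.8f Ry\n",
           energy_diff(e_mac_13, e_mac_12));
    printf("Mac - Linux, 13.1  : %+.8f Ry\n",
           energy_diff(e_mac_13, e_linux_13));
    printf("criterion (-ec)    :  %.8f Ry\n", ec);
}
```

With these inputs the 13.1 vs. 12.1 difference on Mac is about 2.5e-3 Ry and the Mac vs. Linux difference for 13.1 is about 1.5e-5 Ry, both larger than the 1e-6 Ry criterion.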

B) ---
Without the option -heap-arrays, if you print out the matrices just before
the call to the subroutine NormS in qmix8.F, you should be able to see that
some part of the matrices (I have had no time to identify which of them)
contains NaN.
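
The kind of check described in B can be sketched as follows. The mixer itself is Fortran, so this C version is only an illustration of the idea; the function names and the flat-array layout are assumptions, not anything taken from qmix8.F.

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Count the NaN entries in a flat array of n doubles. */
size_t count_nans(const double *a, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (isnan(a[i]))
            count++;
    }
    return count;
}

/* Report a named array if it contains NaN; returns the NaN count. */
size_t report_nans(const char *name, const double *a, size_t n)
{
    size_t count = count_nans(a, n);
    if (count > 0)
        printf("%s: %zu of %zu entries are NaN\n", name, count, n);
    return count;
}
```

Calling such a check on each array just before the suspect routine narrows down which input is already corrupted at that point.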

C) ---
I also compared the W2kutils.c in version 12.1 and that in version 13.1.

The following lines

#ifdef __APPLE__
    limit.rlim_cur = limit.rlim_max ; /* RLIM_INFINITY */
#else
    limit.rlim_cur = RLIM_INFINITY ;
#endif

are in version 13.1; while

    /* Set to the maximum we can */
    limit.rlim_cur = limit.rlim_max;    /* limit.rlim_max; RLIM_INFINITY;*/

are in version 12.1, which I put in a few years ago.
Assuming that __APPLE__ is defined automatically, there is really no
difference between the two that could cause the problem we currently have.
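
For context, the assignment quoted above sits between a getrlimit and a setrlimit call in the usual POSIX pattern. The sketch below is not the W2kutils.c routine; the function name is hypothetical, and it always takes the "set to the maximum we can" route from the 12.1 lines.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper, not the W2kutils.c routine: raise the soft stack
 * limit to the hard limit. Returns 0 on success, -1 on error. */
int raise_stack_limit(void)
{
    struct rlimit limit;

    if (getrlimit(RLIMIT_STACK, &limit) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* Set to the maximum we can, as in the 12.1 lines quoted above.
     * Asking for more than rlim_max would make setrlimit fail. */
    limit.rlim_cur = limit.rlim_max;

    if (setrlimit(RLIMIT_STACK, &limit) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Because the soft limit can never exceed the hard limit, requesting RLIM_INFINITY only succeeds where the hard limit is itself unlimited, which is presumably why the 13.1 code treats the two cases separately.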


###############################################################
Second, I add the following comments:
* seeing NaN sounds pretty serious

* I'm a bit mystified by this W2kutils stuff.  I don't think that "APPLE"
is activated by siteconfig in any way??  Should I add something like
"-DAPPLE" to the FOPT in my Makefiles?

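For reference, the macro tested in W2kutils.c is __APPLE__ (with double underscores), which GCC and Clang predefine themselves when targeting Mac OS X; whether every compiler used to build W2kutils.c does the same is an assumption worth checking. A minimal sketch of such a check, with hypothetical function names:

```c
#include <stdio.h>

/* Returns 1 if the compiler predefines __APPLE__, 0 otherwise. */
int apple_macro_defined(void)
{
#ifdef __APPLE__
    return 1;
#else
    return 0;
#endif
}

/* Print which branch of an "#ifdef __APPLE__" block would be compiled. */
void print_apple_branch(void)
{
    if (apple_macro_defined())
        printf("__APPLE__ is defined: the Apple branch is compiled\n");
    else
        printf("__APPLE__ is not defined: the #else branch is compiled\n");
}
```

Compiling this with the same C compiler and flags used for W2kutils.c shows which branch is active without editing any Makefile.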



Cheers,


Kevin



On Sun, Aug 31, 2014 at 11:23 PM, Laurence Marks <L-marks at northwestern.edu>
wrote:

> Dear Kevin,
>
> No problem with your email. All large codes have bugs, and sometimes I
> write sloppy code. I do try and keep mixer as free of bugs as I can since I
> wrote the multisecant algorithms.
>
> Listing the W2kutils issue - good idea, hint to Peter.
>
> ___________________________
>
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu 1-847-491-3996
>
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought"
> Albert Szent-Gyorgi
>
> On Sep 1, 2014 1:13 AM, "Kevin Jorissen" <kevinjorissenpdx at gmail.com>
> wrote:
>
>>  Hi Laurence,
>>
>>  thanks for your comments.
>>
>>  I hope I didn't call the issue we observed a code bug -- I meant to use
>> unsensational language and avoid assumptions.  For sure this could be a
>> problem on the Mac side or in ifort (we all know these exist).  I haven't
>> edited the W2kutils.  But didn't we fix the Mac problems with that file a
>> few years ago?  In any case, I'm not using MPI and stacksize is set to
>> unlimited in my shell startup file, so I doubt this is the culprit.  Or
>> could the W2kutils somehow override my shell startup configuration?
>>
>>  It's probably not urgent since we have a remedy that will do for now.
>>  If you can think of any tests you'd like to see done on Mac, let us know.
>>
>>  By the way, this W2kutils thing is ***NOT*** on the list of known
>> issues and bugs on the WIEN2k website.  It would be very, very valuable and
>> time-saving if that list could be updated to reflect the knowledge inside
>> the experts' heads.
>>
>>  Cheers,
>>
>>  Kevin
>>
>>
>>
>> On Sun, Aug 31, 2014 at 10:47 PM, Laurence Marks <
>> L-marks at northwestern.edu> wrote:
>>
>>> I am currently at a conference in Montenegro, so don't have enough time
>>> to check properly. While this could be a code bug, I suspect an OS bug
>>> connected to the known problem in W2kutils for Mac of setting the stack
>>> size. Do you have this commented out?
>>>
>>>  To expand, the reason W2kutils sets the stack size is because this was
>>> a very common problem (look at the mail list some years ago for ulimit),
>>> some sys_admins were setting it too low and openmpi was not by default
>>> passing ulimit values. If it is not large enough, problems occur. The
>>> argument you are using, -heap-arrays, puts arrays on the heap instead of
>>> the stack (it is similar to the Fortran save command). This is slower,
>>> although this does not matter much in mixer.
>>>
>>>  Unless you can identify something specific, I am not sure what I can
>>> do as I have no access to Mac. Maybe run mixer using ddd (or gdb) ? As one
>>> caveat, with this type of issue sometimes it does not show up at the source.
>>>
>>>  N.B. mixer is a bit of a memory hog, and sometime I should try and
>>> clean up some of the arrays. Unfortunately this is hard with code that is
>>> changing.
>>>
>>>
>>>  On Sun, Aug 31, 2014 at 6:30 PM, Kevin Jorissen <
>>> kevinjorissenpdx at gmail.com> wrote:
>>>
>>>>  Thanks, Martin, for sharing some advanced ideas.
>>>>
>>>>  I spent a few minutes trying to find out more, throwing a diagnostic
>>>> compile line at the problem :
>>>>
>>>> -gen-interfaces -warn interfaces -fp-stack-check -g -traceback -check
>>>> arg_temp_created -check bounds
>>>>  trying to catch anything potentially suspicious.  The problem with
>>>> most codes I've worked on is that you typically catch a bunch of unrelated
>>>> things that obscure the analysis :).  In this case, e.g., the argument F to
>>>> TrustStep (called before the NormS mentioned earlier) is an allocated array
>>>> on one side and implicit on the other, and that offends the compile options
>>>> above.  I don't have much time for analysis right now - maybe the mixer
>>>> developers will immediately spot what's going on in my earlier e-mail.
>>>>  "check bounds" or "check all" by themselves don't give any runtime
>>>> diagnostics, so I'm guessing we're not overstepping array bounds explicitly.
>>>>
>>>>  If you have a more specific idea for a test, I or maybe Jianxin can
>>>> try to run it for you.  I guess a basic one would be to just do the
>>>> run_lapw calculation on Linux vs. Mac (with -heap-arrays) and see if the
>>>> results are identical.
>>>>
>>>>  Cheers,
>>>>
>>>>  Kevin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>  On Sun, Aug 31, 2014 at 4:29 PM, Martin Kroeker <
>>>> martin at ruby.chemie.uni-freiburg.de> wrote:
>>>>
>>>>> This might warrant closer scrutiny - was it reproducible with any odd
>>>>> tutorial problem, or does it require a particular case or type of
>>>>> calculation ?
>>>>> The "illegal instruction" abort signals that data was somehow spilling
>>>>> over into the memory ranges holding the executable code. Now I would
>>>>> not expect a "simple" heap-stack-collision (from an array that is
>>>>> simply too big to put on the stack with impunity) to occur on any
>>>>> modern system except perhaps severely constrained embedded ones. At
>>>>> worst, the abort should have been accompanied by a "segmentation
>>>>> fault" message as the attempt to overwrite the running program got
>>>>> caught. So other possible explanations could be that the code tries to
>>>>> store more array elements than the array was designed to hold, or that
>>>>> the indexes into the array are miscalculated (overflowing or not
>>>>> clamped to positive values). Moving data to the heap may have just
>>>>> changed the location of the inadvertently overwritten memory to ranges
>>>>> where the effects are more subtle (unrelated data) or not noticeable
>>>>> (lucky hit on unused memory).
>>>>> --
>>>>> Dr. Martin Kroeker            martin at ruby.chemie.uni-freiburg.de
>>>>> c/o Prof.Dr. Caroline Roehr
>>>>> Institut fuer Anorganische und Analytische Chemie der Universitaet
>>>>> Freiburg
>>>>>
>>>>> _______________________________________________
>>>>> Wien mailing list
>>>>> Wien at zeus.theochem.tuwien.ac.at
>>>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>>>> SEARCH the MAILING-LIST at:
>>>>> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>>>>>
>>>>
>>>>
>>>
>>>
>>>  --
>>> Professor Laurence Marks
>>> Department of Materials Science and Engineering
>>> Northwestern University
>>> www.numis.northwestern.edu
>>> Corrosion in 4D: MURI4D.numis.northwestern.edu
>>> Co-Editor, Acta Cryst A
>>> "Research is to see what everybody else has seen, and to think what
>>> nobody else has thought"
>>> Albert Szent-Gyorgi
>>>
>>