[Wien] Problems in parallel jobs

Laurence Marks L-marks at northwestern.edu
Tue Jul 27 13:59:42 CEST 2010


This is not an error, just something untidy in the code (which should
be cleaned up).

The compilation script does a "grep -e Error" on the file compile.msg,
which sometimes (depending on which C flags are used) matches the two
lines you mentioned. The grep picks up the material within the "/*"
and "*/" of a C comment and prints it out in a misleading fashion.

N.B., if you want to fix this, just remove "Error" from
SRC_vecpratt/W2kutils.c so that the relevant line reads
signal ( SIGBUS, w2ksignal_bus ); /* Bus  */


On Tue, Jul 27, 2010 at 5:41 AM, bothina hamad <both_hamad at yahoo.com> wrote:
> Dear Laurence,
> Thank you for assisting us with our problems. We are still facing problems; I know this is not related to the code, but we still need help from experts in parallel compilation.
>
> The code works just fine on a cluster where we have compiled it using the extra '-assu buff' flag to alleviate NFS problems.
> However, on another cluster, which has a GPFS parallel file system and does not use NFS, the only problem that we encounter is the following compile issue:
>
>
>
>  Compile time errors (if any) were:
>  SRC_vecpratt/compile.msg:          signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
>  SRC_vecpratt/compile.msg:          signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
>
>
>  Check file   compile.msg   in the corresponding SRC_* directory for the
>  compilation log and more info on any compilation problem.
>
>
>  I'm not sure what the vecpratt part of the code actually does.
>
>  When I look in the compile.msg file there are just warnings, not errors, and looking in the SRC_vecpratt directory I see that the executables are actually built.
>
> Where is vecpratt used by the program?
>
>
> Thanks in advance
> Bothina
>
>
> --- On Wed, 7/21/10, Laurence Marks <L-marks at northwestern.edu> wrote:
>
>> From: Laurence Marks <L-marks at northwestern.edu>
>> Subject: Re: [Wien] Problems in parallel jobs
>> To: "A Mailing list for WIEN2k users" <wien at zeus.theochem.tuwien.ac.at>
>> Date: Wednesday, July 21, 2010, 3:03 PM
>> Also:
>>
>> 5) Use ldd of $WIENROOT/lapw1_mpi on different nodes to
>> check that
>> these are correct, and also "which mpirun".
>>
>> On Wed, Jul 21, 2010 at 6:47 AM, Laurence Marks
>> <L-marks at northwestern.edu>
>> wrote:
>> > Hard to know for certain, but this looks like an OS problem rather
>> > than a Wien2k issue. Things to check:
>> >
>> > 1) Use ompi_info and check, carefully, that the
>> compilation
>> > options/libraries used for openmpi are the same as
>> what you are using
>> > to compile Wien2k.
>> >
>> > 2) For 10.1, ulimit -s is not needed for mpi (and in
>> any case does
>> > nothing with openmpi) as this is done in software in
>> Wien2kutils.c.
>> > Make sure that you are exporting environmental
>> parameters in your
>> > mpirun call, for instance use in parallel_options
>> > setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH
>> -np _NP_
>> > -machinefile _HOSTS_ _EXEC_"
>> >
>> > 3) Check the size of the job you are running, e.g. via top, by
>> > looking in case.output1_X, by using "lapw1 -p -nmat_only", by using
>> > ganglia or nmon, or by cat /proc/meminfo (or anything else you have
>> > available).
>> > Particularly with openmpi but with some other flavors
>> as well, if you
>> > are asking for too much memory and/or have too many
>> processes running,
>> > problems occur. A race condition can also occur in
>> openmpi which makes
>> > this problem worse (maybe patched in latest version, I
>> am not sure).
>> >
>> > 4) Check, carefully (twice), for format errors in the input files. It
>> > turns out that ifort has its own signal traps, so a child can exit
>> > without correctly calling mpi_abort. A race condition can occur with
>> > openmpi when the parent is trying to find a child, the child does not
>> > exist, and the parent waits and then keeps going....
>> >
>> > 5) Check the OS logs in /var/log (beyond my
>> competence). You may have
>> > too high an nfs load, bad infiniband/myrinet (recent
>> OFED?) etc. Use
>> > -assu buff in compilation options to reduce nfs load.
>> >
>> > On Wed, Jul 21, 2010 at 3:53 AM, bothina hamad <both_hamad at yahoo.com>
>> wrote:
>> >> Dear Wien users,
>> >>
>> >> When running optimisation jobs under the torque queuing system for
>> >> anything but very small systems:
>> >>
>> >> The job runs for many cycles using lapw0, lapw1, lapw2 (parallel)
>> >> successfully, but eventually the 'mom-superior' node (the one that
>> >> launches mpirun) stops communicating with the other nodes involved
>> >> in the job.
>> >>
>> >> At the console of this node there is the correct load (4 for a quad
>> >> processor) and free memory... but it can no longer access any nfs
>> >> mounts or ping other nodes in the cluster... I am eventually forced
>> >> to reboot the node and kill the job from the cluster queuing system
>> >> (the job enters the 'E' state and stays there... I need to stop
>> >> pbs_server, manually remove the job files from
>> >> /var/spool/torque/server_priv/jobs, and then restart pbs_server).
>> >>
>> >> A similar problem is encountered on a larger cluster (same install
>> >> procedure), but with the added problem that the .dayfile reports
>> >> that for lapw2 only the 'mom-superior' node is doing work (even
>> >> though, logging into the other job nodes, top reports the correct
>> >> load and 100% cpu use).
>> >>
>> >> DOS calculation seems to work properly on both
>> clusters...
>> >>
>> >> We have used a modified x_lapw that you provided
>> earlier.
>> >> We have been inserting 'ulimit -s unlimited' into
>> job-scripts
>> >>
>> >> We are using...
>> >> Centos5.3 x86_64
>> >> Intel compiler suite with mkl v11.1/072
>> >> openmpi-1.4.2, compiled with intel compilers
>> >> fftw-2.1.5, compiled with intel compilers and
>> openmpi above
>> >> Wien2k v10.1
>> >>
>> >> Optimisation jobs for small systems complete OK on
>> both clusters.
>> >>
>> >> The working directories for this job are large
>> (>2GB).
>> >>
>> >> Please let us know what files we could send you from these that
>> >> may be helpful for diagnosis...
>> >>
>> >> Best regards
>> >> Bothina
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Wien mailing list
>> >> Wien at zeus.theochem.tuwien.ac.at
>> >> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> >>
>> >
>> >
>> >
>>
>>
>>
>>
>
>
>
>



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.

