[Wien] Problems in parallel jobs

bothina hamad both_hamad at yahoo.com
Tue Jul 27 12:41:34 CEST 2010


Dear Laurence,
             Thank you for assisting us with our problems. We are still facing difficulties; I know they are probably not related to the code itself, but we still need help from experts in parallel compilation.

The code works just fine on a cluster where we have compiled it using the extra '-assu buff' flag to alleviate NFS problems. 
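
For reference, we added the flag to the Fortran compiler options set up by siteconfig_lapw. A minimal sketch, assuming a typical ifort option line (the other flags here are only an example; the exact line depends on the local setup):

   # illustrative FOPT line; only '-assu buff' is the point here
   FOPT = -FR -mp1 -w -O1 -assu buff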

However, on another cluster, which has a GPFS parallel file system and does not use NFS, the only problem we encounter is the following compile issue:
 
 
 
Compile time errors (if any) were:
SRC_vecpratt/compile.msg:          signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
SRC_vecpratt/compile.msg:          signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */

Check file compile.msg in the corresponding SRC_* directory for the
compilation log and more info on any compilation problem.
 I'm not sure what the vecpratt part of the code actually does.
 
When I look in the compile.msg file there are only warnings, not errors, and looking in the SRC_vecpratt directory I see that the executables were actually built.
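
As far as I can tell, the check script simply greps each compile.msg for the string "Error", and the harmless C comment "/* Bus Error */" in the signal-handler source (Wien2kutils.c, mentioned below) matches it, so the message above may be a false alarm. To double-check (illustrative commands; the exact executable name in SRC_vecpratt is my guess):

   grep -i error SRC_vecpratt/compile.msg   # see which lines actually matched
   ls -l SRC_vecpratt/vecpratt*             # confirm the executable was built (name assumed)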
 
Where is vecpratt used by the program?


Thanks in advance
Bothina 


--- On Wed, 7/21/10, Laurence Marks <L-marks at northwestern.edu> wrote:

> From: Laurence Marks <L-marks at northwestern.edu>
> Subject: Re: [Wien] Problems in parallel jobs
> To: "A Mailing list for WIEN2k users" <wien at zeus.theochem.tuwien.ac.at>
> Date: Wednesday, July 21, 2010, 3:03 PM
> Also:
> 
> 5) Use ldd of $WIENROOT/lapw1_mpi on different nodes to check that these
> are correct, and also "which mpirun".
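>
> For example (sh syntax; the node names are placeholders for your own):
>
>   for h in node01 node02; do
>     ssh $h "ldd $WIENROOT/lapw1_mpi; which mpirun"   # node names are examples
>   done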
> 
> On Wed, Jul 21, 2010 at 6:47 AM, Laurence Marks <L-marks at northwestern.edu> wrote:
> > Hard to know for certain, but this looks like an OS problem rather
> > than a Wien2k issue. Things to check:
> >
> > 1) Use ompi_info and check, carefully, that the compilation
> > options/libraries used for openmpi are the same as what you are using
> > to compile Wien2k.
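> >
> > For example, something like (the grep pattern is only a guess at the
> > relevant lines of output):
> >
> >   ompi_info | grep -i compiler   # compilers openmpi was built with
> >   which ifort; ifort -V          # compiler being used for Wien2k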
> >
> > 2) For 10.1, ulimit -s is not needed for mpi (and in any case does
> > nothing with openmpi), as this is done in software in Wien2kutils.c.
> > Make sure that you are exporting environment variables in your mpirun
> > call; for instance, use in parallel_options:
> > setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
> >
> > 3) Check the size of the job you are running, e.g. via top, by looking
> > in case.output1_X, by using "lapw1 -p -nmat_only", by using ganglia or
> > nmon, or with cat /proc/meminfo (or anything else you have available).
> > Particularly with openmpi, but with some other flavors as well, if you
> > are asking for too much memory and/or have too many processes running,
> > problems occur. A race condition can also occur in openmpi which makes
> > this problem worse (maybe patched in the latest version, I am not sure).
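> >
> > For example (a rough check; the node name is a placeholder):
> >
> >   ssh node01 "grep MemFree /proc/meminfo"   # free memory on a compute node
> >
> > As a very rough estimate, a complex NMAT x NMAT matrix needs on the
> > order of NMAT*NMAT*16 bytes, so compare that (times the number of
> > processes per node) against the free memory.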
> >
> > 4) Check carefully (twice) for format errors in the input files. It
> > turns out that ifort has its own signal traps, so a child can exit
> > without correctly calling mpi_abort. A race condition can occur with
> > openmpi when the parent is trying to find a child, the child does not
> > exist, the parent waits then keeps going....
> >
> > 5) Check the OS logs in /var/log (beyond my competence). You may have
> > too high an nfs load, bad infiniband/myrinet (recent OFED?), etc. Use
> > -assu buff in the compilation options to reduce the nfs load.
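> >
> > For example (the node name is a placeholder):
> >
> >   ssh node01 "dmesg | tail -50; tail -50 /var/log/messages"
> >
> > and look for nfs, out-of-memory, or infiniband/myrinet related entries.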
> >
> > On Wed, Jul 21, 2010 at 3:53 AM, bothina hamad <both_hamad at yahoo.com> wrote:
> >> Dear Wien users,
> >>
> >> When running optimisation jobs under the torque queuing system for
> >> anything but very small systems:
> >>
> >> The job runs for many cycles using lapw0, lapw1, lapw2 (parallel)
> >> successfully, but eventually the 'mom-superior' node (the one that
> >> launches mpirun) stops communicating with the other nodes involved
> >> in the job.
> >>
> >> At the console of this node the load is correct (4 for a quad
> >> processor) and there is memory free... but the node can no longer
> >> access any nfs mounts and can no longer ping other nodes in the
> >> cluster. I am eventually forced to reboot the node and kill the job
> >> from the cluster queuing system (the job enters the 'E' state and
> >> stays there... I need to stop pbs_server, manually remove the job
> >> files from /var/spool/torque/server_priv/jobs, then restart
> >> pbs_server).
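> >>
> >> (Roughly, the recovery looks like the following, with <jobid> filled
> >> in by hand:
> >>
> >>   service pbs_server stop
> >>   rm /var/spool/torque/server_priv/jobs/<jobid>*   # files of the stuck job
> >>   service pbs_server start
> >> )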
> >>
> >> A similar problem is encountered on a larger cluster (same install
> >> procedure), but with the added problem that the .dayfile reports
> >> that for lapw2 only the 'mom-superior' node is doing work (even
> >> though, logging into the other job nodes, top reports the correct
> >> load and 100% cpu use).
> >>
> >> DOS calculation seems to work properly on both clusters...
> >>
> >> We have used a modified x_lapw that you provided earlier.
> >> We have been inserting 'ulimit -s unlimited' into job-scripts.
> >>
> >> We are using...
> >> Centos5.3 x86_64
> >> Intel compiler suite with mkl v11.1/072
> >> openmpi-1.4.2, compiled with intel compilers
> >> fftw-2.1.5, compiled with intel compilers and the openmpi above
> >> Wien2k v10.1
> >>
> >> Optimisation jobs for small systems complete OK on both clusters.
> >>
> >> The working directories for this job are large (>2GB).
> >>
> >> Please let us know what files we could send you from these that may
> >> be helpful for diagnosis...
> >>
> >> Best regards
> >> Bothina
> >>
> >>
> >>
> 
> 
> 
> -- 
> Laurence Marks
> Department of Materials Science and Engineering
> MSE Rm 2036 Cook Hall
> 2220 N Campus Drive
> Northwestern University
> Evanston, IL 60208, USA
> Tel: (847) 491-3996 Fax: (847) 491-7820
> email: L-marks at northwestern dot edu
> Web: www.numis.northwestern.edu
> Chair, Commission on Electron Crystallography of IUCR
> www.numis.northwestern.edu/
> Electron crystallography is the branch of science that uses electron
> scattering and imaging to study the structure of matter.
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> 


      

