[Wien] Running parallel job with Slurm+intel mpi
Gavin Abo
gsabo at crimson.ua.edu
Thu Jun 23 03:59:19 CEST 2016
SIGSEGV errors can be difficult to solve, because they can have many
possible causes:
http://www.democritos.it/pipermail/pw_forum/2005-March/002272.html
http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/
However, the cause may be the same as in a similar case reported
before, where -lmkl_blacs_lp64 was used when -lmkl_blacs_openmpi_lp64
should have been used instead for Open MPI [
http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg12746.html
].
Which BLACS library did you use in your parallel compiler options (for
RP_LIB) when you compiled WIEN2k using Open MPI?
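For Open MPI, the BLACS library in the link options must be the Open MPI
variant. A minimal sketch of the relevant part of RP_LIB with Intel MKL
ScaLAPACK (illustrative only; the full setting from siteconfig contains
additional paths and libraries that depend on your installation):

```
RP_LIB = -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 $(R_LIBS)
```

If -lmkl_blacs_intelmpi_lp64 (or plain -lmkl_blacs_lp64) appears there
instead while the binaries run under Open MPI, segmentation faults like
the ones below are a typical symptom.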
On 6/22/2016 7:30 AM, Md. Fhokrul Islam wrote:
>
> Hi Prof Blaha,
>
>
> I have compiled MPI version of Wien2k_14.2 with OpenMPI and have
> got all MPI executables.
>
> But when I run a test calculation using 4 cores, I get the
> following error message. Could you
>
> please let me know what I should do to fix this problem.
>
>
> case.dayfile
>
>
> start (Tue Jun 21 20:57:01 CEST 2016) with lapw0 (100/99 to go)
>
> cycle 1 (Tue Jun 21 20:57:01 CEST 2016) (100/99 to go)
>
> > lapw0 -p (20:57:01) starting parallel lapw0 at Tue Jun 21
> 20:57:02 CEST 2016
> -------- .machine0 : 4 processors
> 4.684u 0.733s 0:04.50 120.2% 0+0k 220432+4688io 78pf+0w
> > lapw1 -up -p -c (20:57:06) starting parallel lapw1 at
> Tue Jun 21 20:57:06 CEST 2016
> -> starting parallel LAPW1 jobs at Tue Jun 21 20:57:06 CEST 2016
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> au063 au063 au063 au063(47) Child id 3 SIGSEGV
> Child id 0 SIGSEGV
> Child id 1 SIGSEGV
> Child id 2 SIGSEGV
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1543169252.
>
>
> Error file:
>
>
> LAPW0 END
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> [au063:29993] 3 more processes have sent help message help-mpi-api.txt
> / mpi-abort
> [au063:29993] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
>
> Thanks,
>
> Fhokrul
>
>
>
>
>
> ------------------------------------------------------------------------
> *From:* Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of
> Peter Blaha <pblaha at theochem.tuwien.ac.at>
> *Sent:* Monday, June 13, 2016 8:54 AM
> *To:* A Mailing list for WIEN2k users
> *Subject:* Re: [Wien] Running parallel job with Slurm+intel mpi
> No, the srun command cannot work with these options; it needs other
> switches.
> Most likely in WIEN2k_14.2 you will have to use mpirun, or try:
>
> srun -n_NP_ (i.e. without the blank between n and _NP_)
>
>
> In the next version, we will have support for srun, but at least for
> mixed k-point+mpi parallelism it also requires a small change in the
> lapw*para scripts.
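For reference, the launch command used by the lapw*para scripts is set
in $WIENROOT/parallel_options. A hedged sketch of what an srun-based
setting might look like (csh syntax; _NP_ and _EXEC_ are WIEN2k
placeholders, and the exact srun switches depend on the Slurm site
configuration):

```
setenv WIEN_MPIRUN "srun -n_NP_ _EXEC_"
```

The corresponding mpirun form for Open MPI is typically
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_".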
>
>
>
> On 06/09/2016 04:29 PM, Md. Fhokrul Islam wrote:
> > Dear Wien2k users,
> >
> >
> > I am trying to do some calculations on a large system with the mpi
> > version but am having problems with
> >
> > running the job. I have compiled Wien2k 14.2 with intel libraries and
> > have generated all mpi executables.
> >
> > But our system requires us to use srun instead of mpirun. So I have
> > changed parallel options to
> >
> >
> > srun -n _NP_ -machinefile _HOSTS_ _EXEC_.
> >
> >
> > I also have tried other options that I saw in the mailing list
> >
> >
> > srun -n _NP_
> >
> >
> > but that didn't work. No new files are created and the dayfile is stuck
> >
> > in lapw0.
> >
> >
> > case.dayfile:
> > ----------------------------------
> > Calculating GaAs in /lunarc/nobackup/users/eishfh/WIEN2k/test/GaAs
> > on au054 with PID 189944
> > using WIEN2k_14.2 (Release 15/10/2014) in
> > /lunarc/nobackup/users/eishfh/SRC/Wien2k14.2-mpi
> >
> >
> > start (Thu Jun 9 13:14:39 CEST 2016) with lapw0 (100/99
> to go)
> >
> > cycle 1 (Thu Jun 9 13:14:39 CEST 2016) (100/99 to go)
> >
> > > lapw0 -p (13:14:39) starting parallel lapw0 at Thu Jun 9
> > 13:14:39 CEST 2016
> > -------- .machine0 : 4 processors
> > -------------------
> >
> >
> > I understood from the userguide that the -p option in runsp_lapw
> > picks up the mpi version depending on the form of the .machines file.
> > Here is the .machines file that I have used for this test calculation.
> >
> >
> > #
> > lapw0:au165 au165 au165 au165
> > 1:au165 au165 au165 au165
> > granularity:1
> > extrafine:1
> > lapw2_vector_split:2
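A .machines file of this form selects the mpi mode: a line such as
"1:au165 au165 au165 au165" defines one k-point group that runs each
program as a 4-process mpi job on au165, and a "lapw0:" line with
several entries runs lapw0_mpi in parallel as well. A commented sketch
(hostnames illustrative):

```
lapw0:au165 au165 au165 au165   # run lapw0_mpi with 4 mpi processes
1:au165 au165 au165 au165       # one k-point group; lapw1_mpi/lapw2_mpi on 4 processes
granularity:1
extrafine:1
```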
> >
> > So I am wondering if anyone can tell me how I can fix this problem.
> >
> > Thanks,
> >
> > Fhokrul
> >
> >
> >
> >
> > _______________________________________________
> > Wien mailing list
> > Wien at zeus.theochem.tuwien.ac.at
> > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
>
>
> > SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> >
>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at WIEN2k: http://www.wien2k.at
> WWW: http://www.imc.tuwien.ac.at/staff/tc_group_e.php
> --------------------------------------------------------------------------