[Wien] Running parallel job with Slurm+intel mpi

Md. Fhokrul Islam fislam at hotmail.com
Thu Jun 23 11:34:54 CEST 2016


Hi Gavin,


    Thank you very much. You are right, I was using -lmkl_blacs_intelmpi_lp64 instead of -lmkl_blacs_openmpi_lp64.

I have recompiled and it is working now.


Thanks again,

Fhokrul


________________________________
From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Gavin Abo <gsabo at crimson.ua.edu>
Sent: Thursday, June 23, 2016 1:59 AM
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] Running parallel job with Slurm+intel mpi

SIGSEGV errors can be hard to solve because they have many possible causes:

http://www.democritos.it/pipermail/pw_forum/2005-March/002272.html
http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors/

However, the cause may be the same as in an earlier case, where the error occurred because -lmkl_blacs_lp64 was used when -lmkl_blacs_openmpi_lp64 should have been used for Open MPI [ http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg12746.html ].

Which BLACS library did you use in your parallel compiler options (RP_LIB) when you compiled WIEN2k with Open MPI?
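
To make the mismatch concrete, here is a minimal shell sketch that checks whether the BLACS library named in a link line matches the MPI flavor. The function name check_blacs and the sample link line are illustrative assumptions and are not part of WIEN2k or its configuration scripts; the library names are the MKL BLACS variants mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k): report whether the MKL BLACS
# library in a link line matches the MPI implementation in use.
# usage: check_blacs <openmpi|intelmpi> "<link line>"
check_blacs() {
    flavor=$1
    libs=$2
    want="-lmkl_blacs_${flavor}_lp64"
    # Pad with spaces so only a whole-word match of the library flag counts.
    case " $libs " in
        *" $want "*) echo "ok: $want" ;;
        *)           echo "mismatch: expected $want" ;;
    esac
}

# Example link line (an assumption for illustration only).
check_blacs openmpi "-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64"
```

With an Open MPI build, the example above reports a mismatch, which corresponds to the kind of mix-up that can lead to the SIGSEGV in lapw1.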

On 6/22/2016 7:30 AM, Md. Fhokrul Islam wrote:

Hi Prof Blaha,


    I have compiled the MPI version of Wien2k_14.2 with OpenMPI and obtained all the MPI executables.
But when I run a test calculation using 4 cores, I get the following error message. Could you
please let me know what I should do to fix this problem?


case.dayfile


   start       (Tue Jun 21 20:57:01 CEST 2016) with lapw0 (100/99 to go)

    cycle 1     (Tue Jun 21 20:57:01 CEST 2016)         (100/99 to go)

>   lapw0 -p    (20:57:01) starting parallel lapw0 at Tue Jun 21 20:57:02 CEST 2016
-------- .machine0 : 4 processors
4.684u 0.733s 0:04.50 120.2%    0+0k 220432+4688io 78pf+0w
>   lapw1  -up -p    -c         (20:57:06) starting parallel lapw1 at Tue Jun 21 20:57:06 CEST 2016
->  starting parallel LAPW1 jobs at Tue Jun 21 20:57:06 CEST 2016
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
     au063 au063 au063 au063(47)  Child id           3 SIGSEGV
 Child id           0 SIGSEGV
 Child id           1 SIGSEGV
 Child id           2 SIGSEGV
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1543169252.



Error file:


LAPW0 END
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
[au063:29993] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[au063:29993] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



Thanks,

Fhokrul




________________________________
From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Peter Blaha <pblaha at theochem.tuwien.ac.at>
Sent: Monday, June 13, 2016 8:54 AM
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] Running parallel job with Slurm+intel mpi

No, the srun command cannot work with these options; it needs other
switches.
Most likely in WIEN2k_14.2 you will have to use   mpirun, or try:

srun -n_NP_   (i.e. without the blank between n and _NP_)


In the next version, we will have support for srun, but at least for
mixed k-point+mpi support it also requires a small change in the
lapw*para scripts.
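
The effect of removing the blank can be seen by expanding the placeholders by hand. The sketch below is only an illustration using sed; it is not how the lapw*para scripts perform the substitution, and the expand_template function, the trailing _EXEC_ in the template, and the sample values are all assumptions.

```shell
#!/bin/sh
# Hypothetical illustration: expand the _NP_ and _EXEC_ placeholders of a
# launch template the way a wrapper script might.
# usage: expand_template "<template>" <nprocs> "<executable and args>"
expand_template() {
    printf '%s\n' "$1" | sed -e "s|_NP_|$2|g" -e "s|_EXEC_|$3|g"
}

# Sample values (assumed): 4 processes, lapw0_mpi with its .def file.
expand_template 'srun -n_NP_ _EXEC_' 4 'lapw0_mpi lapw0.def'
# prints: srun -n4 lapw0_mpi lapw0.def
```

Printing the expanded command like this before submitting a job is a cheap way to confirm what srun will actually receive.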



On 06/09/2016 04:29 PM, Md. Fhokrul Islam wrote:
> Dear Wien2k users,
>
>
>       I am trying to do some calculations on a large system with the MPI
> version but am having problems running the job. I have compiled
> Wien2k 14.2 with Intel libraries and have generated all MPI executables.
>
> But our system requires us to use srun instead of mpirun, so I have
> changed the parallel options to
>
>
> srun -n _NP_ -machinefile _HOSTS_ _EXEC_.
>
>
> I have also tried another option that I saw on the mailing list,
>
>
>      srun -n _NP_
>
>
> but that didn't work. No new files are created and the dayfile is stuck
> in lapw0.
>
>
> case.dayfile:
> ----------------------------------
> Calculating GaAs in /lunarc/nobackup/users/eishfh/WIEN2k/test/GaAs
> on au054 with PID 189944
> using WIEN2k_14.2 (Release 15/10/2014) in
> /lunarc/nobackup/users/eishfh/SRC/Wien2k14.2-mpi
>
>
>      start       (Thu Jun  9 13:14:39 CEST 2016) with lapw0 (100/99 to go)
>
>      cycle 1     (Thu Jun  9 13:14:39 CEST 2016)         (100/99 to go)
>
>  >   lapw0 -p    (13:14:39) starting parallel lapw0 at Thu Jun  9 13:14:39 CEST 2016
> -------- .machine0 : 4 processors
> -------------------
>
>
> I understood from the user guide that the -p option in runsp_lapw picks up
> the MPI version depending on the form of the .machines file. Here is the
> .machines file that I used for this test calculation.
>
>
> #
> lapw0:au165 au165 au165 au165
> 1:au165 au165 au165 au165
> granularity:1
> extrafine:1
> lapw2_vector_split:2
>
> So I am wondering if anyone can tell me how I can fix the problem.
>
> Thanks,
>
> Fhokrul
>
>
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien



> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>

--

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------