[Wien] MPI error with 2 nodes (40 cores)

Md. Fhokrul Islam fislam at hotmail.com
Thu Dec 8 22:05:35 CET 2016


Hi Prof. Blaha,


Sorry about the confusion with the subject of the email. It is not a spin-orbit calculation,
just a normal SCF calculation. I was reusing one of my previous emails and forgot to change
the subject line before sending it to the user group.


This job is a surface supercell calculation with 360 atoms. With 20 cores it takes about
3 hours per SCF cycle, so I was testing whether 40 cores speed up the calculation.
Eventually I will have to run spin-orbit calculations for jobs of similar size, which
will take even more time, so I need to speed up the calculations.


During compilation of the MPI version (see below) I selected the shared-memory
architecture option, which is correct for a single node with 20 cores. But the nodes are
physically separate and do not share common memory, so I am wondering whether the error
is related to my choice of shared-memory architecture. If that is the case, should I
recompile Wien2k without the shared-memory option?
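
If I understand it correctly, the shared-memory answer mainly controls what siteconfig
writes into $WIENROOT/parallel_options. My guess (not verified, and the mpirun line is
surely site-specific) is that a two-node setup would need something like:

   setenv TASKSET "no"
   setenv USE_REMOTE 1
   setenv MPI_REMOTE 0
   setenv WIEN_GRANULARITY 1
   setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"

i.e. USE_REMOTE set to 1 so that jobs are started on the remote node via ssh rather than
in the background as on a single shared-memory machine. Please correct me if that is not
how these variables work.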




   **********************************
   *  Configure parallel execution  *
   **********************************

   These options are stored in   parallel_options  of WIENROOT
   You can change them later also manually.

   Do you use ONLY a shared memory parallel architecture (ONE single multi-core
   node)  ?

   On shared memory system it is normally better to start jobs in the
   background rather than using remote commands. If you select a shared memory
   system WIEN will by default not use remote shell commands
   (USE_REMOTE and MPI_REMOTE = 0 in parallel_options)
   and set the default granularity to 1.

   You still can override this default granularity in your .machines file.

   You may also set a specific TASKSET command to bind your executables
   to a specific core on multicore machines.
  Shared Memory Architecture? (y/N):y
  Do you know/need a command to bind your jobs to specific nodes ?
  (like taskset -c). Enter N / your_specific_command:
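
In case it matters, my understanding is that for one 40-core MPI job spanning both nodes
the .machines file should look roughly like this (a sketch only; au039 and au042 are the
node names from the dayfile below, and I am not sure about the granularity setting):

   granularity:1
   1:au039:20 au042:20
   lapw0:au039:20 au042:20

where the "1:" line defines a single lapw1/lapw2 MPI job distributed over 40 cores and
the lapw0: line does the same for lapw0. Please correct me if this format is wrong.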

Thanks,
Fhokrul

________________________________
From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Peter Blaha <pblaha at theochem.tuwien.ac.at>
Sent: Thursday, December 8, 2016 6:47 PM
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] lapwso_mpi error

What kind of job is it, that lapw0_mpi runs for 9800 seconds ???

Is there any speedup when using 40 instead of 20 cores ?

Your error is in lapw1_mpi, not in lapwso_mpi ???

No idea about your software, but I doubt that it is wien2k.

On 08.12.2016 at 16:56, Md. Fhokrul Islam wrote:
> Hi Prof Blaha,
>
> I am trying to run an MPI job on 2 nodes, each with 20 cores, but the job crashes
> with the following error messages. I have tried both USE_REMOTE 0 and
> USE_REMOTE 1 in the parallel_options file, but it didn't make much of a difference.
> Our system administrator told me it is probably not a hardware issue and
> suggested that I contact the Wien2k list. So could you please let me know whether I
> need to change any MPI settings and recompile Wien2k?
>
> By the way, the same job runs fine if I use only 1 node with 20 cores.
>
> Error message:
>
> case.dayfile
>
>    cycle 1     (Thu Dec  8 15:44:06 CET 2016)  (100/99 to go)
>
>>   lapw0 -p    (15:44:06) starting parallel lapw0 at Thu Dec  8
> 15:44:07 CET 2016
> -------- .machine0 : 40 processors
> 9872.562u 20.276s 8:20.46 1976.7%       0+0k 220752+386840io 332pf+0w
>>   lapw1  -up -p    -c         (15:52:27) starting parallel lapw1 at
> Thu Dec  8 15:52:27 CET 2016
> ->  starting parallel LAPW1 jobs at Thu Dec  8 15:52:27 CET 2016
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
>      au039 au039 au039 au039 au039 au039 au039 au039 au039 au039 au039
> au039 au039 au039 au039 au039 au039 au039 au039 au039 au042 au042 au042
> au042 au042 au042 au042 au042 au042 au042 au042 au042 au042 au042 au042
> au042 au042 au042 au042 au042(1)
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 8 in communicator MPI_COMM_WORLD
> with errorcode -726817712.
>
>
> Output error file:
>
>  LAPW0 END
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
> forrtl: Interrupted system call
> w2k_dispatch_signal(): received: Terminated
> w2k_dispatch_signal(): received: Terminated
>
>
> Thanks,
> Fhokrul
>
>
>
--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php



--------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien



SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html




