[Wien] Parallel LAPW1 job suspended without any error message

Peter Blaha peter.blaha at tuwien.ac.at
Wed Nov 15 09:15:18 CET 2023


The .machines file you show is for k-parallelization on the local host only.
Thus:
i) mpi is not used and all mpi settings are irrelevant for this.
ii) The k-point parallelization is steered by the variable USE_REMOTE in 
$WIENROOT/WIEN2k_parallel_options

If set to 0, you can run ONLY on your localhost. It will simply start N 
lapw1 lapw1_n.def jobs in the background. Nothing else is needed.
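
For illustration (just a rough sketch, not the actual lapw1para script): with the 
4-line .machines file you show below, the USE_REMOTE=0 case essentially boils 
down to

   lapw1 lapw1_1.def &
   lapw1 lapw1_2.def &
   lapw1 lapw1_3.def &
   lapw1 lapw1_4.def &
   wait

i.e. the same commands you already verified to work by hand.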

If set to 1, you can run on the local host and on remote hosts, provided you 
meet the following requirements:
i) a common NFS file system, i.e. your data must be available under the 
same path on all nodes.
ii) you need passwordless ssh (or whatever you have configured during 
siteconfig), i.e. a command like   ssh localhost hostname
must execute without any further input/confirmation (for all nodes you 
specified).

This can be done using ssh keys (see 5.5.1 in the UG).
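
A minimal sketch of one common way to set up such keys (the details in 5.5.1 of 
the UG may differ for your environment):

   ssh-keygen -t rsa        # generate a key pair; use an empty passphrase
   ssh-copy-id localhost    # install the public key (repeat for every node you use)
   ssh localhost hostname   # must now print the hostname without any prompt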

I'd expect that when   x lapw1 -p   hangs,   ps -ef|grep ssh   would show 4
ssh localhost ...
commands waiting forever.
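
To quickly test passwordless ssh for every host in your .machines file (an 
ad-hoc check, assuming the simple 1:hostname format you show below), each of 
these should return immediately:

   for h in $(grep '^1:' .machines | cut -d: -f2); do ssh $h hostname; done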

PS: WIEN2k_19 is outdated; I strongly recommend using 23.2. It has a 
much better initialization and produces more efficient input files.


On 15.11.2023 at 08:09, heungsikim at kangwon.ac.kr wrote:
> Dear Wien2k users,
> 
> I’ve recently encountered a strange situation with the parallel execution of 
> Wien2k (version 19). Normally I run wien2k jobs using OpenMP and they 
> work without any trouble. But recently there has been a project for which I 
> need to run wien2k using k-point parallelization, and I am running into 
> trouble that I couldn’t solve.
> 
> Issue:
> 
>   * When running wien2k using k-point parallelization (with the -p
>     option in run_lapw and a .machines file), the job suspends at the
>     lapw1 stage and does not produce any lapw1 output (such as
>     case.vector_* files) or error messages.
>   * Terminating the job and running the command “x lapw1 -p” reproduces
>     the symptom. Checking the active processes on the compute node while
>     the “x lapw1 -p” command is running does not show any lapw1 jobs,
>     only the suspended lapw1para script.
>   * Removing the -p option and running in serial or with OpenMP
>     multithreading works totally fine.
> 
> Further info. on my system:
> 
>   * Wien2k version: 19.1 (I also unofficially tried version 23; the
>     same problem persists)
>   * System: Ubuntu 20.04 LTS
>   * Compiler, math library: Intel oneapi 2023 version, with intel icc,
>     ifort, mpiifort, and MKL (lapack, blacs, scalapack).
>   * FFTW: FFTW3, compiled using intel compilers from source (ver. 3.3.8)
>   * MPI: Intel MPI included in the Intel oneapi package, and with
>     MPI_REMOTE = 0
>       o Tried both using / not using mpi parallelization. The same lapw1
>         suspension persists.
> 
> My .machines file looks like the following (for a 4-core test job):
> ----
> granularity:1
> 1:localhost
> 1:localhost
> 1:localhost
> 1:localhost
> extrafine:1
> ----
> 
> I checked that, after running x lapw1 -p, the case.klist_* files 
> and lapw1_*.def files are created in the working directory (and also 
> the “.machine*” files). Running each k-divided case with lapw1 (for 
> example, with commands like “lapw1 lapw1_1.def”) works fine and creates 
> the case.vector_* files correctly. Strangely, the actual "x lapw1 -p" (or 
> “lapw1para_lapw lapw1.def”) does not reach the lapw1-running stage and 
> hangs somewhere before it.
> 
> Because this suspension does not produce any error or other messages, I 
> have no idea how to solve this issue. So far, what I have tried is as 
> follows:
> 
>   * Recompiling wien2k without any mpi-related options (which means
>     even with MPI_REMOTE set to 1)
>   * Tuning DELAY and SLEEPY in lapw1para
>   * Running the parallel job on a local storage (not on a NFS storage)
>   * As mentioned above, using the newer wien2k version 23 (just for
>     testing purposes! I am not producing any scientific results with it)
>   * Removing fftw3. But this should not matter, because lapw1 does not
>     seem to use fftw
> 
> none of which were successful in rectifying the issue.
> 
> I tried searching the previous wien2k mailing list (I might have missed 
> something), but I couldn’t find any issue similar to mine. Any comments 
> will be highly appreciated!
> 
> Best regards,
> Heung-Sik
> 
> ---
> *Heung-Sik Kim*
> Assistant Professor
> Department of Physics
> Kangwon National University
> email: heungsikim at kangwon.ac.kr
> https://sites.google.com/view/heungsikim/
> 
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html

-- 
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300
Email: peter.blaha at tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------

