[Wien] issue with balance: keyword in .machines
Peter Blaha
peter.blaha at tuwien.ac.at
Fri May 31 08:52:27 CEST 2024
I have not used "balance" for many years. Yes, it seems to be broken.
Anyway, it was intended for a different use and is probably obsolete.
I'll remove it from the description.
In any case, some info:
"dynamic load balencing" is ONLY possible, if the SCRATCH variable is
set to ./ and not to a local scratch directory !!!
granularity:3 (or another number) could do the job (maybe together with
extrafine:1).
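A minimal sketch of such a setup (host names are placeholders, and the
comments are my reading of the keywords, see the usersguide for details):

   setenv SCRATCH ./    # csh; in bash: export SCRATCH=./
   # .machines
   1:host1
   1:host2
   granularity:3        # split the k-list into smaller chunks handed out as hosts finish
   extrafine:1          # distribute the last few k-points one by one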
However, all this was developed at a time when i) CPUs were much
slower than they are today and ii) unbalanced load was common (local
clusters without queuing systems).
Nowadays the overhead of creating new jobs, ... often "wins" over
any gain from load balancing. Try it out (together with lapw2 !!)
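For a quick test, the usual k-point parallel steps would be something
like (add -up for the spin-polarized case):

   x lapw1 -p
   x lapw2 -p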
In heterogeneous clusters you could use
5:node1
3:node2
which distributes the k-points according to this speed ratio.
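To put rough numbers on it for the 102 k-points of your band-structure
list, a 5:3 weighting gives

   node1: ~ 102 * 5/8 ≈ 64 k-points
   node2: ~ 102 * 3/8 ≈ 38 k-points

(the exact chunks also depend on granularity and extrafine).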
On 29.05.2024 at 21:10, Straus, Daniel B wrote:
> Hi,
>
> I am trying to use the /balance:/ keyword in .machines to allocate
> k-points on-the-fly rather than all at once because even though all
> nodes on the cluster I use are identical, some k-points are computed
> much faster than others.
>
> When running a band structure calculation (x lapw1 -band -up -p), after
> the initial k-points are allocated, there is a bug related to assigning
> remaining k-points. This was run on 4 nodes. Once the first job ends, it
> tries to access a .machine (.machine5) file that doesn’t exist—there are
> only 4 nodes, so only four .machine* files are created. The job still
> completes correctly, so I’m not sure if this is just a logging issue or
> if the subsequent k-points are being assigned to the wrong nodes.
>
> case.klist_band had 102 k-points in it
>
> Here is the relevant STDOUT:
>
> running LAPW1 in parallel mode (using .machines)
>
> 4 number_of_parallel_jobs
>
> [1] 45240
>
> [2] 45278
>
> [3] 45313
>
> [4] 45344
>
> LAPW1 END
>
> [3] Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>
> sort: open failed: .machine5: No such file or directory
>
> [5] 111735
>
> LAPW1 END
>
> [2] Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>
> sort: open failed: .machine6: No such file or directory
>
> [6] 111966
>
> LAPW1 END
>
> [5] Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>
> LAPW1 END
>
> LAPW1 END
>
> LAPW1 END
>
> [6] + Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>
> [4] + Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>
> [1] + Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>
> .machines file is as follows:
>
> #
>
> omp_global:5 #(this is 5 because there are two 10 core
> processors, and I run 4 mpi jobs per node)
>
> lapw0:host1:4 host2:4 host3:4 host4:4
>
> balance:
>
> 1:host1:4
>
> 1:host2:4
>
> 1:host3:4
>
> 1:host4:4
>
> granularity:1
>
> extrafine:1
>
> I’ve removed the balance: keyword from my .machines file for now.
>
> Daniel
>
> Daniel Straus
>
> Assistant Professor
>
> Department of Chemistry
>
> Tulane University
>
> 5088 Percival Stern Hall
>
> 6400 Freret Street
>
> New Orleans, LA 70118
>
> (504) 862-3585
>
> http://straus.tulane.edu/ <http://straus.tulane.edu/>
>
>
--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300
Email: peter.blaha at tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------