[Wien] issue with balance: keyword in .machines

Straus, Daniel B dstraus at tulane.edu
Wed May 29 21:10:05 CEST 2024


Hi,

I am trying to use the balance: keyword in .machines to allocate k-points on the fly rather than all at once. Although all nodes on the cluster I use are identical, some k-points are computed much faster than others.

When running a band structure calculation (x lapw1 -band -up -p), there appears to be a bug in how the remaining k-points are assigned after the initial allocation. This was run on 4 nodes. Once the first job ends, it tries to access a .machine file (.machine5) that doesn't exist: there are only 4 nodes, so only four .machine* files are created. The job still completes correctly, so I'm not sure whether this is just a logging issue or whether the subsequent k-points are being assigned to the wrong nodes.

case.klist_band had 102 k-points in it.

Here is the relevant STDOUT:
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 45240
[2] 45278
[3] 45313
[4] 45344
LAPW1 END
[3]    Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
sort: open failed: .machine5: No such file or directory
[5] 111735
LAPW1 END
[2]    Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
sort: open failed: .machine6: No such file or directory
[6] 111966
LAPW1 END
[5]    Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
LAPW1 END
LAPW1 END
LAPW1 END
[6]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
[4]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
[1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
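
To check output like the above for this symptom, here is a minimal Python sketch. It is not part of WIEN2k, and the function name missing_machine_files is hypothetical. It extracts the reported number of parallel jobs and the indices of the .machineN files that sort could not open. The log string is a trimmed excerpt of the STDOUT above.

```python
import re

# Trimmed excerpt of the STDOUT quoted above (job-control lines omitted).
log = """\
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
sort: open failed: .machine5: No such file or directory
sort: open failed: .machine6: No such file or directory
"""


def missing_machine_files(text):
    """Return (number of parallel jobs, sorted indices of missing .machineN files)."""
    # The job count is printed as "<N> number_of_parallel_jobs".
    jobs_match = re.search(r"^\s*(\d+)\s+number_of_parallel_jobs", text, re.MULTILINE)
    n_jobs = int(jobs_match.group(1)) if jobs_match else None
    # Collect every index N from "open failed: .machineN:" messages.
    missing = sorted({int(n) for n in re.findall(r"open failed: \.machine(\d+):", text)})
    return n_jobs, missing


n_jobs, missing = missing_machine_files(log)
print(n_jobs, missing)  # prints: 4 [5, 6]
```

For this excerpt, both missing indices (5 and 6) exceed the 4 parallel jobs and coincide with the shell job numbers [5] and [6] in the log. Whether k-points are actually assigned to the wrong nodes cannot be decided from the log alone.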

.machines file is as follows:

#
omp_global:5 #(this is 5 because there are two 10 core processors, and I run 4 mpi jobs per node)
lapw0:host1:4 host2:4 host3:4 host4:4
balance:
1:host1:4
1:host2:4
1:host3:4
1:host4:4
granularity:1
extrafine:1
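
The core count behind omp_global:5 can be checked with a short Python sketch. The variable and function names are illustrative only; the numbers come from the comment in the .machines file above (two 10-core processors per node, 4 MPI jobs per node, 5 OpenMP threads each).

```python
# Values taken from the .machines file and its comment above.
cores_per_node = 2 * 10  # two 10-core processors per node
mpi_per_node = 4         # the ":4" on each host line
omp_threads = 5          # omp_global:5


def threads_per_node(mpi_ranks, omp):
    """Total threads started on one node: MPI ranks times OpenMP threads per rank."""
    return mpi_ranks * omp


# 4 MPI ranks x 5 OpenMP threads fill the 20 cores of a node exactly.
print(threads_per_node(mpi_per_node, omp_threads) == cores_per_node)  # prints: True
```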

I've removed the balance: keyword from my .machines file for now.

Daniel

Daniel Straus
Assistant Professor
Department of Chemistry
Tulane University
5088 Percival Stern Hall
6400 Freret Street
New Orleans, LA 70118
(504) 862-3585
http://straus.tulane.edu/

