[Wien] .machines for several nodes

Christian Søndergaard Pedersen chrsop at dtu.dk
Thu Oct 15 07:29:07 CEST 2020


Dear Professor Blaha


Thanks a lot for your responses. I have performed some additional testing, which has been delayed because I cannot run lapw0/1/2 from the command line due to memory issues; hence I have had to go through the queue for each test. On top of that, I have been unable to get information about our installation. However, I finally achieved ~99% CPU efficiency with the following setup:


CPUs: 2 nodes with 24 cores each (x073 and x082)


.machines:

dstart:x073:24 x082:24
lapw0:x073:24 x082:24
1:x073:3
1:x082:3
1:x073:3
1:x082:3
1:x073:3
1:x082:3
1:x073:3
1:x082:3  #  16 lines total; 8 for each node
1:x073:3
1:x082:3
1:x073:3
1:x082:3
1:x073:3
1:x082:3
1:x073:3
1:x082:3


After creating the .machines file I call 'mpirun run_lapw -p'. The above .machines file is essentially a combination of the two examples found on page 86 of the User's Guide (without OMP, of course). By checking the case.klist_1-16 files, I have verified that each individual job works on a different subset of the k-points. Can anyone confirm whether this setup is correct, i.e. whether it is a proper way to parallelize the lapw1/lapw2 cycles, assuming the compilation of lapw0/1/2_mpi proceeded without errors (which seems to be the case)?
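
For reference, instead of hard-coding the node names, a .machines file of this form could also be generated in the job script from the host list written by 'srun hostname -s'. A minimal (untested) sketch, assuming slurm.hosts holds one hostname per core and using 3 cores per k-parallel job; the .mach_part_* temporary names are just placeholders:

# build a .machines file of the above form from slurm.hosts
nodes=$(sort slurm.hosts | uniq -c | awk '{printf "%s:%s ", $2, $1}')
echo "dstart:$nodes"  > .machines
echo "lapw0:$nodes"  >> .machines
sort slurm.hosts | split -l 3 - .mach_part_   # one 3-core chunk per k-parallel job
for f in .mach_part_*; do
    echo "1:$(head -1 "$f"):3" >> .machines
done
rm -f .mach_part_*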


Best regards

Christian

________________________________
From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Peter Blaha <pblaha at theochem.tuwien.ac.at>
Sent: 13 October 2020 07:43:16
To: wien at zeus.theochem.tuwien.ac.at
Subject: Re: [Wien] .machines for several nodes

To run a single program for testing, do:

x lapw0 -p

(after creating the .machines file.)

Then check all error files, but in particular also the slurm output
(whatever it is called on your machines). It probably gives messages
like "library xxxx not found" or so, which are needed for additional debugging.
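
For instance, something along these lines (the name of the slurm output file depends on your batch script):

ls -l *.error                                  # non-empty error files show which step failed
grep -i -e error -e "not found" slurm-*.out    # or whatever your output file is called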

AND:

We still don't know how many cores your nodes have.

We still don't know your compiler options (WIEN2k_OPTIONS,
parallel_options), nor whether the compilation of e.g. lapw0_mpi worked
at all (compile.msg in SRC_lapw0).
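
For instance (assuming a standard installation with $WIENROOT set), this information can be found with something like:

cat $WIENROOT/WIEN2k_OPTIONS $WIENROOT/parallel_options
grep -i -e error -e warning $WIENROOT/SRC_lapw0/compile.msg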

On 12.10.2020 at 22:17, Christian Søndergaard Pedersen wrote:
> Dear everybody
>
>
> I am following up on this thread to report two separate errors in my
> attempts to properly parallelize a calculation. In the first case, the
> calculation utilized 0.00% of the available CPU resources. My .machines
> file looks like this:
>
>
> #
> dstart:g004:8 g010:8 g011:8 g040:8
> lapw0:g004:8 g010:8 g011:8 g040:8
> 1:g004:16
> 1:g010:16
> 1:g011:16
> 1:g040:16
>
> With my submit script calling the following commands:
>
>
> srun hostname -s > slurm.hosts
>
> run_lapw -p
>
> x qtl -p -telnes
>
>
> Of course, the job didn't reach x qtl. The resultant case.dayfile is
> short, so I am dumping all of it here:
>
>
> Calculating test-machines in /path/to/directory
> on node.host.name.dtu.dk with PID XXXXX
> using WIEN2k_19.1 (Release 25/6/2019) in
> /path/to/installation/directory/WIEN2k/19.1-intel-2019a
>
>
>      start       (Mon Oct 12 19:04:06 CEST 2020) with lapw0 (40/99 to go)
>
>      cycle 1     (Mon Oct 12 19:04:06 CEST 2020)         (40/99 to go)
>
>>   lapw0   -p  (19:04:06) starting parallel lapw0 at Mon Oct 12 19:04:06 CEST 2020
> -------- .machine0 : 32 processors
> [1] 16095
>
>
> The .machine0 file displays the lines
>
> g004 [repeated for 8 lines]
> g010 [repeated for 8 lines]
> g011 [repeated for 8 lines]
> g040 [repeated for 8 lines]
>
> which tells me that the .machines file works as intended, and that the
> cause of the problem lies somewhere else. This brings me to the second
> error, which occurred when I tried calling mpirun explicitly, like so:
>
> srun hostname -s > slurm.hosts
> mpirun run_lapw -p
> mpirun qtl -p -telnes
>
> from within the job script. This crashed the job right away. The
> lapw0.error file prints out "Error in Parallel lapw0" and "check ERROR
> FILES!" a number of times. The case.clmsum file is present and looks
> correct, and the .machines file looks like the one from before (with
> different node numbers). However, the .machine0 file now looks like:
>
> g094
> g094
> g094
> g081
> g081
> g08g094
> g094
> g094
> g094
> g094
> [...]
>
> That is, there is an error on line 6, where a node is not properly
> named and a line break is missing. The dayfile repeatedly prints out
> "> stop error", a total of sixteen times. I don't know whether the
> above .machine0 file is the culprit, but it seems the obvious
> conclusion. Any help in this matter will be much appreciated.
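>
> For what it's worth, a quick check along these lines makes such a
> malformed entry stand out right away (the stray "g08g094" shows up
> with a count of 1):
>
> sort .machine0 | uniq -c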
>
> Best regards
> Christian

--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at/tc_blaha
--------------------------------------------------------------------------

_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html