[Wien] .machines for several nodes
Christian Søndergaard Pedersen
chrsop at dtu.dk
Mon Oct 12 22:17:02 CEST 2020
Dear everybody
I am following up on this thread to report two separate errors in my attempts to parallelize a calculation properly. In the first, the calculation used 0.00% of the available CPU resources. My .machines file looks like this:
#
dstart:g004:8 g010:8 g011:8 g040:8
lapw0:g004:8 g010:8 g011:8 g040:8
1:g004:16
1:g010:16
1:g011:16
1:g040:16
My submit script calls the following commands:
srun hostname -s > slurm.hosts
run_lapw -p
x qtl -p -telnes
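Since the node names change from job to job, a .machines file with the layout shown above could be generated from slurm.hosts instead of being written by hand. The following is only a minimal sketch under assumptions: write_machines is a hypothetical helper (not part of WIEN2k or Slurm), and the two core counts are passed in as arguments rather than detected.

```shell
# Sketch only: write_machines is a hypothetical helper, not a WIEN2k tool.
# Arguments: host list file, cores per node for dstart/lapw0,
#            cores per node for each k-point line.
write_machines() (
    hostfile=$1
    lapw0_cores=$2
    kpoint_cores=$3

    # srun prints one line per task, so keep each node once, in first-seen order.
    hosts=$(awk 'NF && !seen[$1]++ { print $1 }' "$hostfile")

    # Build the "host:cores host:cores ..." list used on the dstart and lapw0 lines.
    spec=""
    for h in $hosts; do
        spec="$spec $h:$lapw0_cores"
    done
    spec=${spec# }

    echo "#"
    echo "dstart:$spec"
    echo "lapw0:$spec"

    # One k-point line per node, each with weight 1.
    for h in $hosts; do
        echo "1:$h:$kpoint_cores"
    done
)
```

In the job script this would be called as `write_machines slurm.hosts 8 16 > .machines` before `run_lapw -p`.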
Of course, the job never reached x qtl. The resulting case.dayfile is short, so I include all of it here:
Calculating test-machines in /path/to/directory
on node.host.name.dtu.dk with PID XXXXX
using WIEN2k_19.1 (Release 25/6/2019) in /path/to/installation/directory/WIEN2k/19.1-intel-2019a
start (Mon Oct 12 19:04:06 CEST 2020) with lapw0 (40/99 to go)
cycle 1 (Mon Oct 12 19:04:06 CEST 2020) (40/99 to go)
> lapw0 -p (19:04:06) starting parallel lapw0 at Mon Oct 12 19:04:06 CEST 2020
-------- .machine0 : 32 processors
[1] 16095
The .machine0 file displays the lines
g004 [repeated for 8 lines]
g010 [repeated for 8 lines]
g011 [repeated for 8 lines]
g040 [repeated for 8 lines]
which tells me that the .machines file works as intended and that the cause of the problem lies elsewhere. This brings me to the second error, which occurred when I tried calling mpirun explicitly, like so:
srun hostname -s > slurm.hosts
mpirun run_lapw -p
mpirun qtl -p -telnes
from within the job script. This crashed the job right away. The lapw0.error file prints out "Error in Parallel lapw0" and "check ERROR FILES!" a number of times. The case.clmsum file is present and looks correct, and the .machines file looks like the one from before (with different node numbers). However, the .machine0 file now looks like:
g094
g094
g094
g081
g081
g08g094
g094
g094
g094
g094
[...]
That is, there is an error on line 6, where a node is not properly named and a line break is missing. The dayfile prints "> stop error" sixteen times in total. I don't know whether the above .machine0 file is the culprit, but it seems the obvious conclusion. Any help in this matter will be much appreciated.
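A malformed entry like the one on line 6 can be caught automatically by comparing .machine0 against the node names in slurm.hosts. The sketch below is only an illustration: check_machine_file is a hypothetical helper (not part of WIEN2k), and it assumes every line of .machine0 should exactly match a name that appears in slurm.hosts.

```shell
# Sketch only: check_machine_file is a hypothetical helper, not a WIEN2k tool.
# Arguments: host list file (e.g. slurm.hosts), machine file (e.g. .machine0).
# Prints every line of the machine file that is not a known host name and
# returns a non-zero status if any such line is found.
check_machine_file() (
    hostfile=$1
    machinefile=$2
    awk '
        BEGIN { bad = 0 }
        # First file: collect the allowed host names.
        FILENAME == ARGV[1] { if (NF) known[$1] = 1; next }
        # Second file: report any line that is not exactly a known host.
        !($0 in known) {
            printf "line %d: unexpected entry \"%s\"\n", FNR, $0
            bad = 1
        }
        END { exit bad }
    ' "$hostfile" "$machinefile"
)
```

Calling `check_machine_file slurm.hosts .machine0` after the failed run would list the offending lines with their line numbers.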
Best regards
Christian