[Wien] qtl: error reading parallel vectors
Gavin Abo
gsabo at crimson.ua.edu
Sun Oct 25 01:19:25 CEST 2020
Regarding [1], I did expect that you would have to submit the commands
within a job script via the SLURM workload manager on your system, with
something like [5,6]
sbatch my_job_script.job
or by whatever method you have to use on your system, where the
commands from [7] go into the job file, such as:
my_job_script.job
-------------------------------------
#!/bin/bash
# ...
run_lapw -p
x qtl -p -telnes
x telnes3
-------------------------------------
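If it is helpful, a minimal SLURM version of such a job file might look
like the sketch below. This is only an assumption about your setup; the
#SBATCH values, the job-file name, and the SCRATCH path are placeholders
that you would have to adapt to your cluster:
my_slurm_job.job
-------------------------------------
#!/bin/bash
#SBATCH --job-name=telnes_test     # placeholder job name
#SBATCH --nodes=1                  # adjust to the node(s) you want to test
#SBATCH --ntasks-per-node=4        # cores available to the parallel run
#SBATCH --time=01:00:00            # wall-time limit, adjust as needed

# site-specific scratch location (bash equivalent of a csh 'setenv SCRATCH ...')
export SCRATCH=/scratch/$USER

# ... generate or copy a suitable .machines file into the working directory ...

run_lapw -p
x qtl -p -telnes
x telnes3
-------------------------------------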
In my case, I don't have SLURM, so I'm unable to do any testing in
that environment. Maybe someone else on the mailing list who has a SLURM
system can check whether they encounter the same problem that you are
having.
[5]
https://www.hpc2n.umu.se/documentation/batchsystem/basic-submit-example-scripts
[6] https://doku.lrz.de/display/PUBLIC/WIEN2k
[7]
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20597.html
Regarding [2], it is good to read that mpi parallel with "x qtl -p -telnes"
works fine on your system for Vanadium Dioxide (VO2). If you have
control over which nodes the calculation runs on, does the VO2 case run
fine on your 1st node (e.g., x073 [8]) with multiple cores of a single
CPU, and then does it run fine on the 2nd node (e.g., x082) with
multiple cores of a single CPU? I have read at [9] that some scheduling
managers assign the nodes automatically on the fly, so that in some
cases the user has no control over which nodes the job will run on. If
you are able to control it, does the VO2 case also run fine with mpi
parallel using 1 processor core on node 1 and 1 processor core on node 2
(see the .machines sketches below)? That may help narrow down the problem.
[8]
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20617.html
[9] http://susi.theochem.tuwien.ac.at/reg_user/faq/pbs.html
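For those two tests, the .machines files might look something like the
minimal sketches below. The node names x073 and x082 are only examples
taken from [8] and the core counts are guesses; the exact lines depend
on your hardware and on how your job script generates .machines:
.machines (test on the 1st node only)
-------------------------------------
# mpi parallel on multiple cores (here 4) of node x073
1:x073:4
granularity:1
extrafine:1
-------------------------------------
.machines (test across both nodes)
-------------------------------------
# mpi parallel with 1 core on x073 plus 1 core on x082
1:x073:1 x082:1
granularity:1
extrafine:1
-------------------------------------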
Regarding [3], the output you posted looks as expected, so nothing is
wrong there.
In the past, I posted in the mailing list some things that I found
helpful for troubleshooting parallel issues, but you would have to
search the mailing list to find them. I believe a couple of them may
have been at the following two links:
[10]
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17973.html
[11]
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2018-April/027944.html
Lastly, I have now tried a WIEN2k 19.2 calculation using mpi parallel on
my system with the struct file at
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20645.html .
It looks like it ran fine when it was set to run on two of the four
processor cores on my system:
username@computername:~/wiendata/diamond$ ls ~/wiendata/scratch
username@computername:~/wiendata/diamond$ ls
diamond.struct
username@computername:~/wiendata/diamond$ init_lapw -b
...
username@computername:~/wiendata/diamond$ cat $WIENROOT/parallel_options
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
username@computername:~/wiendata/diamond$ cat .machines
1:localhost:2
granularity:1
extrafine:1
username@computername:~/wiendata/diamond$ run_lapw -p
...
in cycle 11 ETEST: .0001457550000000 CTEST: .0033029
hup: Command not found.
STOP LAPW0 END
STOP LAPW1 END
real 0m6.744s
user 0m12.679s
sys 0m0.511s
STOP LAPW2 - FERMI; weights written
STOP LAPW2 END
real 0m1.123s
user 0m1.785s
sys 0m0.190s
STOP SUMPARA END
STOP CORE END
STOP MIXER END
ec cc and fc_conv 1 1 1
> stop
username@computername:~/wiendata/diamond$ cp $WIENROOT/SRC_templates/case.innes diamond.innes
username@computername:~/wiendata/diamond$ x qtl -p -telnes
running QTL in parallel mode
calculating QTL's from parallel vectors
STOP QTL END
6.5u 0.0s 0:06.77 98.3% 0+0k 928+8080io 4pf+0w
username@computername:~/wiendata/diamond$ cat diamond.inq
0 2.20000000000000000000
1
1 99 1 0
4 0 1 2 3
username@computername:~/wiendata/diamond$ x telnes3
STOP TELNES3 DONE
3.2u 0.0s 0:03.39 98.8% 0+0k 984+96io 3pf+0w
username@computername:~/wiendata/diamond$ ls -l ~/wiendata/scratch
total 624
-rw-rw-r-- 1 username username 0 Oct 24 15:40 diamond.vector
-rw-rw-r-- 1 username username 637094 Oct 24 15:43 diamond.vector_1
-rw-rw-r-- 1 username username 0 Oct 24 15:44 diamond.vectordn
-rw-rw-r-- 1 username username 0 Oct 24 15:44 diamond.vectordn_1
On 10/24/2020 2:30 PM, Christian Søndergaard Pedersen wrote:
>
> Hello Gavin
>
>
> Thanks for your reply, and apologies for my tardiness.
>
>
> [1] All my calculations are run in MPI-parallel on our HPC cluster. I
> cannot execute any 'x lapw[0,1,2] -p' command in the terminal (on the
> cluster login node); this results in 'pbsssh: command not found'.
> However, submitting via the SLURM workload manager works fine. In all
> my submit scripts, I specify 'setenv SCRATCH /scratch/$USER', which is
> the proper location of scratch storage on our HPC cluster.
>
>
> [2] Without having tried your example for diamond, I can report that
> 'run_lapw -p' followed by 'x qtl -p -telnes' works without problems
> for a single cell of Vanadium dioxide. However, for other systems I
> get the error I specified. The other systems (1) are larger, and (2)
> use two CPUs instead of a single CPU (the .machines files are modified
> suitably).
>
> Checking the qtl.def file for the calculation that _did_ work, I can
> see that the line specifying '/scratch/chrsop/VO2.vectordn' is _also_
> present here, so this is not to blame. This leaves me baffled as to
> what the error can be - as far as I can tell, I am trying to perform
> the exact same calculation for different systems. I thought maybe
> insufficient scratch storage could be to blame, but this would most
> likely show up in the 'run_lapw' cycles (I believe).
>
>
> [3] I am posting here the difference between qtlpara and lapw2para:
>
> $ grep "single" $WIENROOT/qtlpara_lapw
> testinput .processes single
> $ grep "single" $WIENROOT/lapw2para_lapw
> testinput .processes single
> single:
> echo "running in single mode"
>
> ... if this is wrong, I kindly request advice on how to fix it, so I
> can pass it on to our software maintenance guy. If there's anything
> else I can try please let me know.
>
> Best regards
> Christian