[Wien] qtl: error reading parallel vectors
Gavin Abo
gsabo at crimson.ua.edu
Sun Oct 25 01:19:25 CEST 2020
Regarding [1], I did expect that you would have to submit the commands
within a job script via the SLURM workload manager on your system, with
something like [5,6]
sbatch my_job_script.job
or by whatever method you have to use on your system, where the
commands from [7] go into the job file, such as:
my_job_script.job
-------------------------------------
#!/bin/bash
# ...
run_lapw -p
x qtl -p -telnes
x telnes3
-------------------------------------
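If it is helpful, a minimal SLURM version of such a job file might look
like the sketch below. This is only an assumption about your setup; the
#SBATCH values, the job-file name, and the SCRATCH path are placeholders
that you would have to adapt to your cluster:
my_slurm_job.job
-------------------------------------
#!/bin/bash
#SBATCH --job-name=telnes_test     # placeholder job name
#SBATCH --nodes=1                  # adjust to the node(s) you want to test
#SBATCH --ntasks-per-node=4        # cores available to the parallel run
#SBATCH --time=01:00:00            # wall-time limit, adjust as needed

# site-specific scratch location (bash equivalent of a csh 'setenv SCRATCH ...')
export SCRATCH=/scratch/$USER

# ... generate or copy a suitable .machines file into the working directory ...

run_lapw -p
x qtl -p -telnes
x telnes3
-------------------------------------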
In my case, I don't have SLURM, so I'm unable to do any testing in
that environment. Maybe someone else on the mailing list who has a SLURM
system can check whether they encounter the same problem that you are
having.
[5]
https://www.hpc2n.umu.se/documentation/batchsystem/basic-submit-example-scripts
[6] https://doku.lrz.de/display/PUBLIC/WIEN2k
[7]
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20597.html
Regarding [2], it is good to read that mpi parallel with "x qtl -p -telnes"
works fine on your system for Vanadium Dioxide (VO2). If you have
control over which nodes the calculation runs on, does the VO2 case run
fine on your 1st node (e.g., x073 [8]) with multiple cores of a single
CPU, and then does it run fine on the 2nd node (e.g., x082) with
multiple cores of a single CPU? I have read at [9] that some scheduling
managers assign the nodes automatically on the fly, so that in some
cases the user has no control over which nodes the job will run on. If
you are able to control it, does the VO2 case also run fine with mpi
parallel using 1 processor core on node 1 and 1 processor core on node 2
(see the .machines sketches below)? That may help narrow down the problem.
[8]
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20617.html
[9] http://susi.theochem.tuwien.ac.at/reg_user/faq/pbs.html
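For those two tests, the .machines files might look something like the
minimal sketches below. The node names x073 and x082 are only examples
taken from [8] and the core counts are guesses; the exact lines depend
on your hardware and on how your job script generates .machines:
.machines (test on the 1st node only)
-------------------------------------
# mpi parallel on multiple cores (here 4) of node x073
1:x073:4
granularity:1
extrafine:1
-------------------------------------
.machines (test across both nodes)
-------------------------------------
# mpi parallel with 1 core on x073 plus 1 core on x082
1:x073:1 x082:1
granularity:1
extrafine:1
-------------------------------------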
Regarding [3], the output you posted looks as expected, so nothing is
wrong there.
In the past, I posted in the mailing list some things that I found
helpful for troubleshooting parallel issues, but you would have to
search the mailing list to find them. I believe a couple of them may
have been at the following two links:
[10]
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17973.html
[11]
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2018-April/027944.html
Lastly, I have now tried a WIEN2k 19.2 calculation using mpi parallel on
my system with the struct file at
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg20645.html .
It looks like it ran fine when it was set to run on two of the four
processor cores on my system:
username@computername:~/wiendata/diamond$ ls ~/wiendata/scratch
username@computername:~/wiendata/diamond$ ls
diamond.struct
username@computername:~/wiendata/diamond$ init_lapw -b
...
username@computername:~/wiendata/diamond$ cat $WIENROOT/parallel_options
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
username@computername:~/wiendata/diamond$ cat .machines
1:localhost:2
granularity:1
extrafine:1
username@computername:~/wiendata/diamond$ run_lapw -p
...
in cycle 11 ETEST: .0001457550000000 CTEST: .0033029
hup: Command not found.
STOP LAPW0 END
STOP LAPW1 END
real 0m6.744s
user 0m12.679s
sys 0m0.511s
STOP LAPW2 - FERMI; weights written
STOP LAPW2 END
real 0m1.123s
user 0m1.785s
sys 0m0.190s
STOP SUMPARA END
STOP CORE END
STOP MIXER END
ec cc and fc_conv 1 1 1
> stop
username@computername:~/wiendata/diamond$ cp $WIENROOT/SRC_templates/case.innes diamond.innes
username@computername:~/wiendata/diamond$ x qtl -p -telnes
running QTL in parallel mode
calculating QTL's from parallel vectors
STOP QTL END
6.5u 0.0s 0:06.77 98.3% 0+0k 928+8080io 4pf+0w
username@computername:~/wiendata/diamond$ cat diamond.inq
0 2.20000000000000000000
1
1 99 1 0
4 0 1 2 3
username@computername:~/wiendata/diamond$ x telnes3
STOP TELNES3 DONE
3.2u 0.0s 0:03.39 98.8% 0+0k 984+96io 3pf+0w
username@computername:~/wiendata/diamond$ ls -l ~/wiendata/scratch
total 624
-rw-rw-r-- 1 username username 0 Oct 24 15:40 diamond.vector
-rw-rw-r-- 1 username username 637094 Oct 24 15:43 diamond.vector_1
-rw-rw-r-- 1 username username 0 Oct 24 15:44 diamond.vectordn
-rw-rw-r-- 1 username username 0 Oct 24 15:44 diamond.vectordn_1
On 10/24/2020 2:30 PM, Christian Søndergaard Pedersen wrote:
>
> Hello Gavin
>
>
> Thanks for your reply, and apologies for my tardiness.
>
>
> [1] All my calculations are run in MPI-parallel on our HPC cluster. I
> cannot execute any 'x lapw[0,1,2] -p' command in the terminal (on the
> cluster login node); this results in 'pbsssh: command not found'.
> However, submitting via the SLURM workload manager works fine. In all
> my submit scripts, I specify 'setenv SCRATCH /scratch/$USER', which is
> the proper location of scratch storage on our HPC cluster.
>
>
> [2] Without having tried your example for diamond, I can report that
> 'run_lapw -p' followed by 'x qtl -p -telnes' works without problems
> for a single cell of Vanadium dioxide. However, for other systems I
> get the error I specified. The other systems (1) are larger, and (2)
> use two CPUs instead of a single CPU (the .machines files are modified
> suitably).
>
> Checking the qtl.def file for the calculation that _did_ work, I can
> see that the line specifying '/scratch/chrsop/VO2.vectordn' is _also_
> present here, so this is not to blame. This leaves me baffled as to
> what the error can be - as far as I can tell, I am trying to perform
> the exact same calculation for different systems. I thought maybe
> insufficient scratch storage could be to blame, but this would most
> likely show up in the 'run_lapw' cycles (I believe).
>
>
> [3] I am posting here the difference between qtlpara and lapw2para:
>
> $ grep "single" $WIENROOT/qtlpara_lapw
> testinput .processes single
> $ grep "single" $WIENROOT/lapw2para_lapw
> testinput .processes single
> single:
> echo "running in single mode"
>
> ... if this is wrong, I kindly request advice on how to fix it, so I
> can pass it on to our software maintenance guy. If there's anything
> else I can try please let me know.
>
> Best regards
> Christian