[Wien] Need extensive help for a job file for slurm job scheduler cluster

Laurence Marks laurence.marks at gmail.com
Sun Nov 15 10:14:41 CET 2020


If your sysadmin does not want to install bc, you can install it yourself
into ~/bin if that is NFS-mounted, or into some other NFS-mounted directory
that is already in your PATH or that you add to it. Do a Google search on
"linux how to install bc".
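
For example, a minimal sketch of a user-level build (assuming GNU bc 1.07.1
and that $HOME is the NFS-mounted directory; the build may also need the ed
and texinfo packages installed):

  wget https://ftp.gnu.org/gnu/bc/bc-1.07.1.tar.gz
  tar xzf bc-1.07.1.tar.gz
  cd bc-1.07.1
  ./configure --prefix=$HOME    # "make install" then puts the bc binary into ~/bin
  make && make install
  export PATH=$HOME/bin:$PATH   # bash; for csh/tcsh use: set path = (~/bin $path)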

WIEN2k requires a working bc and standard Linux commands on all nodes.
There is no way around this short of a massive rewrite (which nobody will
do for you).
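
A quick way to check which nodes actually have it (hypothetical one-liners,
assuming you can ssh to the compute nodes or run srun against them):

  ssh node11 'which bc'             # prints the path if bc exists on node11
  srun -w node11 -N1 -n1 which bc   # the same check through Slurm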

_____
Professor Laurence Marks
"Research is to see what everybody else has seen, and to think what nobody
else has thought", Albert Szent-Gyorgi
www.numis.northwestern.edu

On Sun, Nov 15, 2020, 02:19 Tran, Fabien <fabien.tran at tuwien.ac.at> wrote:

> Probably the simplest solution is to ask the system administrator to
> install bc on all nodes:
> https://urldefense.com/v3/__https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17679.html__;!!Dq0X2DkFhyF93HkjWTBQKhk!A8k2iWg8Er5xRYR-yIFE18w7dC40LekNfBwq17MhDvT_aN0YCI47gIpy1J4evqZPKah9Bg$
>
>
> From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> on behalf of Dr. K.
> C. Bhamu <kcbhamu85 at gmail.com>
> Sent: Sunday, November 15, 2020 8:07 AM
> To: A Mailing list for WIEN2k users
> Subject: Re: [Wien] Need extensive help for a job file for slurm job
> scheduler cluster
>
>
>
> Additional information (maybe this is the main cause of the lapw1 crash):
> bc works only on the head node; node11 and the other client nodes do not
> have bc installed.
> If bc is the only issue, is it possible to modify the job file so that it
> uses bc on the head node only?
>
>
> Thank you
> Bhamu
>
>
> On Sun, Nov 15, 2020 at 12:25 PM Dr. K. C. Bhamu <kcbhamu85 at gmail.com>
> wrote:
>
>
>
> Dear Gavin and Prof. Marks
> Thank you for your inputs.
> qsub MyJobFIle.job creates the .machines file.
>
>
> With the job file given below, I could create the proper .machine* files
> (one per core in the node, plus the .machines file), but lapw1
> always crashes.
>
>
> case.dayfile is
>
> Calculating pbe in /home/kcbhamu/work/test/pbe
> on node11 with PID 9241
> using WIEN2k_19.1 (Release 25/6/2019) in /home/kcbhamu/soft/w2k192
>
>
>     start (Sun Nov 15 15:42:05 KST 2020) with lapw0 (40/99 to go)
>
>     cycle 1 (Sun Nov 15 15:42:05 KST 2020) (40/99 to go)
>
> >   lapw0   -p (15:42:05) starting parallel lapw0 at Sun Nov 15 15:42:05
> KST 2020
> -------- .machine0 : processors
> running lapw0 in single mode
> 7.281u 0.272s 0:07.64 98.8% 0+0k 1000+1216io 0pf+0w
> >   lapw1  -p     (15:42:13) starting parallel lapw1 at Sun Nov 15
> 15:42:13 KST 2020
> ->  starting parallel LAPW1 jobs at Sun Nov 15 15:42:13 KST 2020
> running LAPW1 in parallel mode (using .machines)
> 16 number_of_parallel_jobs
> 0.200u 0.369s 0:00.59 94.9% 0+0k 208+456io 0pf+0w
> error: command   /home/kcbhamu/soft/w2k192/lapw1para lapw1.def   failed
>
> >   stop error
>
>
>
> The job.eout file shows the following error:
>
>
> bc: Command not found.
>  LAPW0 END
> bc: Command not found.
> number_per_job: Subscript out of range.
> grep: *scf1*: No such file or directory
> grep: lapw2*.error: No such file or directory
>
>
>
> The .machines file is given below:
>
>
>
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> granularity:1
> extrafine:1
>
>
>
>
>
> parallel_options file
> setenv TASKSET "no"
> if ( ! $?USE_REMOTE ) setenv USE_REMOTE 0
> if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv DELAY 0.1
> setenv SLEEPY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
> setenv CORES_PER_NODE 16
>
>
>
> job file
>
>
> #!/bin/sh
> #SBATCH -J test
> #SBATCH -p 52core    # This is the name of the partition.
> #SBATCH -N 1
> #SBATCH -n 16
> #SBATCH -o %x.o%j
> #SBATCH -e %x.e%j
> #export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
>
> export OMP_NUM_THREADS=16     # I have also checked with 1, 2, 4, and 8.
>
> # Use , as list separator
> IFS=','
> # Convert string to array
> hcpus=($SLURM_JOB_CPUS_PER_NODE)
> unset IFS
>
> declare -a conv
>
> # Expand compressed slurm array
> for cpu in ${hcpus[@]}; do
>     if [[ $cpu =~ (.*)\(x(.*)\) ]]; then
>         # found compressed value
>         value=${BASH_REMATCH[1]}
>         factor=${BASH_REMATCH[2]}
>         for j in $(seq 1 $factor); do
>             conv=( ${conv[*]} $value )
>         done
>     else
>         conv=( ${conv[*]} $cpu )
>     fi
> done
>
> # Build .machines file
> rm -f .machines
>
> nhost=0
>
> echo ${conv[@]}
>
> IFS=','
> for node in $SLURM_NODELIST
> do
>     declare -i cpuspernode=${conv[$nhost]}
>     for ((i=0; i<${cpuspernode}; i++))
>     do
>         echo 1:$node >> .machines
>     done
>     let nhost+=1
> done
>
> echo 'granularity:1' >> .machines
> echo 'extrafine:1' >> .machines
>
>
> run_lapw -p
>
>
>
>
>
> Thank you very much
>
>
> Regards
> Bhamu
>
>
>
>
>
>
> On Fri, Nov 13, 2020 at 7:04 PM Gavin Abo <gsabo at crimson.ua.edu> wrote:
>
> If you have a look at [1], you can see that different cluster systems
> have different commands for job submission.
> It was not clearly shown in your post how the job was submitted; for
> example, did you perhaps use something similar to that at [2]:
> $ sbatch MyJobScript.sh
>
> What command creates your .machines file?
>
> In your MyJobScript.sh below, I'm not seeing any lines that create a
> .machines file.
>
>  MyJobScript.sh
>
> --------------------------------------------------------------------------------------------------------
> #!/bin/sh
> #SBATCH -J test #job name
> #SBATCH -p 44core #partition name
> #SBATCH -N 1 #node
> #SBATCH -n 18 #core
> #SBATCH -o %x.o%j
> #SBATCH -e %x.e%j
> export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so #Do not change here!!
> srun ~/soft/qe66/bin/pw.x  < case.in > case.out
> --------------------------------------------------------------------------------------------------------
> The available job files in the FAQs are not working. They give me only the
> .machine0, .machines, and .machines_current files, where .machines contains
> only # and the other two are empty.
>
> In the Slurm documentation at [3], it looks like there is a variable that
> helps create, on the fly, the list of nodes that would need to be written
> to the .machines file:
> SLURM_JOB_NODELIST (and SLURM_NODELIST for backwards compatibility)
>
> I'm not seeing this in your MyJobScript.sh, unlike the job scripts found
> on the Internet, for example [4-7].
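>
> For instance, a minimal sketch of that approach (hypothetical, assuming the
> job is submitted with --ntasks-per-node so that SLURM_NTASKS_PER_NODE is
> defined; scontrol is part of Slurm):
>
> --------------------------------------------------------------------------------------------------------
> rm -f .machines
> # expand e.g. "node[11-12]" into one hostname per line
> for host in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
>     # one "1:host" line per task requested on that node
>     for i in $(seq 1 $SLURM_NTASKS_PER_NODE); do
>         echo "1:$host" >> .machines
>     done
> done
> echo 'granularity:1' >> .machines
> echo 'extrafine:1' >> .machines
> --------------------------------------------------------------------------------------------------------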
>
>  [1]
> https://urldefense.com/v3/__https://slurm.schedmd.com/rosetta.pdf__;!!Dq0X2DkFhyF93HkjWTBQKhk!A8k2iWg8Er5xRYR-yIFE18w7dC40LekNfBwq17MhDvT_aN0YCI47gIpy1J4evqYGvGvdRw$
> [2]
> https://urldefense.com/v3/__https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html__;!!Dq0X2DkFhyF93HkjWTBQKhk!A8k2iWg8Er5xRYR-yIFE18w7dC40LekNfBwq17MhDvT_aN0YCI47gIpy1J4evqa4_WIupQ$
> [3]
> https://urldefense.com/v3/__https://slurm.schedmd.com/sbatch.html__;!!Dq0X2DkFhyF93HkjWTBQKhk!A8k2iWg8Er5xRYR-yIFE18w7dC40LekNfBwq17MhDvT_aN0YCI47gIpy1J4evqZivNWp_g$
> [4]
> https://urldefense.com/v3/__https://itp.uni-frankfurt.de/wiki-it/index.php/Wien2k__;!!Dq0X2DkFhyF93HkjWTBQKhk!A8k2iWg8Er5xRYR-yIFE18w7dC40LekNfBwq17MhDvT_aN0YCI47gIpy1J4evqbx3r5fSw$
> [5]
> https://urldefense.com/v3/__https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15511.html__;!!Dq0X2DkFhyF93HkjWTBQKhk!A8k2iWg8Er5xRYR-yIFE18w7dC40LekNfBwq17MhDvT_aN0YCI47gIpy1J4evqZWR8-nPw$
> [6]
> https://urldefense.com/v3/__https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg07097.html__;!!Dq0X2DkFhyF93HkjWTBQKhk!A8k2iWg8Er5xRYR-yIFE18w7dC40LekNfBwq17MhDvT_aN0YCI47gIpy1J4evqZ_DQBbXA$
> [7]
> https://urldefense.com/v3/__https://www.nsc.liu.se/software/installed/tetralith/wien2k/__;!!Dq0X2DkFhyF93HkjWTBQKhk!A8k2iWg8Er5xRYR-yIFE18w7dC40LekNfBwq17MhDvT_aN0YCI47gIpy1J4evqYdCEsGjw$
>
>
> On 11/13/2020 3:37 AM, Laurence Marks wrote:
>
>
> N.B., example mid-term questions:
> 1. What SBATCH command will give you 3 nodes?
> 2. What command creates your .machines file?
> 3. What are your fastest and slowest nodes?
> 4. Which nodes have the best communication?
>
>
> N.B., please don't post your answers -- just understand!
>
>
> _____
> Professor Laurence Marks
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought", Albert Szent-Gyorgi
> http://www.numis.northwestern.edu
>
>
> On Fri, Nov 13, 2020, 04:21 Laurence Marks <laurence.marks at gmail.com>
> wrote:
>
> Much of what you are requesting is problem/cluster specific, so there is
> no magic answer -- it will vary. Suggestions:
> 1) Read the UG sections on .machines and parallel operation.
> 2) Read the man page for your cluster job command (srun)
> 3) Reread the UG sections.
> 4) Read the example scripts, and understand (lookup) all the commands so
> you know what they are doing.
>
>
> It is really not that complicated. If you cannot master this by yourself,
> I will wonder whether you are in the right profession.
>
>
> _____
> Professor Laurence Marks
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought", Albert Szent-Gyorgi
> http://www.numis.northwestern.edu
>
>
> On Fri, Nov 13, 2020, 03:24 Dr. K. C. Bhamu <kcbhamu85 at gmail.com> wrote:
>
>
> Dear All
>
>
> I need your extensive help.
> I have tried to provide full details to help you understand my
> requirements. In case I have missed something, please let me know.
>
>
> I am looking for a job file for our cluster. The available job files in the
> FAQs are not working. They give me only the .machine0, .machines, and
> .machines_current files, where .machines contains only # and the other two
> are empty.
>
>
>
> The script that works fine for Quantum ESPRESSO on the 44core partition
> is below:
> #!/bin/sh
> #SBATCH -J test #job name
> #SBATCH -p 44core #partition name
> #SBATCH -N 1 #node
> #SBATCH -n 18 #core
> #SBATCH -o %x.o%j
> #SBATCH -e %x.e%j
> export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so #Do not change here!!
> srun ~/soft/qe66/bin/pw.x  <  case.in > case.out
>
>
>
> I have compiled Wien2k_19.2 on the CentOS cluster, whose head node runs
> the CentOS kernel Linux 3.10.0-1127.19.1.el7.x86_64.
>
>
> I used compilers_and_libraries_2020.2.254, fftw-3.3.8, and libxc-4.34 for
> the installation.
>
>
> The details of the nodes that I can use are as follows (I can log in to
> these nodes with my user password):
> NODELIST  NODES PARTITION     STATE CPUS   S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> elpidos       1    master      idle    4   4:1:1  15787        0      1   (null) none
> node01        1    72core allocated   72  72:1:1 515683        0      1   (null) none
> node02        1    72core allocated   72  72:1:1 257651        0      1   (null) none
> node03        1    72core allocated   72  72:1:1 257651        0      1   (null) none
> node09        1    44core     mixed   44  44:1:1 128650        0      1   (null) none
> node10        1    44core     mixed   44  44:1:1 128649        0      1   (null) none
> node11        1   52core* allocated   52  52:1:1 191932        0      1   (null) none
> node12        1   52core* allocated   52  52:1:1 191932        0      1   (null) none
>
>
>
> The other nodes run a mixture of kernels, as listed below.
>
>
>    OS=Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020
>    OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>    OS=Linux 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016
>    OS=Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
>
>
>
> Your extensive help will improve my research productivity.
>
>
> Thank you very much.
> Regards
> Bhamu