[Wien] Need extensive help with a job file for a Slurm job scheduler cluster

Dr. K. C. Bhamu kcbhamu85 at gmail.com
Sun Nov 15 08:07:02 CET 2020


Additional information (this may be the main cause of the lapw1 crash):
bc works only on the head node; node11 and the other client nodes do not
have bc installed.
If bc is the only issue, is it possible to modify the job file so that it
uses bc on the head node only?
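
A quick way to confirm this from inside a batch job is sketched below (a
minimal sketch only; the job name check_bc is illustrative, and the 52core
partition is taken from the job file further down):

#!/bin/bash
#SBATCH -J check_bc
#SBATCH -p 52core
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j

# Report whether bc is available on the allocated compute node
echo "Running on $(hostname)"
if command -v bc >/dev/null 2>&1; then
    echo "bc found at $(command -v bc)"
else
    echo "bc NOT found on $(hostname)"
fi

If bc is indeed missing, the usual remedy is to have it installed on the
compute nodes (e.g. yum install bc) or to place a bc binary in a directory
exported to the nodes and prepend that directory to PATH in the job file.
The lapw*para scripts call bc on the node where the job runs, so using bc
on the head node only would require modifying those scripts. The
"number_per_job: Subscript out of range" message is consistent with
lapw1para failing at its bc-based arithmetic.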

Thank you
Bhamu

On Sun, Nov 15, 2020 at 12:25 PM Dr. K. C. Bhamu <kcbhamu85 at gmail.com>
wrote:

> Dear Gavin and Prof. Marks
> Thank you for your inputs.
> qsub MyJobFile.job creates the .machines file.
>
> With the job file given below, I could create the proper .machine files
> (as many as the cores requested on the node, plus the .machines file), but
> lapw1 always crashes.
>
> *case.dayfile is*
>
> Calculating pbe in /home/kcbhamu/work/test/pbe
> on node11 with PID 9241
> using WIEN2k_19.1 (Release 25/6/2019) in /home/kcbhamu/soft/w2k192
>
>
>     start (Sun Nov 15 15:42:05 KST 2020) with lapw0 (40/99 to go)
>
>     cycle 1 (Sun Nov 15 15:42:05 KST 2020) (40/99 to go)
>
> >   lapw0   -p (15:42:05) starting parallel lapw0 at Sun Nov 15 15:42:05 KST 2020
> -------- .machine0 : processors
> running lapw0 in single mode
> 7.281u 0.272s 0:07.64 98.8% 0+0k 1000+1216io 0pf+0w
> >   lapw1  -p     (15:42:13) starting parallel lapw1 at Sun Nov 15 15:42:13 KST 2020
> ->  starting parallel LAPW1 jobs at Sun Nov 15 15:42:13 KST 2020
> running LAPW1 in parallel mode (using .machines)
> 16 number_of_parallel_jobs
> 0.200u 0.369s 0:00.59 94.9% 0+0k 208+456io 0pf+0w
> error: command   /home/kcbhamu/soft/w2k192/lapw1para lapw1.def   failed
>
> >   stop error
>
> *The job.eout file reports the error below:*
>
> bc: Command not found.
>  LAPW0 END
> bc: Command not found.
> number_per_job: Subscript out of range.
> grep: *scf1*: No such file or directory
> grep: lapw2*.error: No such file or directory
>
>
> *The .machines file is given below:*
>
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> 1:node11
> granularity:1
> extrafine:1
>
>
> *parallel_options file*
> setenv TASKSET "no"
> if ( ! $?USE_REMOTE ) setenv USE_REMOTE 0
> if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv DELAY 0.1
> setenv SLEEPY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
> setenv CORES_PER_NODE 16
>
> *job file*
>
> #!/bin/bash    # bash, not plain sh: the script uses bash arrays and BASH_REMATCH
> #SBATCH -J test
> #SBATCH -p 52core    # This is the name of the partition.
> #SBATCH -N 1
> #SBATCH -n 16
> #SBATCH -o %x.o%j
> #SBATCH -e %x.e%j
> #export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
>
> export OMP_NUM_THREADS=16     # I have checked with 1, 2, 4, and 8 also.
>
> # Use , as list separator
> IFS=','
> # Convert string to array
> hcpus=($SLURM_JOB_CPUS_PER_NODE)
> unset IFS
>
> declare -a conv
>
> # Expand compressed Slurm array
> for cpu in ${hcpus[@]}; do
>     if [[ $cpu =~ (.*)\((.*)x\) ]]; then
>         # found compressed value
>         value=${BASH_REMATCH[1]}
>         factor=${BASH_REMATCH[2]}
>         for j in $(seq 1 $factor); do
>             conv=( ${conv[*]} $value )
>         done
>     else
>         conv=( ${conv[*]} $cpu )
>     fi
> done
>
> # Build .machines file
> rm -f .machines
>
> nhost=0
>
> echo ${conv[@]};
>
> IFS=','
> for node in $SLURM_NODELIST
> do
>     declare -i cpuspernode=${conv[$nhost]}
>     for ((i=0; i<${cpuspernode}; i++))
>     do
>         echo 1:$node >> .machines
>     done
>     let nhost+=1
> done
>
> echo 'granularity:1' >>.machines
> echo 'extrafine:1' >>.machines
>
>
> run_lapw -p
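>
> For debugging, a few diagnostic lines could be placed immediately before
> the run_lapw -p call above (a sketch only, using standard shell commands;
> nothing here is WIEN2k-specific):
>
> # --- optional sanity checks (sketch), placed just before run_lapw -p ---
> echo "Generated .machines:" ; cat .machines
> echo "bc on $(hostname): $(command -v bc || echo NOT FOUND)"
> echo "OMP_NUM_THREADS=$OMP_NUM_THREADS, k-parallel jobs: $(grep -c '^1:' .machines)"
>
> Note also that with 16 entries in .machines, run_lapw -p starts 16 lapw1
> processes, so OMP_NUM_THREADS=16 would oversubscribe the 16 requested
> cores; OMP_NUM_THREADS=1 (or 2) is the usual choice for pure k-point
> parallelism.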
>
>
> Thank you very much
>
> Regards
> Bhamu
>
>
>
> On Fri, Nov 13, 2020 at 7:04 PM Gavin Abo <gsabo at crimson.ua.edu> wrote:
>
>> If you have a look at [1], you can see that different cluster systems
>> have different commands for job submission.
>>
>> I did not see it clearly shown in your post how the job was submitted;
>> for example, did you perhaps use something similar to that at [2]:
>>
>> $ sbatch MyJobScript.sh
>>
>> *What command creates your .machines file?*
>>
>> In your MyJobScript.sh below, I'm not seeing any lines that create a
>> .machines file.
>> MyJobScript.sh
>>
>> --------------------------------------------------------------------------------------------------------
>> #!/bin/sh
>> #SBATCH -J test #job name
>> #SBATCH -p 44core #partition name
>> #SBATCH -N 1 #node
>> #SBATCH -n 18 #core
>> #SBATCH -o %x.o%j
>> #SBATCH -e %x.e%j
>> export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so #Do not change here!!
>> srun ~/soft/qe66/bin/pw.x  < case.in > case.out
>> --------------------------------------------------------------------------------------------------------
>>
>>
>> The available job files in the FAQs are not working. They give me only the
>> files .machine0, .machines, and .machines_current, wherein .machines
>> contains only # and the other two are empty.
>>
>> In the Slurm documentation at [3], it looks like there is a variable that
>> helps create, on the fly, the list of nodes that needs to be written to
>> the .machines file:
>>
>> SLURM_JOB_NODELIST (and SLURM_NODELIST for backwards compatibility)
>>
>> I'm not seeing this in your MyJobScript.sh, unlike in other job scripts
>> found on the Internet, for example [4-7]; a minimal sketch follows the
>> references below.
>> [1] https://slurm.schedmd.com/rosetta.pdf
>> [2] https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html
>> [3] https://slurm.schedmd.com/sbatch.html
>> [4] https://itp.uni-frankfurt.de/wiki-it/index.php/Wien2k
>> [5] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15511.html
>> [6] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg07097.html
>> [7] https://www.nsc.liu.se/software/installed/tetralith/wien2k/
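>>
>> For illustration only, a minimal sketch of such a block (assumptions: pure
>> k-point parallelism with one lapw1 job per allocated task; scontrol show
>> hostnames expands the compressed node list, and SLURM_NTASKS_PER_NODE is
>> only set when --ntasks-per-node is requested):
>>
>> # Build .machines from the Slurm allocation (sketch)
>> rm -f .machines
>> for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
>>     # repeat each node once per task assigned to it
>>     for i in $(seq 1 ${SLURM_NTASKS_PER_NODE:-1}); do
>>         echo "1:$node" >> .machines
>>     done
>> done
>> echo "granularity:1" >> .machines
>> echo "extrafine:1"   >> .machines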
>>
>> On 11/13/2020 3:37 AM, Laurence Marks wrote:
>>
>> N.B., example mid-term questions:
>> 1. What SBATCH command will give you 3 nodes?
>> 2. What command creates your .machines file?
>> 3. What are your fastest and slowest nodes?
>> 4. Which nodes have the best communications?
>>
>> N.B., please don't post your answers -- just understand!
>>
>> _____
>> Professor Laurence Marks
>> "Research is to see what everybody else has seen, and to think what
>> nobody else has thought", Albert Szent-Gyorgi
>> www.numis.northwestern.edu
>>
>> On Fri, Nov 13, 2020, 04:21 Laurence Marks <laurence.marks at gmail.com>
>> wrote:
>>
>>> Much of what you are requesting is problem/cluster specific, so there is
>>> no magic answer -- it will vary. Suggestions:
>>> 1) Read the UG sections on .machines and parallel operation.
>>> 2) Read the man page for your cluster job command (srun)
>>> 3) Reread the UG sections.
>>> 4) Read the example scripts, and understand (lookup) all the commands so
>>> you know what they are doing.
>>>
>>> It is really not that complicated. If you cannot master this by
>>> yourself, I will wonder whether you are in the right profession.
>>>
>>> _____
>>> Professor Laurence Marks
>>> "Research is to see what everybody else has seen, and to think what
>>> nobody else has thought", Albert Szent-Gyorgi
>>> www.numis.northwestern.edu
>>>
>>> On Fri, Nov 13, 2020, 03:24 Dr. K. C. Bhamu <kcbhamu85 at gmail.com> wrote:
>>>
>>>> Dear All
>>>>
>>>> I need your extensive help.
>>>> I have tried to provide full details that can help you understand my
>>>> requirement. In case I have missed something, please let me know.
>>>>
>>>> I am looking for a job file for our cluster. The available job files in
>>>> the FAQs are not working. They give me only the files .machine0,
>>>> .machines, and .machines_current, wherein .machines contains only # and
>>>> the other two are empty.
>>>>
>>>> The script that works fine for Quantum ESPRESSO on the 44core
>>>> partition is below:
>>>> #!/bin/sh
>>>> #SBATCH -J test #job name
>>>> #SBATCH -p 44core #partition name
>>>> #SBATCH -N 1 #node
>>>> #SBATCH -n 18 #core
>>>> #SBATCH -o %x.o%j
>>>> #SBATCH -e %x.e%j
>>>> export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so #Do not change here!!
>>>> srun ~/soft/qe66/bin/pw.x < case.in > case.out
>>>>
>>>> I have compiled WIEN2k_19.2 on the CentOS queuing system; the head node
>>>> runs CentOS with kernel Linux 3.10.0-1127.19.1.el7.x86_64.
>>>>
>>>> I used compilers_and_libraries_2020.2.254, fftw-3.3.8, and libxc-4.34
>>>> for the installation.
>>>>
>>>> The details of the nodes that I can use are as follows (I can log in
>>>> to these nodes with my user password):
>>>> NODELIST  NODES  PARTITION  STATE      CPUS  S:C:T    MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
>>>> elpidos   1      master     idle       4     4:1:1    15787   0         1       (null)    none
>>>> node01    1      72core     allocated  72    72:1:1   515683  0         1       (null)    none
>>>> node02    1      72core     allocated  72    72:1:1   257651  0         1       (null)    none
>>>> node03    1      72core     allocated  72    72:1:1   257651  0         1       (null)    none
>>>> node09    1      44core     mixed      44    44:1:1   128650  0         1       (null)    none
>>>> node10    1      44core     mixed      44    44:1:1   128649  0         1       (null)    none
>>>> node11    1      52core*    allocated  52    52:1:1   191932  0         1       (null)    none
>>>> node12    1      52core*    allocated  52    52:1:1   191932  0         1       (null)    none
>>>>
>>>> The other nodes run a mixture of kernels, as listed below:
>>>>
>>>>    OS=Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020
>>>>    OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>>>>    OS=Linux 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016
>>>>    OS=Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
>>>>
>>>> Your extensive help will improve my research productivity.
>>>>
>>>> Thank you very much.
>>>> Regards
>>>> Bhamu
>>>>
>>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> SEARCH the MAILING-LIST at:
>> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>>
>