[Wien] Need extensive help for a job file for slurm job scheduler cluster

Dr. K. C. Bhamu kcbhamu85 at gmail.com
Sun Nov 15 07:55:36 CET 2020


Dear Gavin and Prof. Marks,
Thank you for your inputs.
qsub MyJobFile.job creates the .machines file.

With the job file given below I can create the proper .machine* files
(one per core requested on the node) together with the .machines file, but
lapw1 always crashes.

*case.dayfile is*

Calculating pbe in /home/kcbhamu/work/test/pbe
on node11 with PID 9241
using WIEN2k_19.1 (Release 25/6/2019) in /home/kcbhamu/soft/w2k192


    start (Sun Nov 15 15:42:05 KST 2020) with lapw0 (40/99 to go)

    cycle 1 (Sun Nov 15 15:42:05 KST 2020) (40/99 to go)

>   lapw0   -p (15:42:05) starting parallel lapw0 at Sun Nov 15 15:42:05
KST 2020
-------- .machine0 : processors
running lapw0 in single mode
7.281u 0.272s 0:07.64 98.8% 0+0k 1000+1216io 0pf+0w
>   lapw1  -p     (15:42:13) starting parallel lapw1 at Sun Nov 15 15:42:13
KST 2020
->  starting parallel LAPW1 jobs at Sun Nov 15 15:42:13 KST 2020
running LAPW1 in parallel mode (using .machines)
16 number_of_parallel_jobs
0.200u 0.369s 0:00.59 94.9% 0+0k 208+456io 0pf+0w
error: command   /home/kcbhamu/soft/w2k192/lapw1para lapw1.def   failed

>   stop error

*The job.eout file shows the following errors:*

bc: Command not found.
 LAPW0 END
bc: Command not found.
number_per_job: Subscript out of range.
grep: *scf1*: No such file or directory
grep: lapw2*.error: No such file or directory
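
The "bc: Command not found." lines suggest that bc is not on the PATH of the
shell that runs the WIEN2k parallel scripts on the allocated node; those csh
scripts rely on bc for their arithmetic, so the k-point distribution likely
fails before any lapw1 job starts, and the later "Subscript out of range" and
grep messages are probably follow-on effects. A minimal check that could be
added to the job file before run_lapw (the path below is only an example, not
a known location on this cluster):

# verify that bc and csh are available on the allocated node
command -v bc  || echo "bc not found on $(hostname)"
command -v csh || echo "csh not found on $(hostname)"
# if bc is installed in a non-default location (example path only),
# prepend it to PATH before calling run_lapw:
# export PATH=/usr/local/bin:$PATH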


*The .machines file is given below:*

1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
1:node11
granularity:1
extrafine:1
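
This looks consistent with -n 16 (sixteen single-core k-point slots on
node11). A quick sanity check that can go into the job script right after the
.machines block, assuming one "1:host" line per requested task, is:

# number of k-point slots; should equal the -n value (16 here)
grep -c '^1:' .machines
echo "SLURM_NTASKS = $SLURM_NTASKS"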


*parallel_options file*
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 0
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
setenv CORES_PER_NODE 16

*job file*

#!/bin/bash
#SBATCH -J test
#SBATCH -p 52core    # This is the name of the partition.
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
#export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

export OMP_NUM_THREADS=16     # I have checked with 1, 2, 4, and 8 as well.
                              # (Note: 16 k-point jobs x 16 threads would
                              # oversubscribe a 52-core node.)

# Use , as list separator
IFS=','
# Convert string to array
hcpus=($SLURM_JOB_CPUS_PER_NODE)
unset IFS

declare -a conv

# Expand compressed Slurm values such as "16(x2)" into one entry per node
for cpu in ${hcpus[@]}; do
    if [[ $cpu =~ (.*)\((.*)x\) ]]; then
        # found compressed value
        value=${BASH_REMATCH[1]}
        factor=${BASH_REMATCH[2]}
        for j in $(seq 1 $factor); do
            conv=( ${conv[*]} $value )
        done
    else
        conv=( ${conv[*]} $cpu )
    fi
done

# Build .machines file
rm -f .machines

nhost=0

echo ${conv[@]}

# Split a comma-separated node list; bracketed ranges such as node[11-12]
# are not expanded here (fine for a single-node job).
IFS=','
for node in $SLURM_NODELIST
do
    declare -i cpuspernode=${conv[$nhost]}
    for ((i=0; i<${cpuspernode}; i++))
    do
        echo 1:$node >> .machines
    done
    let nhost+=1
done

echo 'granularity:1' >> .machines
echo 'extrafine:1' >> .machines


run_lapw -p
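
For reference, a more compact way to build the same .machines file is sketched
below. It assumes scontrol is available inside the job and that the -n tasks
are spread evenly over the allocated nodes; scontrol show hostnames also
expands compressed lists such as node[11-12], which the loop over
$SLURM_NODELIST above does not:

rm -f .machines
for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    # SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node is used;
    # fall back to SLURM_NTASKS, which is fine for a single-node job like this
    ntasks=${SLURM_NTASKS_PER_NODE:-$SLURM_NTASKS}
    for ((i = 0; i < ntasks; i++)); do
        echo "1:$host" >> .machines
    done
done
echo 'granularity:1' >> .machines
echo 'extrafine:1'   >> .machines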


Thank you very much

Regards
Bhamu



On Fri, Nov 13, 2020 at 7:04 PM Gavin Abo <gsabo at crimson.ua.edu> wrote:

> If you have a look at [1], it can be seen that different cluster systems
> have different commands for job submission.
>
> I did not see it clearly shown in your post how the job was submitted, for
> example did you maybe use something similar to that at [2]:
>
> $ sbatch MyJobScript.sh
>
> *What command creates your .machines file?*
>
> In your MyJobScript.sh below, I'm not seeing any lines that create a
> .machines file.
> MyJobScript.sh
>
> --------------------------------------------------------------------------------------------------------
> #!/bin/sh
> #SBATCH -J test #job name
> #SBATCH -p 44core #partition name
> #SBATCH -N 1 #node
> #SBATCH -n 18 #core
> #SBATCH -o %x.o%j
> #SBATCH -e %x.e%j
> export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so #Do not change here!!
> srun ~/soft/qe66/bin/pw.x  < case.in > case.out
> --------------------------------------------------------------------------------------------------------
>
>
> The available job files in the FAQs are not working. They give me
> .machine0          .machines          .machines_current   files only
> wherein .machines has # and the other two are empty.
>
> In the Slurm documentation at [3], it looks like there is a variable for
> helping to create a list of nodes on the fly that would need to be written
> to the .machines file:
>
> SLURM_JOB_NODELIST (and SLURM_NODELIST for backwards compatibility)
>
> I'm not seeing this in your MyJobScript.sh, unlike in other job
> scripts found on the Internet, for example [4-7].
> [1] https://slurm.schedmd.com/rosetta.pdf
> [2] https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html
> [3] https://slurm.schedmd.com/sbatch.html
> [4] https://itp.uni-frankfurt.de/wiki-it/index.php/Wien2k
> [5]
> https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15511.html
> [6]
> https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg07097.html
> [7] https://www.nsc.liu.se/software/installed/tetralith/wien2k/
>
> On 11/13/2020 3:37 AM, Laurence Marks wrote:
>
> N.B., example mid-term questions:
> 1. What SBATCH command will give you 3 nodes?
> 2. What command creates your .machines file?
> 3. What are your fastest and slowest nodes?
> 4. Which nodes have the best communications?
>
> N.B., please don't post your answers -- just understand!
>
> _____
> Professor Laurence Marks
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought", Albert Szent-Gyorgi
> www.numis.northwestern.edu
>
> On Fri, Nov 13, 2020, 04:21 Laurence Marks <laurence.marks at gmail.com>
> wrote:
>
>> Much of what you are requesting is problem/cluster specific, so there is
>> no magic answer -- it will vary. Suggestions:
>> 1) Read the UG sections on .machines and parallel operation.
>> 2) Read the man page for your cluster job command (srun)
>> 3) Reread the UG sections.
>> 4) Read the example scripts, and understand (lookup) all the commands so
>> you know what they are doing.
>>
>> It is really not that complicated. If you cannot master this by yourself,
>> I will wonder whether you are in the right profession.
>>
>> _____
>> Professor Laurence Marks
>> "Research is to see what everybody else has seen, and to think what
>> nobody else has thought", Albert Szent-Gyorgi
>> www.numis.northwestern.edu
>>
>> On Fri, Nov 13, 2020, 03:24 Dr. K. C. Bhamu <kcbhamu85 at gmail.com> wrote:
>>
>>> Dear All
>>>
>>> I need your extensive help.
>>> I have tried to provide full details that can help you understand my
>>> requirement. In case I have missed something, please let me know.
>>>
>>> I am looking for a job file for our cluster. The available job files in
>>> the FAQs are not working. They give me
>>> .machine0          .machines          .machines_current   files only
>>> wherein .machines has # and the other two are empty.
>>>
>>> The script that is working fine for Quantum Espresso on the 44core
>>> partition is below:
>>> #!/bin/sh
>>> #SBATCH -J test #job name
>>> #SBATCH -p 44core #partition name
>>> #SBATCH -N 1 #node
>>> #SBATCH -n 18 #core
>>> #SBATCH -o %x.o%j
>>> #SBATCH -e %x.e%j
>>> export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so #Do not change here!!
>>> srun ~/soft/qe66/bin/pw.x  < case.in > case.out
>>>
>>> I have compiled Wien2k_19.2 on the CentOS cluster; the head node runs
>>> kernel Linux 3.10.0-1127.19.1.el7.x86_64.
>>>
>>> I used compilers_and_libraries_2020.2.254, fftw-3.3.8, and libxc-4.34 for
>>> the installation.
>>>
>>> The details of the nodes that I can use are as follows (I can log in to
>>> these nodes with my user password):
>>> NODELIST   NODES PARTITION       STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>> elpidos        1    master        idle    4  4:1:1  15787        0      1   (null) none
>>> node01         1    72core   allocated   72 72:1:1 515683        0      1   (null) none
>>> node02         1    72core   allocated   72 72:1:1 257651        0      1   (null) none
>>> node03         1    72core   allocated   72 72:1:1 257651        0      1   (null) none
>>> node09         1    44core       mixed   44 44:1:1 128650        0      1   (null) none
>>> node10         1    44core       mixed   44 44:1:1 128649        0      1   (null) none
>>> node11         1   52core*   allocated   52 52:1:1 191932        0      1   (null) none
>>> node12         1   52core*   allocated   52 52:1:1 191932        0      1   (null) none
>>>
>>> The other nodes run a mixture of kernels, as listed below.
>>>
>>>    OS=Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020
>>>    OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>>>    OS=Linux 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016
>>>    OS=Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
>>>
>>> Your extensive help will improve my research productivity.
>>>
>>> Thank you very much.
>>> Regards
>>> Bhamu
>>>
>> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>

