[Wien] Need extensive help for a job file for slurm job scheduler cluster
Laurence Marks
laurence.marks at gmail.com
Fri Nov 13 11:37:10 CET 2020
N.B., example mid-term questions:
1. What SBATCH command will give you 3 nodes?
2. What command creates your .machines file?
3. What are your fastest and slowest nodes?
4. Which nodes have the best communications?
N.B., please don't post your answers -- just understand!
_____
Professor Laurence Marks
"Research is to see what everybody else has seen, and to think what nobody
else has thought", Albert Szent-Gyorgi
www.numis.northwestern.edu
On Fri, Nov 13, 2020, 04:21 Laurence Marks <laurence.marks at gmail.com> wrote:
> Much of what you are requesting is problem/cluster specific, so there is
> no magic answer -- it will vary. Suggestions:
> 1) Read the UG sections on .machines and parallel operation.
> 2) Read the man page for your cluster job command (srun).
> 3) Reread the UG sections.
> 4) Read the example scripts, and understand (lookup) all the commands so
> you know what they are doing.
>
> It is really not that complicated. If you cannot master this by yourself,
> I will wonder whether you are in the right profession.
>
> On Fri, Nov 13, 2020, 03:24 Dr. K. C. Bhamu <kcbhamu85 at gmail.com> wrote:
>
>> Dear All
>>
>> I need your extensive help.
>> I have tried to provide full details that can help you understand my
>> requirement. In case I have missed something, please let me know.
>>
>> I am looking for a job file for our cluster. The job files available in
>> the FAQs are not working. They produce only the files
>> .machine0, .machines and .machines_current,
>> where .machines contains only a # and the other two are empty.
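A .machines file that holds only a comment line usually means the job script never wrote any host lines into it. As a minimal sketch, not a tested recipe for this cluster, the fragment below shows one way a job script could build a k-point-parallel .machines file from the Slurm allocation. The function name make_machines is hypothetical, and using the same number of cores on every host is an assumption. `scontrol show hostnames` is the standard Slurm command that expands the allocated node list to one hostname per line, and the `1:host` lines follow the k-point-parallel layout described in the WIEN2k User's Guide.

```shell
#!/bin/sh
# Hypothetical helper: print a k-point-parallel .machines file,
# one "1:host" line per core for each hostname read from stdin.
# $1 = number of cores to use per host (assumed equal on every host).
make_machines() {
    cores_per_host=$1
    echo "# .machines generated from the Slurm allocation"
    while read -r host; do
        [ -n "$host" ] || continue        # skip empty lines
        i=0
        while [ "$i" -lt "$cores_per_host" ]; do
            echo "1:$host"
            i=$((i + 1))
        done
    done
    echo "granularity:1"
    echo "extrafine:1"
}

# Inside a running job Slurm sets SLURM_JOB_NODELIST; outside a job this
# branch is skipped, so the script can be checked on a login node.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    # SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node was requested;
    # falling back to 1 core per host is an assumption of this sketch.
    scontrol show hostnames "$SLURM_JOB_NODELIST" \
        | make_machines "${SLURM_NTASKS_PER_NODE:-1}" > .machines
fi
```

With a file like this in the case directory, the parallel run itself would be started with `run_lapw -p` rather than through srun. For MPI parallelism the host lines take a different form, so the User's Guide section on .machines remains the reference.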
>>
>> The script below works fine for Quantum Espresso on the 44core
>> partition:
>> #!/bin/sh
>> #SBATCH -J test #job name
>> #SBATCH -p 44core #partition name
>> #SBATCH -N 1 #node
>> #SBATCH -n 18 #core
>> #SBATCH -o %x.o%j
>> #SBATCH -e %x.e%j
>> export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so #Do not change here!!
>> srun ~/soft/qe66/bin/pw.x < case.in > case.out
>>
>> I have compiled Wien2k_19.2 on a CentOS cluster whose head node runs
>> kernel Linux 3.10.0-1127.19.1.el7.x86_64.
>>
>> I used compilers_and_libraries_2020.2.254, fftw-3.3.8 and libxc-4.34 for
>> the installation.
>>
>> The details of the nodes that I can use are as follows (I can log in to
>> these nodes with my user password):
>> NODELIST  NODES  PARTITION  STATE      CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
>> elpidos   1      master     idle       4     4:1:1   15787   0         1       (null)    none
>> node01    1      72core     allocated  72    72:1:1  515683  0         1       (null)    none
>> node02    1      72core     allocated  72    72:1:1  257651  0         1       (null)    none
>> node03    1      72core     allocated  72    72:1:1  257651  0         1       (null)    none
>> node09    1      44core     mixed      44    44:1:1  128650  0         1       (null)    none
>> node10    1      44core     mixed      44    44:1:1  128649  0         1       (null)    none
>> node11    1      52core*    allocated  52    52:1:1  191932  0         1       (null)    none
>> node12    1      52core*    allocated  52    52:1:1  191932  0         1       (null)    none
>>
>> The other nodes run a mixture of kernel versions, as listed below.
>>
>> OS=Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020
>> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>> OS=Linux 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016
>> OS=Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
>>
>> Your extensive help will improve my research productivity.
>>
>> Thank you very much.
>> Regards
>> Bhamu
>>
>> *Full details of the nodes are here:*
>>
>> NodeName=elpidos Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=0 CPUTot=4 CPULoad=0.06
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.250 NodeHostName=elpidos Version=20.02.3
>> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>> RealMemory=15787 AllocMem=0 FreeMem=5597 Sockets=4 Boards=1
>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=master
>> BootTime=2020-10-13T14:25:13 SlurmdStartTime=2020-10-13T14:25:26
>> CfgTRES=cpu=4,mem=15787M,billing=4
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node01 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=72 CPUTot=72 CPULoad=72.00
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.1 NodeHostName=node01 Version=20.02.3
>> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>> RealMemory=515683 AllocMem=0 FreeMem=363362 Sockets=72 Boards=1
>> State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=72core
>> BootTime=2020-10-13T20:44:04 SlurmdStartTime=2020-10-14T05:44:23
>> CfgTRES=cpu=72,mem=515683M,billing=72
>> AllocTRES=cpu=72
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node02 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=72 CPUTot=72 CPULoad=71.92
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.2 NodeHostName=node02 Version=20.02.3
>> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>> RealMemory=257651 AllocMem=0 FreeMem=142057 Sockets=72 Boards=1
>> State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=72core
>> BootTime=2020-10-13T20:44:04 SlurmdStartTime=2020-10-14T05:44:17
>> CfgTRES=cpu=72,mem=257651M,billing=72
>> AllocTRES=cpu=72
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node03 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=72 CPUTot=72 CPULoad=71.96
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.3 NodeHostName=node03 Version=20.02.3
>> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>> RealMemory=257651 AllocMem=0 FreeMem=168118 Sockets=72 Boards=1
>> State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=72core
>> BootTime=2020-10-13T20:44:33 SlurmdStartTime=2020-10-14T05:43:35
>> CfgTRES=cpu=72,mem=257651M,billing=72
>> AllocTRES=cpu=72
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node04 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=0 CPUTot=20 CPULoad=0.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.4 NodeHostName=node04 Version=20.02.3
>> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>> RealMemory=128664 AllocMem=0 FreeMem=126677 Sockets=20 Boards=1
>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=20core
>> BootTime=2020-10-13T20:43:24 SlurmdStartTime=2020-10-14T05:42:43
>> CfgTRES=cpu=20,mem=128664M,billing=20
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node05 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=0 CPUTot=4 CPULoad=0.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.5 NodeHostName=node05 Version=20.02.3
>> OS=Linux 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020
>> RealMemory=64190 AllocMem=0 FreeMem=63350 Sockets=4 Boards=1
>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=4core
>> BootTime=2020-10-29T11:34:18 SlurmdStartTime=2020-10-29T11:34:30
>> CfgTRES=cpu=4,mem=64190M,billing=4
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node06 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=0 CPUTot=4 CPULoad=0.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.6 NodeHostName=node06 Version=20.02.3
>> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>> RealMemory=64190 AllocMem=0 FreeMem=63084 Sockets=4 Boards=1
>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=4core
>> BootTime=2020-10-19T11:07:32 SlurmdStartTime=2020-10-19T11:07:51
>> CfgTRES=cpu=4,mem=64190M,billing=4
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node07 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=0 CPUTot=64 CPULoad=0.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.7 NodeHostName=node07 Version=20.02.3
>> OS=Linux 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016
>> RealMemory=80241 AllocMem=0 FreeMem=75316 Sockets=64 Boards=1
>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=64core
>> BootTime=2020-10-13T20:52:40 SlurmdStartTime=2020-10-13T21:10:59
>> CfgTRES=cpu=64,mem=80241M,billing=64
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node08 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=0 CPUTot=64 CPULoad=0.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.8 NodeHostName=node08 Version=20.02.3
>> OS=Linux 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016
>> RealMemory=47987 AllocMem=0 FreeMem=42188 Sockets=64 Boards=1
>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=64core
>> BootTime=2020-10-13T20:51:08 SlurmdStartTime=2020-10-13T20:57:12
>> CfgTRES=cpu=64,mem=47987M,billing=64
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node09 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=36 CPUTot=44 CPULoad=35.99
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.9 NodeHostName=node09 Version=20.02.3
>> OS=Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
>> RealMemory=128650 AllocMem=0 FreeMem=78059 Sockets=44 Boards=1
>> State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=44core
>> BootTime=2020-10-13T20:47:11 SlurmdStartTime=2020-10-13T20:47:29
>> CfgTRES=cpu=44,mem=128650M,billing=44
>> AllocTRES=cpu=36
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node10 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=18 CPUTot=44 CPULoad=18.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.10 NodeHostName=node10 Version=20.02.3
>> OS=Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
>> RealMemory=128649 AllocMem=0 FreeMem=82279 Sockets=44 Boards=1
>> State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=44core
>> BootTime=2020-10-13T20:47:36 SlurmdStartTime=2020-10-13T20:48:00
>> CfgTRES=cpu=44,mem=128649M,billing=44
>> AllocTRES=cpu=18
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node11 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=52 CPUTot=52 CPULoad=52.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.11 NodeHostName=node11 Version=20.02.3
>> OS=Linux 3.10.0-1127.13.1.el7.x86_64 #1 SMP Tue Jun 23 15:46:38 UTC 2020
>> RealMemory=191932 AllocMem=0 FreeMem=147904 Sockets=52 Boards=1
>> State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=52core
>> BootTime=2020-10-13T20:47:02 SlurmdStartTime=2020-10-13T20:47:13
>> CfgTRES=cpu=52,mem=191932M,billing=52
>> AllocTRES=cpu=52
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node12 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=52 CPUTot=52 CPULoad=52.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.12 NodeHostName=node12 Version=20.02.3
>> OS=Linux 3.10.0-1127.13.1.el7.x86_64 #1 SMP Tue Jun 23 15:46:38 UTC 2020
>> RealMemory=191932 AllocMem=0 FreeMem=162998 Sockets=52 Boards=1
>> State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=52core
>> BootTime=2020-10-13T20:47:31 SlurmdStartTime=2020-10-13T20:47:42
>> CfgTRES=cpu=52,mem=191932M,billing=52
>> AllocTRES=cpu=52
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>
>>
>> NodeName=node13 Arch=x86_64 CoresPerSocket=1
>> CPUAlloc=0 CPUTot=4 CPULoad=0.01
>> AvailableFeatures=(null)
>> ActiveFeatures=(null)
>> Gres=(null)
>> NodeAddr=10.0.0.13 NodeHostName=node13 Version=20.02.3
>> OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>> RealMemory=31836 AllocMem=0 FreeMem=31093 Sockets=4 Boards=1
>> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>> Partitions=4core
>> BootTime=2020-10-13T20:48:12 SlurmdStartTime=2020-10-13T20:48:20
>> CfgTRES=cpu=4,mem=31836M,billing=4
>> AllocTRES=
>> CapWatts=n/a
>> CurrentWatts=0 AveWatts=0
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>