[Wien] Parallel execution of SCF cycle
Calum Cunningham
Calum.Cunningham at uknnl.com
Wed Feb 1 10:08:11 CET 2023
Thank you Peter - do you have any advice on how we can fix the parallel_options issue? I have attempted manually editing the file with your example below but this has resulted in more errors (probably as expected?).
Is it likely that this is an installation/configuration issue?
_________________________
Thanks for the advice on omp_global. I rushed the editing yesterday, but I will change it now.
_________________________
Here is the .processes file:
init:
1 : : 70 : 16 : 1 : 0
-----Original Message-----
From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> On Behalf Of Peter Blaha
Sent: 31 January 2023 18:10
To: wien at zeus.theochem.tuwien.ac.at
Subject: Re: [Wien] Parallel execution of SCF cycle
You should have a definition of WIEN_MPIRUN in parallel_options, like
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
It seems that your parallel_options is not complete; it should look like this:
-----------------------
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
setenv CORES_PER_NODE 1
---------------------------------
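One way to check whether an installed parallel_options is missing settings is to compare the names it sets with setenv against a reference list. The sketch below is a minimal illustration in Python and is not part of WIEN2k: the function name missing_settings is hypothetical, and the REQUIRED list is an assumption taken from the example file above, not an official list.

```python
import re

# Assumption: the variable names in the example file above are treated as
# the reference set; WIEN2k itself does not ship such a list.
REQUIRED = [
    "TASKSET", "USE_REMOTE", "MPI_REMOTE", "WIEN_GRANULARITY",
    "DELAY", "SLEEPY", "WIEN_MPIRUN", "CORES_PER_NODE",
]

# Matches "setenv NAME ..." anywhere on a line, so csh one-liners such as
# "if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1" are also picked up.
SETENV = re.compile(r"\bsetenv\s+([A-Za-z_][A-Za-z0-9_]*)")


def missing_settings(text, required=REQUIRED):
    """Return the names in `required` that `text` never sets via setenv."""
    defined = set()
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip whole-line csh comments
        defined.update(SETENV.findall(line))
    return [name for name in required if name not in defined]


# The parallel_options content quoted further down in this thread.
reported = """\
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
"""

print(missing_settings(reported))  # ['WIEN_MPIRUN', 'CORES_PER_NODE']
```

In practice the text would be read from $WIENROOT/parallel_options instead of a literal string.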
Also: It does NOT make sense to use mpi-parallelization with 16 cores
AND set omp_global:16
This directs the code to spawn 16*16 threads on one node.
omp_xxx is not mandatory; by default it takes the value 1 or the value set
in .bashrc (at least if you ran userconfig_lapw).
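To make the arithmetic explicit: each MPI process starts its own set of OpenMP threads, so the two counts multiply. A minimal sketch in Python (the function name total_threads is hypothetical; the figures are the ones from this thread):

```python
def total_threads(mpi_processes, omp_threads):
    """Threads started on one node when every MPI process runs
    `omp_threads` OpenMP threads."""
    return mpi_processes * omp_threads


cores = 16  # cores on the single node used in this thread

for omp in (16, 1):
    threads = total_threads(16, omp)
    print(f"omp_global:{omp} -> {threads} threads on {cores} cores "
          f"({threads / cores:g}x the core count)")
```

With 16 MPI processes, omp_global:1 keeps the thread count equal to the number of cores, while omp_global:16 oversubscribes the node sixteen-fold.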
-----------------------------
About the tmpmach error:
What does your .processes file look like?
It could be this is related to your parallel_options, but could also be
your machinename.
I never had a "." in the machine name like your sqg1cintr20.bullx
(and I can hardly test it ...)
Am 31.01.2023 um 18:55 schrieb Calum Cunningham:
> Thanks for your quick response Laurence.
>
> "cat $WIENROOT/parallel_options" gave the following output:
>
> setenv TASKSET "no"
>
> if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
>
> if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
>
> setenv WIEN_GRANULARITY 1
>
> setenv DELAY 0.1
>
> setenv SLEEPY 1
>
> I believe Intel MPI is the default on the cluster we are using. I
> double checked by loading in the Intel_mpi module and re-running, but
> I still encounter the same errors as before.
>
> Also, I have now modified the .machines file as you suggested (see
> below), but the same errors still arise. (Note: I am aware that
> choosing 16 cores for each omp_XXX may not be optimal for speed, but
> for now I am just testing if it works.)
>
> # .machines file for Wien2k
>
> #
>
> omp_global:16
>
> omp_lapw1:16
>
> omp_lapw2:16
>
> 1:sqg1cintr20.bullx:16
>
> granularity:1
>
> extrafine:1
>
> lapw0: sqg1cintr20.bullx:16
>
> dstart: sqg1cintr20.bullx:16
>
> nlvdw: sqg1cintr20.bullx:16
>
> From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> On Behalf Of Laurence Marks
> Sent: 31 January 2023 17:11
> To: A Mailing list for WIEN2k users <wien at zeus.theochem.tuwien.ac.at>
> Subject: Re: [Wien] Parallel execution of SCF cycle
>
> Please do "cat $WIENROOT/parallel_options", as I suspect you have an
> issue there.
>
> Do you have a "normal" mpirun or does your cluster require something
> different?
>
> Which mpirun are you using?
>
> Also, I doubt you need "lapw2_vector_split:2", and you do not appear
> to have set the "omp_XXX" variables which are needed for recent versions.
>
> On Tue, Jan 31, 2023 at 10:59 AM Calum Cunningham
> <Calum.Cunningham at uknnl.com> wrote:
>
> Dear WIEN2k users,
>
> My colleagues and I are having some trouble running SCF calculations
> in parallel mode. I have had no issues when working in serial mode.
> We are using version 21.1 on a computer cluster that operates the
> LSF queuing system.
>
> As an example, I will explain my attempt to run a parallel execution
> for the TiO2 (rutile) test case. I am using the default values of
> RKmax, k-points, VXC, etc.
>
> The .machines file was created using a bespoke script that updates
> the names of the processors being used for the current job. In this
> case, I am using 16 cores on a single node. The .machines file is below:
>
> # .machines file for Wien2k
>
> #
>
> 1:sqg1cintr16.bullx:16
>
> granularity:1
>
> extrafine:1
>
> lapw0: sqg1cintr16.bullx:16
>
> dstart: sqg1cintr16.bullx:16
>
> nlvdw: sqg1cintr16.bullx:16
>
> lapw2_vector_split:2
>
> After I initialise the calculation interactively via the w2web GUI
> (i.e. not in parallel), I attempted to execute the SCF cycle in
> w2web with the parallel option selected. I received the following
> error in STDOUT:
>
> LAPW0 END
>
> [1] Done mpirun -np 16
> /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw0_mpi lapw0.def
> >> .time00
>
> LAPW1 END
>
> [1] + Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>
> tmpmach: Subscript out of range.
>
> grep: lapw2*.error: No such file or directory
>
> > stop error
>
> Note that I consistently receive this "grep: lapw2*.error" error
> when attempting to run SCF calculations in parallel! After this, I
> tested each of lapw0, lapw1 and lapw2 as single programmes (in
> parallel) to try to fix the problem. I think that lapw1 ran
> correctly, but I have given the output below just in case there is a
> problem here. There is, however, an obvious error when lapw2 is
> executed (see below).
>
> starting parallel lapw1 at Tue Jan 31 15:00:07 GMT 2023
>
> -> starting parallel LAPW1 jobs at Tue Jan 31 15:00:07 GMT 2023
>
> running LAPW1 in parallel mode (using .machines)
>
> granularity set to 1 because of nonlocal SCRATCH variable
>
> 1 number_of_parallel_jobs
>
> [1] 46212
>
> LAPW1 END
>
> [1] + Done ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>
> (70) 0.011u 0.027s 0:14.52 0.2% 0+0k 0+8io 0pf+0w
>
> Summary of lapw1para:
>
> sqg1cintr16.bullx k= user= wallclock=
>
> 0.100u 0.299s 0:16.85 2.3% 0+0k 616+248io 0pf+0w
>
> #lapw2 as a single programme (parallel):
>
> running LAPW2 in parallel mode
>
> tmpmach: Subscript out of range.
>
> 0.016u 0.043s 0:00.06 83.3% 0+0k 32+24io 0pf+0w
>
> error: command
> /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw2para
> lapw2.def failed
>
> Please let me know if you need any more information. I would
> particularly like to know why the errors are occurring at lapw2
> (e.g. what is the "tmpmach" error?)
>
> Many thanks,
>
> Calum Cunningham
>
> This e-mail is from the National Nuclear Laboratory Limited (NNL).
> This e-mail and any attachments are intended for the addressee and
> may also be legally privileged. If you are not the intended
> recipient please do not print, re-transmit, store or act in reliance
> on it or any attachments. Instead, please e-mail it back to the
> sender and then immediately permanently delete it. National Nuclear
> Laboratory Limited (Company Number 3857752) Registered in England
> and Wales. Registered office Chadwick House, Warrington Road,
> Birchwood Park, Warrington, WA3 6AE.
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
>
> --
>
> Professor Laurence Marks
> Department of Materials Science and Engineering Northwestern
> University
> http://www.numis.northwestern.edu/
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought", Albert Szent-Györgyi
>
--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300
Email: peter.blaha at tuwien.ac.at WIEN2k: http://www.wien2k.at/
WWW: http://www.imc.tuwien.ac.at/
-------------------------------------------------------------------------