[Wien] Parallel execution of SCF cycle
Peter Blaha
peter.blaha at tuwien.ac.at
Tue Jan 31 19:10:11 CET 2023
You should have a definition of WIEN_MPIRUN like
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
in parallel_options. It seems your parallel_options file is incomplete; a complete one looks like this:
-----------------------
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
setenv CORES_PER_NODE 1
---------------------------------
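At run time the parallel scripts replace _NP_, _HOSTS_ and _EXEC_ with the
actual number of mpi processes, the machine file and the executable plus its
def file. With the definition above the launched command then looks roughly
like this (the path and file names below are only an illustration, not taken
from your job):
-----------------------
# template from parallel_options:
#   mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
# expanded for a 16-core mpi lapw1 job (illustrative):
mpirun -np 16 -machinefile .machine1 /path/to/wien2k/lapw1_mpi lapw1_1.def
---------------------------------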
Also: it does NOT make sense to use mpi parallelization with 16 cores
AND set omp_global:16.
This directs the code to spawn 16*16 = 256 threads on one 16-core node.
The omp_xxx lines are not mandatory; by default the code uses 1 OMP thread,
or the value set in .bashrc (at least if you ran userconfig_lapw).
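If you want to combine mpi and OpenMP on one 16-core node, split the cores
instead, e.g. 8 mpi processes times 2 OMP threads. A minimal .machines sketch
(hostname copied from your file; whether 8x2 or 16x1 is faster depends on the
case and has to be tested):
-----------------------
omp_global:2
1:sqg1cintr20.bullx:8
granularity:1
extrafine:1
lapw0: sqg1cintr20.bullx:8
dstart: sqg1cintr20.bullx:8
nlvdw: sqg1cintr20.bullx:8
---------------------------------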
-----------------------------
About the tmpmach error:
What does your .processes file look like?
It could be that this is related to your parallel_options, but it could
also be your machine name.
I have never had a "." in a machine name, as in your sqg1cintr20.bullx
(and I can hardly test it ...)
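To check, please post the relevant files from the case directory, e.g.:
-----------------------
cat .machines
cat .processes     # written during lapw1para and read by lapw2para
ls -la *.error     # shows which error files actually exist
---------------------------------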
On 31.01.2023 at 18:55, Calum Cunningham wrote:
> Thanks for your quick response Laurence.
>
> “cat $WIENROOT/parallel_options” gave the following output:
>
> setenv TASKSET "no"
> if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
> if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv DELAY 0.1
> setenv SLEEPY 1
>
> I believe Intel MPI is the default on the cluster we are using. I double
> checked by loading in the Intel_mpi module and re-running, but I still
> encounter the same errors as before.
>
> Also, I have now modified the .machines file as you suggested (see
> below), but the same errors still arise. (Note: I am aware that choosing
> 16 cores for each omp_XXX may not be optimal for speed, but for now I am
> just testing if it works)
>
> # .machines file for Wien2k
> #
> omp_global:16
> omp_lapw1:16
> omp_lapw2:16
> 1:sqg1cintr20.bullx:16
> granularity:1
> extrafine:1
> lapw0: sqg1cintr20.bullx:16
> dstart: sqg1cintr20.bullx:16
> nlvdw: sqg1cintr20.bullx:16
>
> *From:*Wien <wien-bounces at zeus.theochem.tuwien.ac.at> *On Behalf Of
> *Laurence Marks
> *Sent:* 31 January 2023 17:11
> *To:* A Mailing list for WIEN2k users <wien at zeus.theochem.tuwien.ac.at>
> *Subject:* Re: [Wien] Parallel execution of SCF cycle
>
> Please do "cat $WIENROOT/parallel_options", as I suspect you have an
> issue there.
>
> Do you have a "normal" mpirun or does your cluster require something
> different?
>
> Which mpirun are you using?
>
> Also, I doubt you need "lapw2_vector_split:2", and you do not appear to
> have set the "omp_XXX" variables which are needed for recent versions.
>
> On Tue, Jan 31, 2023 at 10:59 AM Calum Cunningham
> <Calum.Cunningham at uknnl.com> wrote:
>
> Dear WIEN2k users,
>
> My colleagues and I are having some trouble running SCF calculations
> in parallel mode. I have had no issues when working in serial mode.
> We are using version 21.1 on a computer cluster that operates the
> LSF queuing system.
>
> As an example, I will explain my attempt to run a parallel execution
> for the TiO2 (rutile) test case. I am using the default values of
> RKmax, k-points, VXC, etc.
>
> The .machines file was created using a bespoke script that updates
> the names of the processors being used for the current job. In this
> case, I am using 16 cores on a single node. The .machines file is below:
>
> # .machines file for Wien2k
> #
> 1:sqg1cintr16.bullx:16
> granularity:1
> extrafine:1
> lapw0: sqg1cintr16.bullx:16
> dstart: sqg1cintr16.bullx:16
> nlvdw: sqg1cintr16.bullx:16
> lapw2_vector_split:2
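>
> For context, under LSF such a file can be generated from the $LSB_MCPU_HOSTS
> variable that the queuing system sets for the job ("host1 n1 host2 n2 ...").
> A rough, untested csh sketch for the single-node case (not the actual
> bespoke script):
>
> #!/bin/csh -f
> # build a minimal .machines from LSF's host/core list (single node assumed)
> set hostlist = ($LSB_MCPU_HOSTS)
> set host = $hostlist[1]
> set ncpu = $hostlist[2]
> echo "1:${host}:${ncpu}"       >  .machines
> echo "granularity:1"           >> .machines
> echo "extrafine:1"             >> .machines
> echo "lapw0: ${host}:${ncpu}"  >> .machines
> echo "dstart: ${host}:${ncpu}" >> .machines
> echo "nlvdw: ${host}:${ncpu}"  >> .machines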
>
> After initialising the calculation interactively via the w2web GUI
> (i.e. not in parallel), I attempted to execute the SCF cycle in
> w2web with the parallel option selected. I received the following
> error in STDOUT:
>
> LAPW0 END
> [1] Done    mpirun -np 16 /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw0_mpi lapw0.def >> .time00
> LAPW1 END
> [1] + Done    ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> tmpmach: Subscript out of range.
> grep: lapw2*.error: No such file or directory
> >   stop error
>
> Note that I consistently receive this “grep: lapw2*.error” error
> when attempting to run SCF calculations in parallel! After this, I
> tested each of lapw0, lapw1 and lapw2 as single programmes (in
> parallel) to try to fix the problem. I think that lapw1 ran
> correctly, but I have given the output below just in case there is a
> problem here. There is, however, an obvious error when lapw2 is
> executed (see below).
>
> starting parallel lapw1 at Tue Jan 31 15:00:07 GMT 2023
> ->  starting parallel LAPW1 jobs at Tue Jan 31 15:00:07 GMT 2023
> running LAPW1 in parallel mode (using .machines)
> granularity set to 1 because of nonlocal SCRATCH variable
> 1 number_of_parallel_jobs
> [1] 46212
> LAPW1 END
> [1] + Done    ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> (70) 0.011u 0.027s 0:14.52 0.2% 0+0k 0+8io 0pf+0w
> Summary of lapw1para:
> sqg1cintr16.bullx k= user= wallclock=
> 0.100u 0.299s 0:16.85 2.3% 0+0k 616+248io 0pf+0w
>
> #lapw2 as a single programme (parallel):
> running LAPW2 in parallel mode
> tmpmach: Subscript out of range.
> 0.016u 0.043s 0:00.06 83.3% 0+0k 32+24io 0pf+0w
> error: command   /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw2para lapw2.def   failed
>
> Please let me know if you need any more information. I would
> particularly like to know why the errors are occurring at lapw2
> (e.g. what is the “tmpmach” error?)
>
> Many thanks,
>
> Calum Cunningham
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
>
> --
>
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought", Albert Szent-Györgyi
>
--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300
Email: peter.blaha at tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------