[Wien] Parallel execution of SCF cycle
Peter Blaha
peter.blaha at tuwien.ac.at
Tue Jan 31 19:10:11 CET 2023
You should have a definition of WIEN_MPIRUN like
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
in parallel_options. It seems your parallel_options file is incomplete; a complete one looks like this:
-----------------------
setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1
setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
setenv CORES_PER_NODE 1
---------------------------------
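At run time the parallel scripts replace _NP_, _HOSTS_ and _EXEC_ with the
actual number of mpi processes, the machine file and the executable plus its
def file. With the definition above the launched command then looks roughly
like this (the path and file names below are only an illustration, not taken
from your job):
-----------------------
# template from parallel_options:
#   mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
# expanded for a 16-core mpi lapw1 job (illustrative):
mpirun -np 16 -machinefile .machine1 /path/to/wien2k/lapw1_mpi lapw1_1.def
---------------------------------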
Also: it does NOT make sense to use mpi parallelization with 16 cores
AND set omp_global:16.
This directs the code to spawn 16*16 = 256 threads on one 16-core node.
The omp_xxx lines are not mandatory; by default the code uses 1 OMP thread,
or the value set in .bashrc (at least if you ran userconfig_lapw).
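If you want to combine mpi and OpenMP on one 16-core node, split the cores
instead, e.g. 8 mpi processes times 2 OMP threads. A minimal .machines sketch
(hostname copied from your file; whether 8x2 or 16x1 is faster depends on the
case and has to be tested):
-----------------------
omp_global:2
1:sqg1cintr20.bullx:8
granularity:1
extrafine:1
lapw0: sqg1cintr20.bullx:8
dstart: sqg1cintr20.bullx:8
nlvdw: sqg1cintr20.bullx:8
---------------------------------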
-----------------------------
About the tmpmach error:
What does your .processes file look like?
It could be that this is related to your parallel_options, but it could
also be your machine name.
I have never had a "." in a machine name, as in your sqg1cintr20.bullx
(and I can hardly test it ...)
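To check, please post the relevant files from the case directory, e.g.:
-----------------------
cat .machines
cat .processes     # written during lapw1para and read by lapw2para
ls -la *.error     # shows which error files actually exist
---------------------------------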
On 31.01.2023 at 18:55, Calum Cunningham wrote:
> Thanks for your quick response Laurence.
>
> “cat $WIENROOT/parallel_options” gave the following output:
>
> setenv TASKSET "no"
> if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
> if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv DELAY 0.1
> setenv SLEEPY 1
>
> I believe Intel MPI is the default on the cluster we are using. I double
> checked by loading in the Intel_mpi module and re-running, but I still
> encounter the same errors as before.
>
> Also, I have now modified the .machines file as you suggested (see
> below), but the same errors still arise. (Note: I am aware that choosing
> 16 cores for each omp_XXX may not be optimal for speed, but for now I am
> just testing if it works)
>
> # .machines file for Wien2k
> #
> omp_global:16
> omp_lapw1:16
> omp_lapw2:16
> 1:sqg1cintr20.bullx:16
> granularity:1
> extrafine:1
> lapw0: sqg1cintr20.bullx:16
> dstart: sqg1cintr20.bullx:16
> nlvdw: sqg1cintr20.bullx:16
>
> *From:*Wien <wien-bounces at zeus.theochem.tuwien.ac.at> *On Behalf Of
> *Laurence Marks
> *Sent:* 31 January 2023 17:11
> *To:* A Mailing list for WIEN2k users <wien at zeus.theochem.tuwien.ac.at>
> *Subject:* Re: [Wien] Parallel execution of SCF cycle
>
> Please do "cat $WIENROOT/parallel_options", as I suspect you have an
> issue there.
>
> Do you have a "normal" mpirun or does your cluster require something
> different?
>
> Which mpirun are you using?
>
> Also, I doubt you need "lapw2_vector_split:2", and you do not appear to
> have set the "omp_XXX" variables which are needed for recent versions.
>
> On Tue, Jan 31, 2023 at 10:59 AM Calum Cunningham
> <Calum.Cunningham at uknnl.com> wrote:
>
> Dear WIEN2k users,
>
> My colleagues and I are having some trouble running SCF calculations
> in parallel mode. I have had no issues when working in serial mode.
> We are using version 21.1 on a computer cluster that operates the
> LSF queuing system.
>
> As an example, I will explain my attempt to run a parallel execution
> for the TiO2 (rutile) test case. I am using the default values of
> RKmax, k-points, VXC, etc.
>
> The .machines file was created using a bespoke script that updates
> the names of the processors being used for the current job. In this
> case, I am using 16 cores on a single node. The .machines file is below:
>
> # .machines file for Wien2k
> #
> 1:sqg1cintr16.bullx:16
> granularity:1
> extrafine:1
> lapw0: sqg1cintr16.bullx:16
> dstart: sqg1cintr16.bullx:16
> nlvdw: sqg1cintr16.bullx:16
> lapw2_vector_split:2
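>
> For context, under LSF such a file can be generated from the $LSB_MCPU_HOSTS
> variable that the queuing system sets for the job ("host1 n1 host2 n2 ...").
> A rough, untested csh sketch for the single-node case (not the actual
> bespoke script):
>
> #!/bin/csh -f
> # build a minimal .machines from LSF's host/core list (single node assumed)
> set hostlist = ($LSB_MCPU_HOSTS)
> set host = $hostlist[1]
> set ncpu = $hostlist[2]
> echo "1:${host}:${ncpu}"       >  .machines
> echo "granularity:1"           >> .machines
> echo "extrafine:1"             >> .machines
> echo "lapw0: ${host}:${ncpu}"  >> .machines
> echo "dstart: ${host}:${ncpu}" >> .machines
> echo "nlvdw: ${host}:${ncpu}"  >> .machines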
>
> After initialising the calculation interactively via the w2web GUI
> (i.e. not in parallel), I attempted to execute the SCF cycle in
> w2web with the parallel option selected. I received the following
> error in STDOUT:
>
> LAPW0 END
> [1] Done    mpirun -np 16 /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw0_mpi lapw0.def >> .time00
> LAPW1 END
> [1] + Done    ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> tmpmach: Subscript out of range.
> grep: lapw2*.error: No such file or directory
> >   stop error
>
> Note that I consistently receive this “grep: lapw2*.error” error
> when attempting to run SCF calculations in parallel! After this, I
> tested each of lapw0, lapw1 and lapw2 as single programmes (in
> parallel) to try to fix the problem. I think that lapw1 ran
> correctly, but I have given the output below just in case there is a
> problem here. There is, however, an obvious error when lapw2 is
> executed (see below).
>
> starting parallel lapw1 at Tue Jan 31 15:00:07 GMT 2023
> ->  starting parallel LAPW1 jobs at Tue Jan 31 15:00:07 GMT 2023
> running LAPW1 in parallel mode (using .machines)
> granularity set to 1 because of nonlocal SCRATCH variable
> 1 number_of_parallel_jobs
> [1] 46212
> LAPW1 END
> [1] + Done    ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> (70) 0.011u 0.027s 0:14.52 0.2% 0+0k 0+8io 0pf+0w
> Summary of lapw1para:
> sqg1cintr16.bullx k= user= wallclock=
> 0.100u 0.299s 0:16.85 2.3% 0+0k 616+248io 0pf+0w
>
> #lapw2 as a single programme (parallel):
> running LAPW2 in parallel mode
> tmpmach: Subscript out of range.
> 0.016u 0.043s 0:00.06 83.3% 0+0k 32+24io 0pf+0w
> error: command   /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw2para lapw2.def   failed
>
> Please let me know if you need any more information. I would
> particularly like to know why the errors are occurring at lapw2
> (e.g. what is the “tmpmach” error?)
>
> Many thanks,
>
> Calum Cunningham
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
>
> --
>
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought", Albert Szent-Györgyi
>
--
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300
Email: peter.blaha at tuwien.ac.at WIEN2k: http://www.wien2k.at
WWW: http://www.imc.tuwien.ac.at
-------------------------------------------------------------------------