[Wien] Parallel execution of SCF cycle

Calum Cunningham Calum.Cunningham at uknnl.com
Tue Jan 31 18:55:52 CET 2023


Thanks for your quick response, Laurence.

"cat $WIENROOT/parallel_options" gave the following output:

setenv TASKSET "no"
if ( ! $?USE_REMOTE ) setenv USE_REMOTE 1
if ( ! $?MPI_REMOTE ) setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv DELAY 0.1
setenv SLEEPY 1

I believe Intel MPI is the default on the cluster we are using. I double-checked by loading the Intel_mpi module and re-running, but I still encounter the same errors as before.

Also, I have now modified the .machines file as you suggested (see below), but the same errors still arise. (Note: I am aware that using 16 cores for each omp_XXX setting may not be optimal for speed, but for now I am just testing whether it works.)

# .machines file for Wien2k
#
omp_global:16
omp_lapw1:16
omp_lapw2:16

1:sqg1cintr20.bullx:16
granularity:1
extrafine:1

lapw0: sqg1cintr20.bullx:16

dstart: sqg1cintr20.bullx:16

nlvdw: sqg1cintr20.bullx:16



From: Wien <wien-bounces at zeus.theochem.tuwien.ac.at> On Behalf Of Laurence Marks
Sent: 31 January 2023 17:11
To: A Mailing list for WIEN2k users <wien at zeus.theochem.tuwien.ac.at>
Subject: Re: [Wien] Parallel execution of SCF cycle

Please do "cat $WIENROOT/parallel_options", as I suspect you have an issue there.
Do you have a "normal" mpirun or does your cluster require something different?
Which mpirun are you using?

Also, I doubt you need "lapw2_vector_split:2", and you do not appear to have set the "omp_XXX" variables which are needed for recent versions.

On Tue, Jan 31, 2023 at 10:59 AM Calum Cunningham <Calum.Cunningham at uknnl.com> wrote:
Dear WIEN2k users,

My colleagues and I are having some trouble running SCF calculations in parallel mode. I have had no issues when working in serial mode. We are using version 21.1 on a computer cluster that operates the LSF queuing system.

As an example, I will explain my attempt to run a parallel execution for the TiO2 (rutile) test case. I am using the default values of RKmax, k-points, VXC, etc.
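
(For reference, starting the SCF cycle in parallel mode from the command line rather than from w2web should be roughly equivalent to the command below.)

# parallel SCF cycle, driven by the .machines file in the case directory
run_lapw -p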

The .machines file was created using a bespoke script that updates the names of the processors being used for the current job (a rough sketch of such a generator is given after the file). In this case, I am using 16 cores on a single node. The .machines file is below:

# .machines file for Wien2k
#
1:sqg1cintr16.bullx:16
granularity:1
extrafine:1

lapw0: sqg1cintr16.bullx:16

dstart: sqg1cintr16.bullx:16

nlvdw: sqg1cintr16.bullx:16

lapw2_vector_split:2
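
(A rough, simplified sketch of what such a generator script could look like for a single-node LSF job is shown here; it assumes LSF exports LSB_MCPU_HOSTS as "hostname ncores" pairs, and it omits the lapw2_vector_split line.)

#!/bin/bash
# Hypothetical sketch only: build .machines for a single-node LSF allocation.
# LSB_MCPU_HOSTS normally contains "hostname ncores" pairs for the job.
read HOST NCORES <<< "$LSB_MCPU_HOSTS"

cat > .machines <<EOF
# .machines file for Wien2k
#
1:$HOST:$NCORES
granularity:1
extrafine:1

lapw0: $HOST:$NCORES

dstart: $HOST:$NCORES

nlvdw: $HOST:$NCORES
EOF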

After initialising the calculation interactively via the w2web GUI (i.e. not in parallel), I attempted to execute the SCF cycle in w2web with the parallel option selected. I received the following error in STDOUT:

LAPW0 END
[1]    Done                          mpirun -np 16 /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw0_mpi lapw0.def >> .time00
LAPW1 END
[1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
tmpmach: Subscript out of range.
grep: lapw2*.error: No such file or directory

>   stop error

Note that I consistently receive this "grep: lapw2*.error" error when attempting to run SCF calculations in parallel! After this, I tested each of lapw0, lapw1 and lapw2 as single programmes (in parallel) to try to fix the problem. I think that lapw1 ran correctly, but I have given the output below just in case there is a problem here. There is, however, an obvious error when lapw2 is executed (see below).
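
(The single-programme parallel tests correspond to commands along the following lines; the -p switch makes the x wrapper use the parallel scripts and the .machines file.)

x lapw0 -p
x lapw1 -p
x lapw2 -p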

starting parallel lapw1 at Tue Jan 31 15:00:07 GMT 2023
->  starting parallel LAPW1 jobs at Tue Jan 31 15:00:07 GMT 2023
running LAPW1 in parallel mode (using .machines)
granularity set to 1 because of nonlocal SCRATCH variable
1 number_of_parallel_jobs
[1] 46212
LAPW1 END
[1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
     (70) 0.011u 0.027s 0:14.52 0.2%  0+0k 0+8io 0pf+0w
   Summary of lapw1para:
   sqg1cintr16.bullx  k=  user=     wallclock=
0.100u 0.299s 0:16.85 2.3% 0+0k 616+248io 0pf+0w

#lapw2 as a single programme (parallel):
running LAPW2 in parallel mode
tmpmach: Subscript out of range.
0.016u 0.043s 0:00.06 83.3% 0+0k 32+24io 0pf+0w
error: command   /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw2para lapw2.def   failed


Please let me know if you need any more information. I would particularly like to know why the errors occur at lapw2 (e.g. what does the "tmpmach" error mean?).

Many thanks,
Calum Cunningham
_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Györgyi