[Wien] Parallel execution of SCF cycle

Tue Jan 31 17:59:13 CET 2023

Dear WIEN2k users,

My colleagues and I are having trouble running SCF calculations in parallel mode, although I have had no issues in serial mode. We are using WIEN2k version 21.1 on a computer cluster that uses the LSF queuing system.

As an example, I will explain my attempt to run a parallel execution for the TiO2 (rutile) test case. I am using the default values of RKmax, k-points, VXC, etc.

The .machines file was created using a bespoke script that updates the names of the processors being used for the current job. In this case, I am using 16 cores on a single node. The .machines file is below:

# .machines file for Wien2k
#
1:sqg1cintr16.bullx:16
granularity:1
extrafine:1

lapw0: sqg1cintr16.bullx:16

dstart: sqg1cintr16.bullx:16

nlvdw: sqg1cintr16.bullx:16

lapw2_vector_split:2
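
A minimal sketch of how a file with the layout above could be generated is given here for reference. It is not the bespoke script mentioned above: the function name build_machines and its arguments are hypothetical, the host name and core count are passed in by hand rather than read from LSF, and the lines it writes simply mirror the ones in this post.

```python
def build_machines(host: str, cores: int, vector_split: int = 2) -> str:
    """Return .machines text with the same layout as the file in this post.

    Assumption: one job line spanning `cores` cores on a single `host`,
    with the same host:cores entry reused for lapw0, dstart and nlvdw.
    """
    entry = f"{host}:{cores}"
    lines = [
        "# .machines file for Wien2k",
        "#",
        f"1:{entry}",          # one k-point job using all cores of the host
        "granularity:1",
        "extrafine:1",
        "",
        f"lapw0: {entry}",
        "",
        f"dstart: {entry}",
        "",
        f"nlvdw: {entry}",
        "",
        f"lapw2_vector_split:{vector_split}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Host name and core count taken from the example in this post.
    print(build_machines("sqg1cintr16.bullx", 16), end="")
```

Writing the returned string to a file named .machines in the case directory would reproduce the file shown above for this host.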

After initialising the calculation interactively via the w2web GUI (i.e. not in parallel), I attempted to execute the SCF cycle in w2web with the parallel option selected. I received the following error in STDOUT:

LAPW0 END
[1]    Done                          mpirun -np 16 /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw0_mpi lapw0.def >> .time00
LAPW1 END
[1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
tmpmach: Subscript out of range.
grep: lapw2*.error: No such file or directory

>   stop error

Note that I consistently receive this "grep: lapw2*.error" error when attempting to run SCF calculations in parallel. To try to isolate the problem, I then ran each of lapw0, lapw1 and lapw2 as single programmes (in parallel). I think that lapw1 ran correctly, but I have given its output below in case there is a problem there. There is, however, an obvious error when lapw2 is executed (see below).

starting parallel lapw1 at Tue Jan 31 15:00:07 GMT 2023
->  starting parallel LAPW1 jobs at Tue Jan 31 15:00:07 GMT 2023
running LAPW1 in parallel mode (using .machines)
granularity set to 1 because of nonlocal SCRATCH variable
1 number_of_parallel_jobs
[1] 46212
LAPW1 END
[1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
     (70) 0.011u 0.027s 0:14.52 0.2%  0+0k 0+8io 0pf+0w
   Summary of lapw1para:
   sqg1cintr16.bullx  k=  user=     wallclock=
0.100u 0.299s 0:16.85 2.3% 0+0k 616+248io 0pf+0w

#lapw2 as a single programme (parallel):
running LAPW2 in parallel mode
tmpmach: Subscript out of range.
0.016u 0.043s 0:00.06 83.3% 0+0k 32+24io 0pf+0w
error: command   /lustre/scafellpike/local/apps/intel/wien2k/21.1/lapw2para lapw2.def   failed
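
The one-programme-at-a-time test described above can be sketched as a small helper that runs the steps in order and reports the first one that fails. This is only an illustration under stated assumptions: it assumes the standard command-line form "x <program> -p" (the runs in this post were started from w2web), it assumes a failed step is signalled by a non-zero exit status, which should be checked on the installation in question, and the names first_failing_step and default_runner are hypothetical.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Steps tested individually in this post, in SCF order.
STEPS = ("lapw0", "lapw1", "lapw2")


def default_runner(cmd: Sequence[str]) -> int:
    """Run a command in the current case directory and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def first_failing_step(
    runner: Callable[[Sequence[str]], int] = default_runner,
    steps: Sequence[str] = STEPS,
) -> Optional[str]:
    """Run each step as 'x <step> -p' and return the first one that fails.

    Assumption: failure shows up as a non-zero exit status.
    Returns None if every step reports success.
    """
    for step in steps:
        if runner(["x", step, "-p"]) != 0:
            return step
    return None


if __name__ == "__main__":
    failed = first_failing_step()
    print("all steps finished" if failed is None else f"first failure: {failed}")
```

The runner is passed in as an argument so that the logic can be exercised without a WIEN2k installation, for example with a stand-in function that returns a fixed status for each step.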

Please let me know if you need any more information. I would particularly like to know why the errors occur at lapw2, and in particular what the "tmpmach: Subscript out of range" message means.

Many thanks,
Calum Cunningham