<html><body><div>Hi Indranil,</div><div><br></div><div>I'm sending this again, this time also to the list (I hadn't noticed you removed it), in the hope that it might also be useful for someone else optimizing with gfortran...</div><div><br></div><div>Pavel<br></div><div><br></div><blockquote data-email="pavel.ondracka@email.cz">Well,

<br>

first we need to figure out why your serial lapw1 is so slow...

<br>You definitely don't have the libmvec patches; however, a runtime of almost two

minutes suggests that even your BLAS might be bad.

<br>

<br>In the test_case folder run:

<br>$ grep "TIME HAMILT" test_case.output1

<br>and post the output. Also please go to the Wien2k folder and send the

<br>output of 

<br>$ cat WIEN2k_OPTION

<br>and

<br>$ ldd lapw1

<br>

<br>The next Wien2k version will simplify this; for now, however, some

<br>patching needs to be done. The other option would be to get MKL

and ifort from Intel and use those instead...

<br>

<br>Anyway, if you don't want MKL, you need to download the attached patch

to the SRC_lapw1 folder in the Wien2k base folder.

<br>Go to that folder and apply the patch with (you might need the patch

package for that)

<br>$ patch -p1 < lapw1.patch

<br>then set the FOPT compile flags via siteconfig to: 

<br>-ffree-form -O2 -ffree-line-length-none -march=native -ftree-vectorize

<br>-DHAVE_LIBMVEC -fopenmp

and recompile lapw1.

<br>Now when you run again

<br>$ ldd lapw1

<br>it should show a line with "libmvec.so.1 => /lib64/libmvec.so.1" (the exact path may differ on your distribution)

<br>
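<br>The ldd check above can be scripted roughly as follows. This is only a sketch: has_lib is a hypothetical helper name (not part of Wien2k), and it does nothing more than search the ldd output for a library name.

<pre>
```shell
# has_lib BINARY LIBNAME: succeed if the ldd output for BINARY mentions LIBNAME.
# has_lib is a hypothetical helper name, not a Wien2k tool.
has_lib() {
  ldd "$1" 2>/dev/null | grep -q "$2"
}

# Example, run in the folder that contains the recompiled lapw1:
# if has_lib ./lapw1 libmvec; then echo "libmvec is linked"; else echo "libmvec is missing"; fi
```
</pre>

<br>If the example reports that libmvec is missing, the patch or the -DHAVE_LIBMVEC flag most likely did not make it into the build.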

Compare timings again with the test_case.

<br>Also try:

<br>$ OMP_NUM_THREADS=2 x lapw1

<br>$ OMP_NUM_THREADS=4 x lapw1

<br>

<br>And after each run, post the total timings as well as the output of

<br>$ grep "TIME HAMILT" test_case.output1

<br>Hopefully, you are already linking the multithreaded OpenBLAS (but

I don't know what the Ubuntu default is)...

<br>
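<br>To collect the timings for several thread counts in one go, something like the following sketch might help. time_lapw1 is a hypothetical helper name; it assumes you are inside the test_case folder and that the Wien2k x script is in your PATH.

<pre>
```shell
# time_lapw1 THREADS...: run "x lapw1" once per given thread count and
# print the TIME HAMILT line(s) from test_case.output1 after each run.
# time_lapw1 is a hypothetical helper name, not a Wien2k tool.
time_lapw1() {
  for n in "$@"; do
    echo "== OMP_NUM_THREADS=$n =="
    OMP_NUM_THREADS=$n x lapw1
    grep "TIME HAMILT" test_case.output1
  done
}

# Example: time_lapw1 1 2 4
```
</pre>

<br>The total timings are whatever x lapw1 itself prints at the end of each run, so the output of the loop can be posted as is.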

I'll help you with the parallel execution in the next step.

<br>

<br>Best regards

<br>Pavel

<br>

<br>On Thu, 2019-05-23 at 18:58 +0530, Indranil mal wrote:

<br>> Dear sir 

<br>> 

<br>> After running x lapw1  I got the following 

<br>> 

<br>> ~/test_case$ x lapw1

<br>> STOP  LAPW1 END

<br>> 114.577u 0.247s 1:54.82 99.9%    0+0k 0+51864io 0pf+0w

<br>> 

<br>> I am using parallel k-point execution; only 8 GB of memory is in use, and

<br>> for a 100-atom (100 k-points) calculation it takes around 12 hours

<br>> to complete one cycle. 

<br>> Please help me.

<br>> 

<br>> Thanking you

<br>> 

<br>> Indranil 

<br>> 

<br>> On Thu, May 23, 2019 at 11:22 AM Pavel Ondračka <

<br>> pavel.ondracka@email.cz> wrote:

<br>> > Hi Indranil,

<br>> > 

<br>> > While k-point parallelization is usually the most efficient

<br>> > (provided you have a sufficient number of k-points) and does not need

<br>> > any extra libraries, for a 100-atom case it might be problematic to fit

<br>> > 12 processes into 32 GB of memory. I assume you are already using it,

<br>> > since you claim to run on two cores?

<br>> > 

<br>> > Instead, check the maximum memory requirement of lapw1 when run

<br>> > in serial and, based on that, find out how many processes you can run in

<br>> > parallel; then for each one place a line "1:localhost" into the .machines

<br>> > file (there is no need to copy .machines from templates or use random

<br>> > scripts; instead, read the userguide to understand what you are doing,

<br>> > it will save you time in the long run). If you can run at least a few

<br>> > k-points in parallel, it might be enough to speed it up significantly.

<br>> > 
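<br>A minimal sketch of the step described above, writing one "1:localhost" line per parallel process. write_machines is a hypothetical helper name, and the memory numbers in the example are assumptions, not measurements. It writes only the "1:localhost" lines mentioned above; check the userguide for any other lines your setup needs.

<pre>
```shell
# write_machines N [FILE]: write N lines "1:localhost" to FILE (default: .machines).
# write_machines is a hypothetical helper name; N is the number of parallel
# k-point processes that fit into memory. An existing FILE is overwritten.
write_machines() {
  n=$1
  out=${2:-.machines}
  : > "$out"
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "1:localhost" >> "$out"
    i=$((i + 1))
  done
}

# Example with assumed numbers: about 30 GB usable and roughly 5 GB per
# lapw1 process gives 30 / 5 = 6 processes, so:
# write_machines 6
```
</pre>

<br>Run it in the case directory, then start the calculation with the -p option.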

<br>> > For MPI you would need the openmpi-devel, scalapack-devel and fftw3-devel

<br>> > packages (I'm not sure exactly what they are called on Ubuntu).

<br>> > The scalapack configuration especially could be tricky; it is probably

<br>> > easiest to start with lapw0, as this needs only MPI and fftw.

<br>> > 

<br>> > Also, based on my experience with default gfortran settings, it is

<br>> > likely that you haven't even optimized the single-core performance.

<br>> > Try to download the serial benchmark 

<br>> > http://susi.theochem.tuwien.ac.at/reg_user/benchmark/test_case.tar.gz

<br>> > untar it, run x lapw1 and report the timings (on an average i7 CPU it should

<br>> > take below 30 seconds; if it takes significantly more, you will need

<br>> > some more tweaks).

<br>> > 

<br>> > Best regards

<br>> > Pavel

<br>> > 

<br>> > On Thu, 2019-05-23 at 10:42 +0530, Dr. K. C. Bhamu wrote:

<br>> > > Hi,

<br>> > > 

<br>> > > If you are doing a k-point parallel calculation (with more than 12

<br>> > > k-points in the IBZ), then use the script below in the terminal where

<br>> > > you want to run the calculation, or use it in your job script with the -p

<br>> > > option in run(sp)_lapw (-so).

<br>> > > 

<br>> > > If anyone knows how to repeat the nth line m times in a file, then this

<br>> > > script can be changed.

<br>> > > 

<br>> > > The script below simply copies the machines file from the templates

<br>> > > directory and updates it as per your need.

<br>> > > So you do not need to copy it, open it in your favorite editor and edit it

<br>> > > manually.

<br>> > > 

<br>> > > cp $WIENROOT/SRC_templates/.machines . ; grep localhost .machines |

<br>> > > perl -ne 'print $_ x 6' > LOCALHOST.dat ; tail -n 2 .machines >

<br>> > > grang.dat ; sed '22,25d' .machines > MACHINE.dat ; cat MACHINE.dat

<br>> > > LOCALHOST.dat grang.dat > .machines ; rm LOCALHOST.dat MACHINE.dat

<br>> > > grang.dat

<br>> > > 

<br>> > > regards

<br>> > > Bhamu

<br>> > > 

<br>> > > 

<br>> > > On Wed, May 22, 2019 at 10:52 PM Indranil mal <

<br>> > indranil.mal@gmail.com

<br>> > > > wrote:

<br>> > > > Respected sir/users,

<br>> > > > I am using a PC with an Intel i7 8th gen (with 12 cores),

<br>> > > > 32 GB RAM and a 2 TB HDD, running Ubuntu 18.04 LTS. I have installed

<br>> > > > OpenBLAS-0.2.20 and am using the GNU Fortran and C compilers. I am trying

<br>> > > > to run a system with 100 atoms; only two cores are in use, the rest of

<br>> > > > them are idle, and the calculation is taking too long. I have

<br>> > > > not installed MPI, ScaLAPACK or ELPA. Please help me: what should I

<br>> > > > do to utilize all of the cores of my CPU?

<br>> > > > 

<br>> > > > 

<br>> > > > 

<br>> > > > Thanking you 

<br>> > > > 

<br>> > > > Indranil

<br>> > > > _______________________________________________

<br>> > > > Wien mailing list

<br>> > > > Wien@zeus.theochem.tuwien.ac.at

<br>> > > > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien

<br>> > > > SEARCH the MAILING-LIST at:  

<br>> > > > 

<br>> > http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html

<br>> > > 


<br>> > 


<br></blockquote></body></html>