[Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Peter Blaha
pblaha at theochem.tuwien.ac.at
Fri Oct 18 11:08:31 CEST 2013
As was mentioned before, such a big case needs mpi in order to run
efficiently.
As a "quick" small improvement set the OMP_NUM_THREAD variable to 2 or
4. This should give a speedup of about 2 and in the dayfile you should
see that not 905% of the cpu was used, but 180% or so.
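For example, in a bash job script this could look as follows (a minimal
sketch; the scf command and tolerance are only illustrative, keep whatever
you use now):

   export OMP_NUM_THREADS=2    # threaded BLAS/LAPACK (e.g. mkl) in lapw1 then uses 2 cores
   runsp_lapw -p -ec 0.0001    # the k-point parallel scf cycle, otherwise unchanged

(In a csh-based script use "setenv OMP_NUM_THREADS 2" instead.)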
On 10/18/2013 10:51 AM, Yundi Quan wrote:
> First, thanks, Peter. I should have described my problem more thoroughly.
>
> :RKM : MATRIX SIZE 9190LOs:1944 RKM= 4.88 WEIGHT= 2.00 PGR
>
> The reduced RKM is 4.88. The reduced matrix size is 9190, which is about 2/5 of the full matrix. So that explains a lot. I'm using P1 symmetry; therefore the complex version of lapw1 and lapw2 is used. Compared with an LDA calculation, LSDA almost doubles the time spent in lapw1 and lapw2.
>
> I'm using P1 symmetry. Therefore, symmetry cannot reduce the number of stars (i.e. plane waves) in the interstitial region or the number of spherical harmonics inside the muffin-tin spheres. I guess that's why my job takes so long. Moreover, I'm only using k-point parallelization, without mpi.
>
> Oxygen is the smallest atom in the unit cell. Reducing RKMAX to 6.5 is what I'm going to do first.
>
> One of the clusters to which I have access has 8 cores and 8 GB of memory per node. Given the memory constraint, I wonder how to improve the core usage when calculating compounds with large unit cells. For the compound I'm currently working on, I request one core per node and 8 nodes (= nkp) per job, so 7*8 = 56 cores are wasted while my job runs. I'm in dire need of help.
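> From the user's guide I understand that a .machines file like the following
> sketch would use all 8 cores of each node (one k-point per node, 8 mpi
> processes within each node; node names are placeholders, and this assumes an
> mpi-enabled installation):
>
> 1:node1:8
> 1:node2:8
> 1:node3:8
> 1:node4:8
> 1:node5:8
> 1:node6:8
> 1:node7:8
> 1:node8:8
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> Would that be the right direction?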
>
>
> Yundi
>
>
> On Oct 17, 2013, at 10:58 PM, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:
>
>> You still did not tell us the matrix size for the truncated RKmax, but yes,
>> the scaling is probably ok. (Scaling goes with N^3, i.e. for matrix sizes of
>> 12000 and 24000 we expect almost a factor of (24000/12000)^3 = 8 !!! in cpu time.)
>> It also explains the memory usage ....
>>
>> You also did not tell us if you have inversion or not.
>>
>> One of my real cases with NMAT= 21500 takes 400 sec on 64 cores (mpi), so one
>> could estimate something like 20000 sec on a single core (400 x 64 = 25600 sec,
>> less the parallel overhead), which is in the right order of magnitude compared to your case.
>>
>> And: you may have 72 inequivalent atoms, but you did not tell us how many atoms in total you have.
>> The total number of atoms is the important info !!
>>
>> Probably you can reduce RKMAX (you did not tell us which atom has RMT=1.65; probably O ??),
>> and most likely you should use mpi AND iterative diagonalization.
>>
>> As I said, a case with 72 atoms (or whatever you have) can run in minutes on a reasonable cluster
>> and with a proper optimized setup (not just the defaults).
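>> For example (a sketch; -it switches on iterative diagonalization, -p uses
>> your .machines file; adapt the convergence criterion to your case):
>>
>>    runsp_lapw -p -it -ec 0.0001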
>>
>>
>> On 17.10.2013 18:05, Yundi Quan wrote:
>>> Thanks a lot.
>>> On cluster A, RKM was automatically reduced to 4.88 while on cluster B RKM was kept at 7. I didn't expect this, though I was aware that WIEN2k would automatically reduce
>>> RKM in some cases. But is it reasonable for an iteration to run for eight hours with the following parameters?
>>> Minimum sphere size: 1.65000 Bohr.
>>> Total k-mesh : 8
>>> Gmax : 12
>>>
>>> :RKM : MATRIX SIZE23486LOs:1944 RKM= 7.00 WEIGHT= 2.00 PGR:
>>> :RKM : MATRIX SIZE23486LOs:1944 RKM= 7.00 WEIGHT= 2.00 PGR:
>>>
>>>
>>> On Thu, Oct 17, 2013 at 8:54 AM, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:
>>>
>>> The Xeon X5550 is a 4-core processor, and your cluster may have combined a few of them on one node (2-4 ?). Anyway, 14 cores per node are not really possible ??
>>>
>>> Have you done more than just look at the total time ?
>>>
>>> Is the .machines file the same on both clusters ? Such a .machines file does NOT use mpi.
>>>
>>> One guess in case you really use mpi on cluster B (with a different .machines file): in the sequential run (A) the basis set is limited by NMATMAX; in the mpi-parallel
>>> run it is not (or NMATMAX is scaled up by sqrt(N_cores)).
>>> So it could be that run A has a MUCH smaller RKMAX than run B.
>>> Check grep :RKM case.scf for the two runs.
>>> What are the real matrix sizes ????
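>>> For example, in each case directory (the tail picks the lines of the last iteration):
>>>
>>>    grep :RKM case.scf | tail -2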
>>>
>>> Alternatively, NMATMAX could have been chosen differently on the two machines, since somebody else installed WIEN2k.
>>>
>>> Please compare the resulting case.output1_1 files carefully and, if necessary, send the RELEVANT PARTS OF THEM.
>>>
>>>
>>> In any case, a 72 atom cell should NOT take 2 h / iteration (or even 8 ??).
>>>
>>> What are your sphere sizes ??? What does :RKM in case.scf show ???
>>>
>>> At least one can set OMP_NUM_THREADS=2 or 4 and speed up the code by a factor of almost 2. (You should see in the dayfile something close to 200% instead of ~100%.)
>>>
>>> > c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>>>
>>> In essence: for a matrix size of 10000 (real, with inversion), lapw1 should take on the order of 10 min (no mpi, maybe with OMP_NUM_THREADS=2).
>>>
>>>
>>>
>>> On 10/17/2013 04:33 PM, Yundi Quan wrote:
>>>
>>> Thanks for your reply.
>>> a). both machines are set up in a way that once a node is assigned to a
>>> job, it cannot be assigned to another.
>>> b). The .machines file looks like this
>>> 1:node1
>>> 1:node2
>>> 1:node3
>>> 1:node4
>>> 1:node5
>>> 1:node6
>>> 1:node7
>>> 1:node8
>>> granularity:1
>>> extrafine:1
>>> lapw2_vector_split:1
>>>
>>> I've been trying to avoid using mpi because sometimes mpi can slow down
>>> my calculations due to poor communication between nodes.
>>>
>>> c). the amount of memory available to a core does not seem to be the
>>> problem in my case, because my job runs smoothly on cluster A, where
>>> each node has 8 GB of memory and 8 cores. But my job runs into memory
>>> problems on cluster B, where each core has much more memory available. I
>>> wonder whether there are parameters I should change in WIEN2k to
>>> reduce the memory usage.
>>>
>>> d). My dayfile for a single iteration looks like this. The wallclocks
>>> are around 500 minutes.
>>>
>>>
>>> cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)
>>>
>>> >   lapw0 -p    (02:14:05) starting parallel lapw0 at Fri Oct 11 02:14:06 PDT 2013
>>> -------- .machine0 : processors
>>> running lapw0 in single mode
>>> 1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
>>> >   lapw1 -up -p -c    (02:38:20) starting parallel lapw1 at Fri Oct 11 02:38:20 PDT 2013
>>> ->  starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
>>> running LAPW1 in parallel mode (using .machines)
>>> 8 number_of_parallel_jobs
>>> c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>>> c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3% 0+0k 0+0io 0pf+0w
>>> c1180-ib(1) 25872.609u 18.143s 7:23:53.43 97.2% 0+0k 0+0io 0pf+0w
>>> c1179-ib(1) 26040.482u 17.868s 7:26:38.66 97.2% 0+0k 0+0io 0pf+0w
>>> c1178-ib(1) 26571.271u 17.946s 7:34:16.23 97.5% 0+0k 0+0io 0pf+0w
>>> c1177-ib(1) 27108.070u 34.294s 8:32:55.53 88.1% 0+0k 0+0io 0pf+0w
>>> c1171-ib(1) 26729.399u 14.175s 7:36:22.67 97.6% 0+0k 0+0io 0pf+0w
>>> c0844-ib(1) 25883.863u 47.148s 8:12:35.54 87.7% 0+0k 0+0io 0pf+0w
>>>
>>> Summary of lapw1para:
>>> c1208-ib k=1 user=26558.3 wallclock=454
>>> c1201-ib k=1 user=26845.2 wallclock=459
>>> c1180-ib k=1 user=25872.6 wallclock=443
>>> c1179-ib k=1 user=26040.5 wallclock=446
>>> c1178-ib k=1 user=26571.3 wallclock=454
>>> c1177-ib k=1 user=27108.1 wallclock=512
>>> c1171-ib k=1 user=26729.4 wallclock=456
>>> c0844-ib k=1 user=25883.9 wallclock=492
>>> 97.935u 34.265s 8:32:58.38 0.4% 0+0k 0+0io 0pf+0w
>>> >   lapw1 -dn -p -c    (11:11:19) starting parallel lapw1 at Fri Oct 11 11:11:19 PDT 2013
>>> ->  starting parallel LAPW1 jobs at Fri Oct 11 11:11:19 PDT 2013
>>> running LAPW1 in parallel mode (using .machines.help)
>>> 8 number_of_parallel_jobs
>>> c1208-ib(1) 26474.686u 16.142s 7:33:36.01 97.3% 0+0k 0+0io 0pf+0w
>>> c1201-ib(1) 26099.149u 40.330s 8:04:42.58 89.8% 0+0k 0+0io 0pf+0w
>>> c1180-ib(1) 26809.287u 14.724s 7:38:56.52 97.4% 0+0k 0+0io 0pf+0w
>>> c1179-ib(1) 26007.527u 17.959s 7:26:10.62 97.2% 0+0k 0+0io 0pf+0w
>>> c1178-ib(1) 26565.723u 17.576s 7:35:20.11 97.3% 0+0k 0+0io 0pf+0w
>>> c1177-ib(1) 27114.619u 31.180s 8:21:28.34 90.2% 0+0k 0+0io 0pf+0w
>>> c1171-ib(1) 26474.665u 15.309s 7:33:38.15 97.3% 0+0k 0+0io 0pf+0w
>>> c0844-ib(1) 26586.569u 15.010s 7:35:22.88 97.3% 0+0k 0+0io 0pf+0w
>>> Summary of lapw1para:
>>> c1208-ib k=1 user=26474.7 wallclock=453
>>> c1201-ib k=1 user=26099.1 wallclock=484
>>> c1180-ib k=1 user=26809.3 wallclock=458
>>> c1179-ib k=1 user=26007.5 wallclock=446
>>> c1178-ib k=1 user=26565.7 wallclock=455
>>> c1177-ib k=1 user=27114.6 wallclock=501
>>> c1171-ib k=1 user=26474.7 wallclock=453
>>> c0844-ib k=1 user=26586.6 wallclock=455
>>> 104.607u 18.798s 8:21:30.92 0.4% 0+0k 0+0io 0pf+0w
>>>
>>> > lapw2 -up -p -c (19:32:50) running LAPW2 in parallel mode
>>> c1208-ib 1016.517u 13.674s 17:11.10 99.9% 0+0k 0+0io 0pf+0w
>>> c1201-ib 1017.359u 13.669s 17:11.82 99.9% 0+0k 0+0io 0pf+0w
>>> c1180-ib 1033.056u 13.283s 17:27.07 99.9% 0+0k 0+0io 0pf+0w
>>> c1179-ib 1037.551u 13.447s 17:31.50 99.9% 0+0k 0+0io 0pf+0w
>>> c1178-ib 1019.156u 13.729s 17:13.49 99.9% 0+0k 0+0io 0pf+0w
>>> c1177-ib 1021.878u 13.731s 17:16.07 99.9% 0+0k 0+0io 0pf+0w
>>> c1171-ib 1032.417u 13.681s 17:26.70 99.9% 0+0k 0+0io 0pf+0w
>>> c0844-ib 1022.315u 13.870s 17:16.81 99.9% 0+0k 0+0io 0pf+0w
>>> Summary of lapw2para:
>>> c1208-ib user=1016.52 wallclock=1031.1
>>> c1201-ib user=1017.36 wallclock=1031.82
>>> c1180-ib user=1033.06 wallclock=1047.07
>>> c1179-ib user=1037.55 wallclock=1051.5
>>> c1178-ib user=1019.16 wallclock=1033.49
>>> c1177-ib user=1021.88 wallclock=1036.07
>>> c1171-ib user=1032.42 wallclock=1046.7
>>> c0844-ib user=1022.32 wallclock=1036.81
>>> 31.923u 13.526s 18:20.12 4.1% 0+0k 0+0io 0pf+0w
>>>
>>> > lapw2 -dn -p -c (19:51:10) running LAPW2 in parallel mode
>>> c1208-ib 947.942u 13.364s 16:01.75 99.9% 0+0k 0+0io 0pf+0w
>>> c1201-ib 932.766u 13.640s 15:49.22 99.7% 0+0k 0+0io 0pf+0w
>>> c1180-ib 932.474u 13.609s 15:47.76 99.8% 0+0k 0+0io 0pf+0w
>>> c1179-ib 936.171u 13.691s 15:50.33 99.9% 0+0k 0+0io 0pf+0w
>>> c1178-ib 947.798u 13.493s 16:04.99 99.6% 0+0k 0+0io 0pf+0w
>>> c1177-ib 947.786u 13.350s 16:04.89 99.6% 0+0k 0+0io 0pf+0w
>>> c1171-ib 930.971u 13.874s 15:45.22 99.9% 0+0k 0+0io 0pf+0w
>>> c0844-ib 950.723u 13.426s 16:04.69 99.9% 0+0k 0+0io 0pf+0w
>>> Summary of lapw2para:
>>> c1208-ib user=947.942 wallclock=961.75
>>> c1201-ib user=932.766 wallclock=949.22
>>> c1180-ib user=932.474 wallclock=947.76
>>> c1179-ib user=936.171 wallclock=950.33
>>> c1178-ib user=947.798 wallclock=964.99
>>> c1177-ib user=947.786 wallclock=964.89
>>> c1171-ib user=930.971 wallclock=945.22
>>> c0844-ib user=950.723 wallclock=964.69
>>> 31.522u 13.879s 16:53.13 4.4% 0+0k 0+0io 0pf+0w
>>> >   lcore -up    (20:08:03) 2.993u 0.587s 0:03.75 95.2% 0+0k 0+0io 0pf+0w
>>> >   lcore -dn    (20:08:07) 2.843u 0.687s 0:03.66 96.1% 0+0k 0+0io 0pf+0w
>>> >   mixer        (20:08:21) 23.206u 32.513s 0:56.63 98.3% 0+0k 0+0io 0pf+0w
>>>
>>> :ENERGY convergence: 0 0.00001 416.9302585700000000
>>> :CHARGE convergence: 0 0.0000 3.6278086
>>>
>>>
>>> On Thu, Oct 17, 2013 at 7:11 AM, Laurence Marks
<L-marks at northwestern.edu> wrote:
>>>
>>> There are so many possibilities, a few:
>>>
>>> a) If you only request 1 core/node most queuing systems (qsub/msub
>>> etc) will allocate the other cores to other jobs. You are then going
>>> to be very dependent upon what those other jobs are doing. Normal is
>>> to use all the cores on a given node.
>>>
>>> b) When you run on cluster B, in addition to a) it is going to be
>>> inefficient to run with mpi communication across nodes; it is much
>>> better to run on a given node across cores. Are you using a .machines
>>> file with eight "1:nodeA" lines (for instance), or one with a single
>>> "1:nodeA nodeB ..." line? The first does not use mpi, the second does.
>>> To use mpi within a node you would use lines such as 1:node:8 (see the
>>> sketch after this list). Knowledge of your .machines file will help people assist you.
>>>
>>> c) The memory on those clusters is very small; whoever bought them was
>>> not thinking about large-scale jobs. I look for at least 4 GB/core, and
>>> 2 GB/core is barely acceptable. You are going to have to use mpi.
>>>
>>> d) All mpi is equal, but some mpi is more equal than others. Depending
>>> upon whether you have infiniband or ethernet, openmpi or impi, and how
>>> everything was compiled, you can see enormous differences. One thing to
>>> look at is the difference between the cpu time and the wall time (both in
>>> case.dayfile and at the bottom of case.output1_*; see the check after
>>> this list). With a good mpi setup the wall time should be 5-10% more
>>> than the cpu time; with a bad setup it can be several times it.
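>>> As a sketch for b) (node names are placeholders):
>>>
>>> # eight k-point parallel serial jobs, no mpi:
>>> 1:nodeA
>>> 1:nodeB
>>> ...
>>> # one mpi job using all 8 cores of a node:
>>> 1:nodeA:8
>>>
>>> And as a quick check for d), the dayfile summary lines quoted above already
>>> contain both numbers:
>>>
>>> grep wallclock case.dayfile   # user= is cpu seconds, wallclock= apparently minutes here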
>>>
>>> On Thu, Oct 17, 2013 at 8:44 AM, Yundi Quan <quanyundi at gmail.com> wrote:
>>> > Hi,
>>> > I have access to two clusters as a low-level user. One cluster (cluster A)
>>> > consists of nodes with 8 cores and 8 GB of memory per node. The other cluster
>>> > (cluster B) has 24 GB of memory per node, and each node has 14 cores or more.
>>> > The cores on cluster A are Xeon E5620 at 2.40GHz, while the cores on cluster B
>>> > are Xeon X5550 at 2.67GHz. From the specifications (2.40GHz + 12288 KB cache
>>> > vs 2.67GHz + 8192 KB cache), the two machines should be very close in
>>> > performance. But it does not seem to be so.
>>> >
>>> > I have a job with 72 atoms per unit cell. I initialized the job on cluster A
>>> > and ran it for a few iterations. Each iteration took 2 hours. Then I moved
>>> > the job to cluster B (14 cores per node at 2.67GHz). Now it takes more
>>> > than 8 hours to finish one iteration. On both clusters, I request one core
>>> > per node and 8 nodes per job (8 is the number of k-points). I compiled
>>> > WIEN2k_13 on cluster A without mpi. On cluster B, WIEN2k_12 was compiled by
>>> > the administrator with mpi.
>>> >
>>> > What could have caused the poor performance of cluster B? Is it because of MPI?
>>> >
>>> > An unrelated question: sometimes memory runs out on cluster B, which has
>>> > 24 GB of memory per node, even though the same job runs smoothly on cluster A,
>>> > which only has 8 GB per node.
>>> >
>>> > Thanks.
>>>
>>>
>>>
>>> --
>>> Professor Laurence Marks
>>> Department of Materials Science and Engineering
>>> Northwestern University
www.numis.northwestern.edu
1-847-491-3996
>>>
>>> "Research is to see what everybody else has seen, and to think what
>>> nobody else has thought"
>>> Albert Szent-Gyorgi
>>>
>>>
>>> --
>>>
>>> P.Blaha
>>> ------------------------------__------------------------------__--------------
>>> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
>>> Phone: +43-1-58801-165300 <tel:%2B43-1-58801-165300> FAX: +43-1-58801-165982 <tel:%2B43-1-58801-165982>
>>> Email: blaha at theochem.tuwien.ac.at <mailto:blaha at theochem.tuwien.ac.at> WWW: http://info.tuwien.ac.at/__theochem/ <http://info.tuwien.ac.at/theochem/>
>>> ------------------------------__------------------------------__--------------
>>>
>>
>> --
>> -----------------------------------------
>> Peter Blaha
>> Inst. Materials Chemistry, TU Vienna
>> Getreidemarkt 9, A-1060 Vienna, Austria
>> Tel: +43-1-5880115671
>> Fax: +43-1-5880115698
>> email: pblaha at theochem.tuwien.ac.at
>> -----------------------------------------
>
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WWW:
http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------