[Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Peter Blaha
pblaha at theochem.tuwien.ac.at
Fri Oct 18 11:08:31 CEST 2013
As was mentioned before, such a big case needs mpi in order to run
efficiently.
As a "quick" small improvement set the OMP_NUM_THREAD variable to 2 or
4. This should give a speedup of about 2 and in the dayfile you should
see that not 905% of the cpu was used, but 180% or so.
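For example, in a bash job script this could look as follows (a minimal
sketch; the scf command and tolerance are only illustrative, keep whatever
you use now):

   export OMP_NUM_THREADS=2    # threaded BLAS/LAPACK (e.g. mkl) in lapw1 then uses 2 cores
   runsp_lapw -p -ec 0.0001    # the k-point parallel scf cycle, otherwise unchanged

(In a csh-based script use "setenv OMP_NUM_THREADS 2" instead.)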
On 10/18/2013 10:51 AM, Yundi Quan wrote:
> First, thanks, Peter. I should have described my problem more thoroughly.
>
> :RKM : MATRIX SIZE 9190LOs:1944 RKM= 4.88 WEIGHT= 2.00 PGR
>
> The reduced RKM is 4.88. The reduced matrix size is 9190, which is about 2/5 of the full matrix. So that explains a lot. I'm using P1 symmetry; therefore the complex version of lapw1 and lapw2 is used. Compared with an LDA calculation, LSDA almost doubles the time spent in lapw1 and lapw2.
>
> I'm using P1 symmetry. Therefore, symmetry cannot reduce the number of stars (i.e. plane waves) in the interstitial region or the number of spherical harmonics inside the muffin-tin spheres. I guess that's why my job takes so long. Moreover, I'm only using k-point parallelization, without mpi.
>
> Oxygen is the smallest atom in the unit cell. Reducing RKMAX to 6.5 is what I'm going to do first.
>
> One of the clusters to which I have access has 8 cores and 8 GB of memory per node. Given the memory constraint, I wonder how to improve the core usage when calculating compounds with large unit cells. For the compound I'm currently working on, I request one core per node and 8 nodes (= nkp) per job, so 7*8 = 56 cores are wasted while my job runs. I'm in dire need of help.
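> From the user's guide I understand that a .machines file like the following
> sketch would use all 8 cores of each node (one k-point per node, 8 mpi
> processes within each node; node names are placeholders, and this assumes an
> mpi-enabled installation):
>
> 1:node1:8
> 1:node2:8
> 1:node3:8
> 1:node4:8
> 1:node5:8
> 1:node6:8
> 1:node7:8
> 1:node8:8
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> Would that be the right direction?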
>
>
> Yundi
>
>
> On Oct 17, 2013, at 10:58 PM, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:
>
>> You still did not tell us the matrix size for the truncated RKmax, but yes,
>> the scaling is probably ok. (Scaling goes with N^3, i.e. for matrix sizes of
>> 12000 and 24000 we expect almost a factor of (24000/12000)^3 = 8 !!! in cpu time.)
>> It also explains the memory usage ....
>>
>> You also did not tell us if you have inversion or not.
>>
>> One of my real cases with NMAT= 21500 takes 400 sec on 64 cores (mpi), so one
>> could estimate something like 20000 sec on a single core (400 x 64 = 25600 sec,
>> less the parallel overhead), which is in the right order of magnitude compared to your case.
>>
>> And: you may have 72 inequivalent atoms, but you did not tell us how many atoms in total you have.
>> The total number of atoms is the important info !!
>>
>> Probably you can reduce RKMAX (you did not tell us which atom has RMT=1.65; probably O ??),
>> and most likely you should use mpi AND iterative diagonalization.
>>
>> As I said, a case with 72 atoms (or whatever you have) can run in minutes on a reasonable cluster
>> and with a proper optimized setup (not just the defaults).
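>> For example (a sketch; -it switches on iterative diagonalization, -p uses
>> your .machines file; adapt the convergence criterion to your case):
>>
>>    runsp_lapw -p -it -ec 0.0001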
>>
>>
>> On 17.10.2013 18:05, Yundi Quan wrote:
>>> Thanks a lot.
>>> On cluster A, RKM was automatically reduced to 4.88 while on cluster B RKM was kept at 7. I didn't expect this, though I was aware that WIEN2k would automatically reduce
>>> RKM in some cases. But is it reasonable for an iteration to run for eight hours with the following parameters?
>>> Minimum sphere size: 1.65000 Bohr.
>>> Total k-mesh : 8
>>> Gmax : 12
>>>
>>> :RKM : MATRIX SIZE23486LOs:1944 RKM= 7.00 WEIGHT= 2.00 PGR:
>>> :RKM : MATRIX SIZE23486LOs:1944 RKM= 7.00 WEIGHT= 2.00 PGR:
>>>
>>>
>>> On Thu, Oct 17, 2013 at 8:54 AM, Peter Blaha <pblaha at theochem.tuwien.ac.at> wrote:
>>>
>>> The Xeon X5550 is a 4-core processor, and your cluster may have combined a few of them on one node (2-4 ?). Anyway, 14 cores per node are not really possible ??
>>>
>>> Have you done more than just look at the total time ?
>>>
>>> Is the .machines file the same on both clusters ? Such a .machines file does NOT use mpi.
>>>
>>> One guess in case you really use mpi on cluster B (with a different .machines file): in the sequential run (A) the basis set is limited by NMATMAX; in the mpi-parallel
>>> run it is not (or NMATMAX is scaled up by sqrt(N_cores)).
>>> So it could be that run A has a MUCH smaller RKMAX than run B.
>>> Check grep :RKM case.scf for the two runs.
>>> What are the real matrix sizes ????
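>>> For example, in each case directory (the tail picks the lines of the last iteration):
>>>
>>>    grep :RKM case.scf | tail -2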
>>>
>>> Alternatively, NMATMAX could have been chosen differently on the two machines, since somebody else installed WIEN2k.
>>>
>>> Please compare the resulting case.output1_1 files carefully and, if necessary, send the RELEVANT PARTS OF THEM.
>>>
>>>
>>> In any case, a 72 atom cell should NOT take 2 h / iteration (or even 8 ??).
>>>
>>> What are your sphere sizes ??? What does :RKM in case.scf show ???
>>>
>>> At least one can set OMP_NUM_THREADS=2 or 4 and speed up the code by a factor of almost 2. (You should see in the dayfile something close to 200% instead of ~100%.)
>>>
>>> > c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>>>
>>> In essence: for a matrix size of 10000 (real, with inversion), lapw1 should take on the order of 10 min (no mpi, maybe with OMP_NUM_THREADS=2).
>>>
>>>
>>>
>>> On 10/17/2013 04:33 PM, Yundi Quan wrote:
>>>
>>> Thanks for your reply.
>>> a). both machines are set up in a way that once a node is assigned to a
>>> job, it cannot be assigned to another.
>>> b). The .machines file looks like this
>>> 1:node1
>>> 1:node2
>>> 1:node3
>>> 1:node4
>>> 1:node5
>>> 1:node6
>>> 1:node7
>>> 1:node8
>>> granularity:1
>>> extrafine:1
>>> lapw2_vector_split:1
>>>
>>> I've been trying to avoid using mpi because sometimes mpi can slow down
>>> my calculations due to poor communication between nodes.
>>>
>>> c). the amount of memory available to a core does not seem to be the
>>> problem in my case, because my job runs smoothly on cluster A, where
>>> each node has 8 GB of memory and 8 cores. But my job runs into memory
>>> problems on cluster B, where each core has much more memory available. I
>>> wonder whether there are parameters I should change in WIEN2k to
>>> reduce the memory usage.
>>>
>>> d). My dayfile for a single iteration looks like this. The wallclocks
>>> are around 500 minutes.
>>>
>>>
>>> cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)
>>>
>>> >   lapw0 -p    (02:14:05) starting parallel lapw0 at Fri Oct 11 02:14:06 PDT 2013
>>> -------- .machine0 : processors
>>> running lapw0 in single mode
>>> 1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
>>> >   lapw1 -up -p -c    (02:38:20) starting parallel lapw1 at Fri Oct 11 02:38:20 PDT 2013
>>> ->  starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
>>> running LAPW1 in parallel mode (using .machines)
>>> 8 number_of_parallel_jobs
>>> c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>>> c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3% 0+0k 0+0io 0pf+0w
>>> c1180-ib(1) 25872.609u 18.143s 7:23:53.43 97.2% 0+0k 0+0io 0pf+0w
>>> c1179-ib(1) 26040.482u 17.868s 7:26:38.66 97.2% 0+0k 0+0io 0pf+0w
>>> c1178-ib(1) 26571.271u 17.946s 7:34:16.23 97.5% 0+0k 0+0io 0pf+0w
>>> c1177-ib(1) 27108.070u 34.294s 8:32:55.53 88.1% 0+0k 0+0io 0pf+0w
>>> c1171-ib(1) 26729.399u 14.175s 7:36:22.67 97.6% 0+0k 0+0io 0pf+0w
>>> c0844-ib(1) 25883.863u 47.148s 8:12:35.54 87.7% 0+0k 0+0io 0pf+0w
>>>
>>> Summary of lapw1para:
>>> c1208-ib k=1 user=26558.3 wallclock=454
>>> c1201-ib k=1 user=26845.2 wallclock=459
>>> c1180-ib k=1 user=25872.6 wallclock=443
>>> c1179-ib k=1 user=26040.5 wallclock=446
>>> c1178-ib k=1 user=26571.3 wallclock=454
>>> c1177-ib k=1 user=27108.1 wallclock=512
>>> c1171-ib k=1 user=26729.4 wallclock=456
>>> c0844-ib k=1 user=25883.9 wallclock=492
>>> 97.935u 34.265s 8:32:58.38 0.4% 0+0k 0+0io 0pf+0w
>>> >   lapw1 -dn -p -c    (11:11:19) starting parallel lapw1 at Fri Oct 11 11:11:19 PDT 2013
>>> ->  starting parallel LAPW1 jobs at Fri Oct 11 11:11:19 PDT 2013
>>> running LAPW1 in parallel mode (using .machines.help)
>>> 8 number_of_parallel_jobs
>>> c1208-ib(1) 26474.686u 16.142s 7:33:36.01 97.3% 0+0k 0+0io 0pf+0w
>>> c1201-ib(1) 26099.149u 40.330s 8:04:42.58 89.8% 0+0k 0+0io 0pf+0w
>>> c1180-ib(1) 26809.287u 14.724s 7:38:56.52 97.4% 0+0k 0+0io 0pf+0w
>>> c1179-ib(1) 26007.527u 17.959s 7:26:10.62 97.2% 0+0k 0+0io 0pf+0w
>>> c1178-ib(1) 26565.723u 17.576s 7:35:20.11 97.3% 0+0k 0+0io 0pf+0w
>>> c1177-ib(1) 27114.619u 31.180s 8:21:28.34 90.2% 0+0k 0+0io 0pf+0w
>>> c1171-ib(1) 26474.665u 15.309s 7:33:38.15 97.3% 0+0k 0+0io 0pf+0w
>>> c0844-ib(1) 26586.569u 15.010s 7:35:22.88 97.3% 0+0k 0+0io 0pf+0w
>>> Summary of lapw1para:
>>> c1208-ib k=1 user=26474.7 wallclock=453
>>> c1201-ib k=1 user=26099.1 wallclock=484
>>> c1180-ib k=1 user=26809.3 wallclock=458
>>> c1179-ib k=1 user=26007.5 wallclock=446
>>> c1178-ib k=1 user=26565.7 wallclock=455
>>> c1177-ib k=1 user=27114.6 wallclock=501
>>> c1171-ib k=1 user=26474.7 wallclock=453
>>> c0844-ib k=1 user=26586.6 wallclock=455
>>> 104.607u 18.798s 8:21:30.92 0.4% 0+0k 0+0io 0pf+0w
>>>
>>> > lapw2 -up -p -c (19:32:50) running LAPW2 in parallel mode
>>> c1208-ib 1016.517u 13.674s 17:11.10 99.9% 0+0k 0+0io 0pf+0w
>>> c1201-ib 1017.359u 13.669s 17:11.82 99.9% 0+0k 0+0io 0pf+0w
>>> c1180-ib 1033.056u 13.283s 17:27.07 99.9% 0+0k 0+0io 0pf+0w
>>> c1179-ib 1037.551u 13.447s 17:31.50 99.9% 0+0k 0+0io 0pf+0w
>>> c1178-ib 1019.156u 13.729s 17:13.49 99.9% 0+0k 0+0io 0pf+0w
>>> c1177-ib 1021.878u 13.731s 17:16.07 99.9% 0+0k 0+0io 0pf+0w
>>> c1171-ib 1032.417u 13.681s 17:26.70 99.9% 0+0k 0+0io 0pf+0w
>>> c0844-ib 1022.315u 13.870s 17:16.81 99.9% 0+0k 0+0io 0pf+0w
>>> Summary of lapw2para:
>>> c1208-ib user=1016.52 wallclock=1031.1
>>> c1201-ib user=1017.36 wallclock=1031.82
>>> c1180-ib user=1033.06 wallclock=1047.07
>>> c1179-ib user=1037.55 wallclock=1051.5
>>> c1178-ib user=1019.16 wallclock=1033.49
>>> c1177-ib user=1021.88 wallclock=1036.07
>>> c1171-ib user=1032.42 wallclock=1046.7
>>> c0844-ib user=1022.32 wallclock=1036.81
>>> 31.923u 13.526s 18:20.12 4.1% 0+0k 0+0io 0pf+0w
>>>
>>> > lapw2 -dn -p -c (19:51:10) running LAPW2 in parallel mode
>>> c1208-ib 947.942u 13.364s 16:01.75 99.9% 0+0k 0+0io 0pf+0w
>>> c1201-ib 932.766u 13.640s 15:49.22 99.7% 0+0k 0+0io 0pf+0w
>>> c1180-ib 932.474u 13.609s 15:47.76 99.8% 0+0k 0+0io 0pf+0w
>>> c1179-ib 936.171u 13.691s 15:50.33 99.9% 0+0k 0+0io 0pf+0w
>>> c1178-ib 947.798u 13.493s 16:04.99 99.6% 0+0k 0+0io 0pf+0w
>>> c1177-ib 947.786u 13.350s 16:04.89 99.6% 0+0k 0+0io 0pf+0w
>>> c1171-ib 930.971u 13.874s 15:45.22 99.9% 0+0k 0+0io 0pf+0w
>>> c0844-ib 950.723u 13.426s 16:04.69 99.9% 0+0k 0+0io 0pf+0w
>>> Summary of lapw2para:
>>> c1208-ib user=947.942 wallclock=961.75
>>> c1201-ib user=932.766 wallclock=949.22
>>> c1180-ib user=932.474 wallclock=947.76
>>> c1179-ib user=936.171 wallclock=950.33
>>> c1178-ib user=947.798 wallclock=964.99
>>> c1177-ib user=947.786 wallclock=964.89
>>> c1171-ib user=930.971 wallclock=945.22
>>> c0844-ib user=950.723 wallclock=964.69
>>> 31.522u 13.879s 16:53.13 4.4% 0+0k 0+0io 0pf+0w
>>> >   lcore -up    (20:08:03) 2.993u 0.587s 0:03.75 95.2% 0+0k 0+0io 0pf+0w
>>> >   lcore -dn    (20:08:07) 2.843u 0.687s 0:03.66 96.1% 0+0k 0+0io 0pf+0w
>>> >   mixer        (20:08:21) 23.206u 32.513s 0:56.63 98.3% 0+0k 0+0io 0pf+0w
>>>
>>> :ENERGY convergence: 0 0.00001 416.9302585700000000
>>> :CHARGE convergence: 0 0.0000 3.6278086
>>>
>>>
>>> On Thu, Oct 17, 2013 at 7:11 AM, Laurence Marks
<L-marks at northwestern.edu> wrote:
>>>
>>> There are so many possibilities, a few:
>>>
>>> a) If you only request 1 core/node most queuing systems (qsub/msub
>>> etc) will allocate the other cores to other jobs. You are then going
>>> to be very dependent upon what those other jobs are doing. Normal is
>>> to use all the cores on a given node.
>>>
>>> b) When you run on cluster B, in addition to a) it is going to be
>>> inefficient to run with mpi communication across nodes; it is much
>>> better to run on a given node across cores. Are you using a .machines
>>> file with eight "1:nodeA" lines (for instance), or one with a single
>>> "1:nodeA nodeB ..." line? The first does not use mpi, the second does.
>>> To use mpi within a node you would use lines such as 1:node:8 (see the
>>> sketch after this list). Knowledge of your .machines file will help people assist you.
>>>
>>> c) The memory on those clusters is very small; whoever bought them was
>>> not thinking about large-scale jobs. I look for at least 4 GB/core, and
>>> 2 GB/core is barely acceptable. You are going to have to use mpi.
>>>
>>> d) All mpi is equal, but some mpi is more equal than others. Depending
>>> upon whether you have infiniband or ethernet, openmpi or impi, and how
>>> everything was compiled, you can see enormous differences. One thing to
>>> look at is the difference between the cpu time and the wall time (both in
>>> case.dayfile and at the bottom of case.output1_*; see the check after
>>> this list). With a good mpi setup the wall time should be 5-10% more
>>> than the cpu time; with a bad setup it can be several times it.
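>>> As a sketch for b) (node names are placeholders):
>>>
>>> # eight k-point parallel serial jobs, no mpi:
>>> 1:nodeA
>>> 1:nodeB
>>> ...
>>> # one mpi job using all 8 cores of a node:
>>> 1:nodeA:8
>>>
>>> And as a quick check for d), the dayfile summary lines quoted above already
>>> contain both numbers:
>>>
>>> grep wallclock case.dayfile   # user= is cpu seconds, wallclock= apparently minutes here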
>>>
>>> On Thu, Oct 17, 2013 at 8:44 AM, Yundi Quan <quanyundi at gmail.com> wrote:
>>> > Hi,
>>> > I have access to two clusters as a low-level user. One cluster (cluster A)
>>> > consists of nodes with 8 cores and 8 GB of memory per node. The other cluster
>>> > (cluster B) has 24 GB of memory per node, and each node has 14 cores or more.
>>> > The cores on cluster A are Xeon E5620 at 2.40GHz, while the cores on cluster B
>>> > are Xeon X5550 at 2.67GHz. From the specifications (2.40GHz + 12288 KB cache
>>> > vs 2.67GHz + 8192 KB cache), the two machines should be very close in
>>> > performance. But it does not seem to be so.
>>> >
>>> > I have a job with 72 atoms per unit cell. I initialized the job on cluster A
>>> > and ran it for a few iterations. Each iteration took 2 hours. Then I moved
>>> > the job to cluster B (14 cores per node at 2.67GHz). Now it takes more
>>> > than 8 hours to finish one iteration. On both clusters, I request one core
>>> > per node and 8 nodes per job (8 is the number of k-points). I compiled
>>> > WIEN2k_13 on cluster A without mpi. On cluster B, WIEN2k_12 was compiled by
>>> > the administrator with mpi.
>>> >
>>> > What could have caused the poor performance of cluster B? Is it because of MPI?
>>> >
>>> > An unrelated question: sometimes memory runs out on cluster B, which has
>>> > 24 GB of memory per node, even though the same job runs smoothly on cluster A,
>>> > which only has 8 GB per node.
>>> >
>>> > Thanks.
>>>
>>>
>>>
>>> --
>>> Professor Laurence Marks
>>> Department of Materials Science and Engineering
>>> Northwestern University
www.numis.northwestern.edu
1-847-491-3996
>>>
>>> "Research is to see what everybody else has seen, and to think what
>>> nobody else has thought"
>>> Albert Szent-Gyorgi
>>>
>>>
>>> --
>>>
>>> P.Blaha
>>> ------------------------------__------------------------------__--------------
>>> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
>>> Phone: +43-1-58801-165300 <tel:%2B43-1-58801-165300> FAX: +43-1-58801-165982 <tel:%2B43-1-58801-165982>
>>> Email: blaha at theochem.tuwien.ac.at <mailto:blaha at theochem.tuwien.ac.at> WWW: http://info.tuwien.ac.at/__theochem/ <http://info.tuwien.ac.at/theochem/>
>>> ------------------------------__------------------------------__--------------
>>>
>>
>> --
>> -----------------------------------------
>> Peter Blaha
>> Inst. Materials Chemistry, TU Vienna
>> Getreidemarkt 9, A-1060 Vienna, Austria
>> Tel: +43-1-5880115671
>> Fax: +43-1-5880115698
>> email: pblaha at theochem.tuwien.ac.at
>> -----------------------------------------
>
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WWW:
http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------