[Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Yundi Quan
quanyundi at gmail.com
Thu Oct 17 18:05:45 CEST 2013
Thanks a lot.
On cluster A, RKM was automatically reduced to 4.88 while on cluster B RKM
was kept at 7. I didn't expect this, though I was aware that WIEN2k would
automatically reduce RKM in some cases. But is it reasonable for an
iteration to run for eight hours with the following parameters?
Minimum sphere size: 1.65000 Bohr.
Total k-mesh : 8
Gmax : 12
:RKM  : MATRIX SIZE 23486  LOs: 1944  RKM= 7.00  WEIGHT= 2.00  PGR:
:RKM  : MATRIX SIZE 23486  LOs: 1944  RKM= 7.00  WEIGHT= 2.00  PGR:
On Thu, Oct 17, 2013 at 8:54 AM, Peter Blaha
<pblaha at theochem.tuwien.ac.at> wrote:
> The Xeon X5550 is a 4-core processor, and your cluster may have combined a
> few of them on one node (2-4?). Anyway, 14 cores are not really
> possible ??
>
> Have you done more than just looking at the total time ?
>
> Is the machines file the same on both clusters ? Such a machines file does
> NOT use mpi.
>
> One guess in case you really use mpi on cluster B (with a different
> .machines file): In the sequential run (A) the basis set is limited by
> NMATMAX, in the mpi-parallel run it is not (or it is scaled up by
> sqrt(N-core)).
> So it could be that run A has a MUCH smaller RKMAX than run (B).
> Check grep :RKM case.scf of the two runs.
> What are the real matrix sizes ????
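>
> (A minimal way to make that comparison -- just a sketch; clusterA/ and
> clusterB/ are only placeholders for wherever the two case directories live:)
>
>   grep :RKM clusterA/case.scf | tail -1
>   grep :RKM clusterB/case.scf | tail -1
>
> The MATRIX SIZE printed on those lines is the number that matters: the
> basis size grows roughly as RKMAX^3, and the lapw1 diagonalization time
> roughly as the cube of the matrix size.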
>
> Alternatively, NMATMAX could be chosen differently on the two machines
> since somebody else installed WIEN2k.
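>
> (To check that, a one-line sketch -- assuming $WIENROOT points at the
> respective installation, where NMATMAX is set in SRC_lapw1/param.inc at
> install time:)
>
>   grep -i nmatmax $WIENROOT/SRC_lapw1/param.inc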
>
> Please compare carefully the resulting case.output1_1 files and, if
> necessary, send the RELEVANT PARTS OF THEM.
>
>
> In any case, a 72 atom cell should NOT take 2 h / iteration (or even 8 ??).
>
> What are your sphere sizes ??? What does :RKM give in case.scf ???
>
> At least one can set OMP_NUM_THREADS=2 or 4 and speed up the code by a
> factor of almost 2. (You should see in the dayfile something close to 200 %
> instead of ~100%.)
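>
> (For example, in the job script, before the scf cycle is started -- a
> sketch assuming a bash-type shell and the spin-polarized run script; under
> csh use "setenv OMP_NUM_THREADS 2" instead:)
>
>   export OMP_NUM_THREADS=2      # let the threaded BLAS/LAPACK (e.g. MKL) in lapw1 use 2 cores
>   runsp_lapw -p -ec 0.00001     # -p: k-point parallel using .machines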
>
> > c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>
> In essence: for a matrix size of 10000 (real, with inversion), lapw1 should
> take on the order of 10 min (no mpi, maybe with OMP_NUM_THREADS=2).
>
>
>
> On 10/17/2013 04:33 PM, Yundi Quan wrote:
>
>> Thanks for your reply.
>> a). Both machines are set up in a way that once a node is assigned to a
>> job, it cannot be assigned to another.
>> b). The .machines file looks like this
>> 1:node1
>> 1:node2
>> 1:node3
>> 1:node4
>> 1:node5
>> 1:node6
>> 1:node7
>> 1:node8
>> granularity:1
>> extrafine:1
>> lapw2_vector_split:1
>>
>> I've been trying to avoid using mpi because sometimes mpi can slow down
>> my calculations because of poor communication between nodes.
>>
>> c). The amount of memory available to a core does not seem to be the
>> problem in my case, because my job runs smoothly on cluster A (where each
>> node has 8 G memory and 8 cores), but it runs into memory problems on
>> cluster B, where each core has much more memory available. I wonder
>> whether there are parameters I should change in WIEN2k to reduce the
>> memory usage.
>>
>> d). My dayfile for a single iteration looks like this. The wallclocks
>> are around 500.
>>
>>
>> cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)
>>
>> >   lapw0 -p (02:14:05) starting parallel lapw0 at Fri Oct 11 02:14:06 PDT 2013
>> -------- .machine0 : processors
>> running lapw0 in single mode
>> 1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
>> >   lapw1 -up -p -c (02:38:20) starting parallel lapw1 at Fri Oct 11 02:38:20 PDT 2013
>> -> starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
>> running LAPW1 in parallel mode (using .machines)
>> 8 number_of_parallel_jobs
>> c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>> c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3% 0+0k 0+0io 0pf+0w
>> c1180-ib(1) 25872.609u 18.143s 7:23:53.43 97.2% 0+0k 0+0io 0pf+0w
>> c1179-ib(1) 26040.482u 17.868s 7:26:38.66 97.2% 0+0k 0+0io 0pf+0w
>> c1178-ib(1) 26571.271u 17.946s 7:34:16.23 97.5% 0+0k 0+0io 0pf+0w
>> c1177-ib(1) 27108.070u 34.294s 8:32:55.53 88.1% 0+0k 0+0io 0pf+0w
>> c1171-ib(1) 26729.399u 14.175s 7:36:22.67 97.6% 0+0k 0+0io 0pf+0w
>> c0844-ib(1) 25883.863u 47.148s 8:12:35.54 87.7% 0+0k 0+0io 0pf+0w
>>
>> Summary of lapw1para:
>> c1208-ib k=1 user=26558.3 wallclock=454
>> c1201-ib k=1 user=26845.2 wallclock=459
>> c1180-ib k=1 user=25872.6 wallclock=443
>> c1179-ib k=1 user=26040.5 wallclock=446
>> c1178-ib k=1 user=26571.3 wallclock=454
>> c1177-ib k=1 user=27108.1 wallclock=512
>> c1171-ib k=1 user=26729.4 wallclock=456
>> c0844-ib k=1 user=25883.9 wallclock=492
>> 97.935u 34.265s 8:32:58.38 0.4% 0+0k 0+0io 0pf+0w
>> >   lapw1 -dn -p -c (11:11:19) starting parallel lapw1 at Fri Oct 11 11:11:19 PDT 2013
>> -> starting parallel LAPW1 jobs at Fri Oct 11 11:11:19 PDT 2013
>> running LAPW1 in parallel mode (using .machines.help)
>> 8 number_of_parallel_jobs
>> c1208-ib(1) 26474.686u 16.142s 7:33:36.01 97.3% 0+0k 0+0io 0pf+0w
>> c1201-ib(1) 26099.149u 40.330s 8:04:42.58 89.8% 0+0k 0+0io 0pf+0w
>> c1180-ib(1) 26809.287u 14.724s 7:38:56.52 97.4% 0+0k 0+0io 0pf+0w
>> c1179-ib(1) 26007.527u 17.959s 7:26:10.62 97.2% 0+0k 0+0io 0pf+0w
>> c1178-ib(1) 26565.723u 17.576s 7:35:20.11 97.3% 0+0k 0+0io 0pf+0w
>> c1177-ib(1) 27114.619u 31.180s 8:21:28.34 90.2% 0+0k 0+0io 0pf+0w
>> c1171-ib(1) 26474.665u 15.309s 7:33:38.15 97.3% 0+0k 0+0io 0pf+0w
>> c0844-ib(1) 26586.569u 15.010s 7:35:22.88 97.3% 0+0k 0+0io 0pf+0w
>> Summary of lapw1para:
>> c1208-ib k=1 user=26474.7 wallclock=453
>> c1201-ib k=1 user=26099.1 wallclock=484
>> c1180-ib k=1 user=26809.3 wallclock=458
>> c1179-ib k=1 user=26007.5 wallclock=446
>> c1178-ib k=1 user=26565.7 wallclock=455
>> c1177-ib k=1 user=27114.6 wallclock=501
>> c1171-ib k=1 user=26474.7 wallclock=453
>> c0844-ib k=1 user=26586.6 wallclock=455
>> 104.607u 18.798s 8:21:30.92 0.4% 0+0k 0+0io 0pf+0w
>>
>> > lapw2 -up -p -c (19:32:50) running LAPW2 in parallel mode
>> c1208-ib 1016.517u 13.674s 17:11.10 99.9% 0+0k 0+0io 0pf+0w
>> c1201-ib 1017.359u 13.669s 17:11.82 99.9% 0+0k 0+0io 0pf+0w
>> c1180-ib 1033.056u 13.283s 17:27.07 99.9% 0+0k 0+0io 0pf+0w
>> c1179-ib 1037.551u 13.447s 17:31.50 99.9% 0+0k 0+0io 0pf+0w
>> c1178-ib 1019.156u 13.729s 17:13.49 99.9% 0+0k 0+0io 0pf+0w
>> c1177-ib 1021.878u 13.731s 17:16.07 99.9% 0+0k 0+0io 0pf+0w
>> c1171-ib 1032.417u 13.681s 17:26.70 99.9% 0+0k 0+0io 0pf+0w
>> c0844-ib 1022.315u 13.870s 17:16.81 99.9% 0+0k 0+0io 0pf+0w
>> Summary of lapw2para:
>> c1208-ib user=1016.52 wallclock=1031.1
>> c1201-ib user=1017.36 wallclock=1031.82
>> c1180-ib user=1033.06 wallclock=1047.07
>> c1179-ib user=1037.55 wallclock=1051.5
>> c1178-ib user=1019.16 wallclock=1033.49
>> c1177-ib user=1021.88 wallclock=1036.07
>> c1171-ib user=1032.42 wallclock=1046.7
>> c0844-ib user=1022.32 wallclock=1036.81
>> 31.923u 13.526s 18:20.12 4.1% 0+0k 0+0io 0pf+0w
>>
>> > lapw2 -dn -p -c (19:51:10) running LAPW2 in parallel mode
>> c1208-ib 947.942u 13.364s 16:01.75 99.9% 0+0k 0+0io 0pf+0w
>> c1201-ib 932.766u 13.640s 15:49.22 99.7% 0+0k 0+0io 0pf+0w
>> c1180-ib 932.474u 13.609s 15:47.76 99.8% 0+0k 0+0io 0pf+0w
>> c1179-ib 936.171u 13.691s 15:50.33 99.9% 0+0k 0+0io 0pf+0w
>> c1178-ib 947.798u 13.493s 16:04.99 99.6% 0+0k 0+0io 0pf+0w
>> c1177-ib 947.786u 13.350s 16:04.89 99.6% 0+0k 0+0io 0pf+0w
>> c1171-ib 930.971u 13.874s 15:45.22 99.9% 0+0k 0+0io 0pf+0w
>> c0844-ib 950.723u 13.426s 16:04.69 99.9% 0+0k 0+0io 0pf+0w
>> Summary of lapw2para:
>> c1208-ib user=947.942 wallclock=961.75
>> c1201-ib user=932.766 wallclock=949.22
>> c1180-ib user=932.474 wallclock=947.76
>> c1179-ib user=936.171 wallclock=950.33
>> c1178-ib user=947.798 wallclock=964.99
>> c1177-ib user=947.786 wallclock=964.89
>> c1171-ib user=930.971 wallclock=945.22
>> c0844-ib user=950.723 wallclock=964.69
>> 31.522u 13.879s 16:53.13 4.4% 0+0k 0+0io 0pf+0w
>> >   lcore -up (20:08:03) 2.993u 0.587s 0:03.75 95.2% 0+0k 0+0io 0pf+0w
>> >   lcore -dn (20:08:07) 2.843u 0.687s 0:03.66 96.1% 0+0k 0+0io 0pf+0w
>> >   mixer (20:08:21) 23.206u 32.513s 0:56.63 98.3% 0+0k 0+0io 0pf+0w
>>
>> :ENERGY convergence: 0 0.00001 416.9302585700000000
>> :CHARGE convergence: 0 0.0000 3.6278086
>>
>>
>> On Thu, Oct 17, 2013 at 7:11 AM, Laurence Marks
>> <L-marks at northwestern.edu> wrote:
>>
>> There are so many possibilities, a few:
>>
>> a) If you only request 1 core/node, most queuing systems (qsub/msub
>> etc.) will allocate the other cores to other jobs. You are then going
>> to be very dependent upon what those other jobs are doing. The normal
>> practice is to use all the cores on a given node.
>>
>> b) When you run on cluster B, in addition to a), it is going to be
>> inefficient to run with mpi communication across nodes; it is much
>> better to run on a given node across its cores. Are you using a .machines
>> file with eight "1:nodeA" lines (for instance), or one with a single
>> "1:nodeA nodeB ..." line? The first does not use mpi, the second does.
>> To use mpi within a node you would use lines such as 1:node:8, as in the
>> sketch below. Knowledge of your .machines file will help people assist you.
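>>
>> For instance (a sketch only; node1, node2 stand for whatever names your
>> queuing system hands you, and more nodes just mean more such lines):
>>
>> # k-point parallel, no mpi: one serial lapw1/lapw2 job per node
>> 1:node1
>> 1:node2
>> granularity:1
>> extrafine:1
>>
>> # mpi within each node: one lapw1/lapw2 job spread over 8 cores of that node
>> 1:node1:8
>> 1:node2:8
>> granularity:1
>> extrafine:1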
>>
>> c) The memory on those clusters is very small, whoever bought them was
>> not thinking about large scale jobs. I look for at least 4G/core, and
>> 2G/core is barely acceptable. You are going to have to use mpi.
>>
>> d) All mpi is equal, but some mpi is more equal than others. Depending
>> upon whether you have infiniband, ethernet, openmpi, impi and how
>> everything was compiled you can see enormous differences. One thing to
>> look at is the difference between the cpu time and wall time (both in
>> case.dayfile and at the bottom of case.output1_*). With a good mpi
>> setup the wall time should be 5-10% more than the cpu time; with a bad
>> setup it can be several times it.
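>>
>> A quick way to read that off a run (a sketch; "case" stands for your
>> actual case name):
>>
>> grep -A9 "Summary of lapw1para" case.dayfile   # user (s) vs wallclock per node
>> tail case.output1_1                            # cpu and wall time of lapw1
>>
>> If user time and wallclock agree (CPU close to 100%), the time is genuinely
>> spent computing; if the wallclock is several times larger, the mpi/network
>> setup is the problem.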
>>
>> On Thu, Oct 17, 2013 at 8:44 AM, Yundi Quan
>> <quanyundi at gmail.com> wrote:
>> > Hi,
>> > I have access to two clusters as a low-level user. One cluster (cluster A)
>> > consists of nodes with 8 cores and 8 G memory per node. The other cluster
>> > (cluster B) has 24 G memory per node and each node has 14 cores or more.
>> > The cores on cluster A are Xeon CPU E5620 at 2.40GHz, while the cores on
>> > cluster B are Xeon CPU X5550 at 2.67GHz. From the specifications
>> > (2.40GHz + 12288 KB cache vs 2.67GHz + 8192 KB cache), the two machines
>> > should be very close in performance. But it does not seem to be so.
>> >
>> > I have a job with 72 atoms per unit cell. I initialized the job on
>> > cluster A and ran it for a few iterations. Each iteration took 2 hours.
>> > Then I moved the job to cluster B (14 cores per node at 2.67GHz). Now it
>> > takes more than 8 hours to finish one iteration. On both clusters, I
>> > request one core per node and 8 nodes per job (8 is the number of k
>> > points). I compiled WIEN2k_13 on cluster A without mpi. On cluster B,
>> > WIEN2k_12 was compiled by the administrator with mpi.
>> >
>> > What could have caused the poor performance of cluster B? Is it
>> > because of MPI?
>> >
>> > An unrelated question: sometimes memory runs out on cluster B, which
>> > has 24 G memory per node, yet the same job runs smoothly on cluster A,
>> > which only has 8 G per node.
>> >
>> > Thanks.
>>
>>
>>
>> --
>> Professor Laurence Marks
>> Department of Materials Science and Engineering
>> Northwestern University
>> www.numis.northwestern.edu
>> 1-847-491-3996
>>
>> "Research is to see what everybody else has seen, and to think what
>> nobody else has thought"
>> Albert Szent-Gyorgi
>>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
> Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------
>