[Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

Laurence Marks L-marks at northwestern.edu
Thu Oct 17 16:50:08 CEST 2013


I assume the dayfile was for cluster A, as the wall time is about 8x the
cpu time, which is about right for the mkl multithreading you are
presumably using. You are not using mpi. You may want to compare the
wall time against a run on cluster A that uses

1:node1:8

Depending upon many factors this may be faster or slower; it only does
mpi over the bus within the node, not between nodes.
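As a sketch (node1 is just a placeholder for whatever hostname your
queue assigns, and the last three lines are simply carried over from
your current file), the .machines file for that test could look like:

# one mpi job over all 8 cores of a single node; all k-points are
# then processed by this one job
1:node1:8
granularity:1
extrafine:1
lapw2_vector_split:1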

Is it 72 unique atoms, or 72 total?

My guess is that the cluster A timing is about right. You can make it
faster by using iterative diagonalization (-it, or -it -noHinv) and
perhaps by reducing RKMAX -- you don't say what your RMTs are.
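For example (a sketch, assuming the standard run scripts; your dayfile
shows -up/-dn steps, so the spin-polarized one):

# restart the scf cycle with iterative diagonalization
runsp_lapw -p -it -noHinv
# RKMAX is the first number on the second line of case.in1 (case.in1c
# for a complex case); lowering it reduces the basis size and hence
# memory and time, at some cost in accuracy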

For cluster B, what blas/lapack are you using? Does it really have that
many cores per node, or is it counting hyperthreads (which do not
really gain you much)? How is your NFS structured -- good
communications, or just slow ethernet?


On Thu, Oct 17, 2013 at 9:33 AM, Yundi Quan <quan at ms.physics.ucdavis.edu> wrote:
> Thanks for your reply.
> a) Both machines are set up in such a way that once a node is assigned to a
> job, it cannot be assigned to another.
> b) The .machines file looks like this:
> 1:node1
> 1:node2
> 1:node3
> 1:node4
> 1:node5
> 1:node6
> 1:node7
> 1:node8
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> I've been trying to avoid using mpi because it can sometimes slow down my
> calculations due to poor communication between nodes.
>
> c) The amount of memory available per core does not seem to be the problem
> in my case, because my job runs smoothly on cluster A, where each node has
> 8 GB of memory and 8 cores. But the job runs into memory problems on cluster B,
> where each core has much more memory available. I wonder whether there are
> parameters I should change in WIEN2k to reduce the memory usage.
>
> d) My dayfile for a single iteration looks like this; the wallclock values
> are around 500.
>
>
>     cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)
>
>>   lapw0 -p (02:14:05) starting parallel lapw0 at Fri Oct 11 02:14:06 PDT
>> 2013
> -------- .machine0 : processors
> running lapw0 in single mode
> 1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
>>   lapw1  -up -p    -c (02:38:20) starting parallel lapw1 at Fri Oct 11
>> 02:38:20 PDT 2013
> ->  starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
> running LAPW1 in parallel mode (using .machines)
> 8 number_of_parallel_jobs
>      c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>      c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3% 0+0k 0+0io 0pf+0w
>      c1180-ib(1) 25872.609u 18.143s 7:23:53.43 97.2% 0+0k 0+0io 0pf+0w
>      c1179-ib(1) 26040.482u 17.868s 7:26:38.66 97.2% 0+0k 0+0io 0pf+0w
>      c1178-ib(1) 26571.271u 17.946s 7:34:16.23 97.5% 0+0k 0+0io 0pf+0w
>      c1177-ib(1) 27108.070u 34.294s 8:32:55.53 88.1% 0+0k 0+0io 0pf+0w
>      c1171-ib(1) 26729.399u 14.175s 7:36:22.67 97.6% 0+0k 0+0io 0pf+0w
>      c0844-ib(1) 25883.863u 47.148s 8:12:35.54 87.7% 0+0k 0+0io 0pf+0w
>    Summary of lapw1para:
>    c1208-ib k=1 user=26558.3 wallclock=454
>    c1201-ib k=1 user=26845.2 wallclock=459
>    c1180-ib k=1 user=25872.6 wallclock=443
>    c1179-ib k=1 user=26040.5 wallclock=446
>    c1178-ib k=1 user=26571.3 wallclock=454
>    c1177-ib k=1 user=27108.1 wallclock=512
>    c1171-ib k=1 user=26729.4 wallclock=456
>    c0844-ib k=1 user=25883.9 wallclock=492
> 97.935u 34.265s 8:32:58.38 0.4% 0+0k 0+0io 0pf+0w
>>   lapw1  -dn -p    -c (11:11:19) starting parallel lapw1 at Fri Oct 11
>> 11:11:19 PDT 2013
> ->  starting parallel LAPW1 jobs at Fri Oct 11 11:11:19 PDT 2013
> running LAPW1 in parallel mode (using .machines.help)
> 8 number_of_parallel_jobs
>      c1208-ib(1) 26474.686u 16.142s 7:33:36.01 97.3% 0+0k 0+0io 0pf+0w
>      c1201-ib(1) 26099.149u 40.330s 8:04:42.58 89.8% 0+0k 0+0io 0pf+0w
>      c1180-ib(1) 26809.287u 14.724s 7:38:56.52 97.4% 0+0k 0+0io 0pf+0w
>      c1179-ib(1) 26007.527u 17.959s 7:26:10.62 97.2% 0+0k 0+0io 0pf+0w
>      c1178-ib(1) 26565.723u 17.576s 7:35:20.11 97.3% 0+0k 0+0io 0pf+0w
>      c1177-ib(1) 27114.619u 31.180s 8:21:28.34 90.2% 0+0k 0+0io 0pf+0w
>      c1171-ib(1) 26474.665u 15.309s 7:33:38.15 97.3% 0+0k 0+0io 0pf+0w
>      c0844-ib(1) 26586.569u 15.010s 7:35:22.88 97.3% 0+0k 0+0io 0pf+0w
>    Summary of lapw1para:
>    c1208-ib k=1 user=26474.7 wallclock=453
>    c1201-ib k=1 user=26099.1 wallclock=484
>    c1180-ib k=1 user=26809.3 wallclock=458
>    c1179-ib k=1 user=26007.5 wallclock=446
>    c1178-ib k=1 user=26565.7 wallclock=455
>    c1177-ib k=1 user=27114.6 wallclock=501
>    c1171-ib k=1 user=26474.7 wallclock=453
>    c0844-ib k=1 user=26586.6 wallclock=455
> 104.607u 18.798s 8:21:30.92 0.4% 0+0k 0+0io 0pf+0w
>>   lapw2 -up -p   -c (19:32:50) running LAPW2 in parallel mode
>       c1208-ib 1016.517u 13.674s 17:11.10 99.9% 0+0k 0+0io 0pf+0w
>       c1201-ib 1017.359u 13.669s 17:11.82 99.9% 0+0k 0+0io 0pf+0w
>       c1180-ib 1033.056u 13.283s 17:27.07 99.9% 0+0k 0+0io 0pf+0w
>       c1179-ib 1037.551u 13.447s 17:31.50 99.9% 0+0k 0+0io 0pf+0w
>       c1178-ib 1019.156u 13.729s 17:13.49 99.9% 0+0k 0+0io 0pf+0w
>       c1177-ib 1021.878u 13.731s 17:16.07 99.9% 0+0k 0+0io 0pf+0w
>       c1171-ib 1032.417u 13.681s 17:26.70 99.9% 0+0k 0+0io 0pf+0w
>       c0844-ib 1022.315u 13.870s 17:16.81 99.9% 0+0k 0+0io 0pf+0w
>    Summary of lapw2para:
>    c1208-ib user=1016.52 wallclock=1031.1
>    c1201-ib user=1017.36 wallclock=1031.82
>    c1180-ib user=1033.06 wallclock=1047.07
>    c1179-ib user=1037.55 wallclock=1051.5
>    c1178-ib user=1019.16 wallclock=1033.49
>    c1177-ib user=1021.88 wallclock=1036.07
>    c1171-ib user=1032.42 wallclock=1046.7
>    c0844-ib user=1022.32 wallclock=1036.81
> 31.923u 13.526s 18:20.12 4.1% 0+0k 0+0io 0pf+0w
>>   lapw2 -dn -p   -c (19:51:10) running LAPW2 in parallel mode
>       c1208-ib 947.942u 13.364s 16:01.75 99.9% 0+0k 0+0io 0pf+0w
>       c1201-ib 932.766u 13.640s 15:49.22 99.7% 0+0k 0+0io 0pf+0w
>       c1180-ib 932.474u 13.609s 15:47.76 99.8% 0+0k 0+0io 0pf+0w
>       c1179-ib 936.171u 13.691s 15:50.33 99.9% 0+0k 0+0io 0pf+0w
>       c1178-ib 947.798u 13.493s 16:04.99 99.6% 0+0k 0+0io 0pf+0w
>       c1177-ib 947.786u 13.350s 16:04.89 99.6% 0+0k 0+0io 0pf+0w
>       c1171-ib 930.971u 13.874s 15:45.22 99.9% 0+0k 0+0io 0pf+0w
>       c0844-ib 950.723u 13.426s 16:04.69 99.9% 0+0k 0+0io 0pf+0w
>    Summary of lapw2para:
>    c1208-ib user=947.942 wallclock=961.75
>    c1201-ib user=932.766 wallclock=949.22
>    c1180-ib user=932.474 wallclock=947.76
>    c1179-ib user=936.171 wallclock=950.33
>    c1178-ib user=947.798 wallclock=964.99
>    c1177-ib user=947.786 wallclock=964.89
>    c1171-ib user=930.971 wallclock=945.22
>    c0844-ib user=950.723 wallclock=964.69
> 31.522u 13.879s 16:53.13 4.4% 0+0k 0+0io 0pf+0w
>>   lcore -up (20:08:03) 2.993u 0.587s 0:03.75 95.2% 0+0k 0+0io 0pf+0w
>>   lcore -dn (20:08:07) 2.843u 0.687s 0:03.66 96.1% 0+0k 0+0io 0pf+0w
>>   mixer   (20:08:21) 23.206u 32.513s 0:56.63 98.3% 0+0k 0+0io 0pf+0w
> :ENERGY convergence:  0 0.00001 416.9302585700000000
> :CHARGE convergence:  0 0.0000 3.6278086
>
>
> On Thu, Oct 17, 2013 at 7:11 AM, Laurence Marks <L-marks at northwestern.edu>
> wrote:
>>
>> There are so many possibilities, a few:
>>
>> a) If you only request 1 core/node, most queuing systems (qsub/msub
>> etc.) will allocate the other cores to other jobs. You are then going
>> to be very dependent upon what those other jobs are doing. The normal
>> practice is to use all the cores on a given node.
>>
>> b) When you run on cluster B, in addition to a), it is inefficient to
>> run with mpi communication across nodes; it is much better to run
>> across the cores of a given node. Are you using a .machines file with
>> eight 1:nodeA lines (for instance), or one with a single
>> 1:nodeA nodeB.... line? The first does not use mpi, the second does.
>> To use mpi within a node you would use lines such as 1:node:8.
>> Knowledge of your .machines file will help people assist you.
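>> For concreteness, a sketch (nodeA and nodeB are placeholders for the
>> hostnames your queue assigns):
>>
>>   # k-point parallel only, no mpi: one line per node
>>   1:nodeA
>>   1:nodeB
>>
>>   # mpi spread across two nodes for one k-point group (usually slow
>>   # without a fast interconnect)
>>   1:nodeA nodeB
>>
>>   # mpi within a single 8-core node
>>   1:nodeA:8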
>>
>> c) The memory on those clusters is very small; whoever bought them was
>> not thinking about large-scale jobs. I look for at least 4G/core, and
>> 2G/core is barely acceptable. You are going to have to use mpi.
>>
>> d) All mpi is equal, but some mpi is more equal than others. Depending
>> upon whether you have infiniband or ethernet, openmpi or impi, and how
>> everything was compiled, you can see enormous differences. One thing to
>> look at is the difference between the cpu time and the wall time (both
>> in case.dayfile and at the bottom of case.output1_*). With a good mpi
>> setup the wall time should be 5-10% more than the cpu time; with a bad
>> setup it can be several times larger.
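>> As a quick check (just a suggestion; the exact labels differ a bit
>> between versions), something like
>>
>>   grep -i time case.output1_1
>>
>> at the end of a cycle shows both the cpu and the wall-clock totals for
>> that lapw1 job.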
>>
>> On Thu, Oct 17, 2013 at 8:44 AM, Yundi Quan <quanyundi at gmail.com> wrote:
>> > Hi,
>> > I have access to two clusters as a low-level user. One cluster
>> > (cluster A) consists of nodes with 8 cores and 8 GB of memory per
>> > node. The other cluster (cluster B) has 24 GB of memory per node, and
>> > each node has 14 cores or more. The cores on cluster A are Xeon CPU
>> > E5620 at 2.40GHz, while the cores on cluster B are Xeon CPU X5550 at
>> > 2.67GHz. From the specifications (2.40GHz + 12288 KB cache vs
>> > 2.67GHz + 8192 KB cache), the two machines should be very close in
>> > performance. But that does not seem to be the case.
>> >
>> > I have a job with 72 atoms per unit cell. I initialized it on cluster A
>> > and ran it for a few iterations; each iteration took 2 hours. Then I
>> > moved the job to cluster B (14 cores per node at 2.67GHz). Now it takes
>> > more than 8 hours to finish one iteration. On both clusters, I request
>> > one core per node and 8 nodes per job (8 is the number of k-points). I
>> > compiled WIEN2k_13 on cluster A without mpi. On cluster B, WIEN2k_12
>> > was compiled by the administrator with mpi.
>> >
>> > What could have caused the poor performance of cluster B? Is it because
>> > of MPI?
>> >
>> > An unrelated question: sometimes memory runs out on cluster B, which
>> > has 24 GB of memory per node, yet the same job runs smoothly on
>> > cluster A, which only has 8 GB per node.
>> >
>> > Thanks.
>>
>>
>>
>
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgi

