[Wien] Intel(R) Xeon(R) CPU X5550 @ 2.67GHz vs Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

Laurence Marks L-marks at northwestern.edu
Thu Oct 17 17:48:06 CEST 2013


Something is not right. I think I misread your dayfile and in fact mkl
threading is not active. Try something like  env | grep -e MKL . I
suspect that your job is just running on a single core.
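
For example, in a bash shell (a minimal sketch; MKL_NUM_THREADS and
OMP_NUM_THREADS are the standard Intel MKL / OpenMP variables, and 8
stands for the core count of your nodes):

  # list any MKL/OpenMP threading variables set in the job environment
  env | grep -e MKL -e OMP
  # if nothing is set, or the queue pins the process, lapw1 may end up
  # on a single core; to use all 8 cores of a node set, e.g.,
  export OMP_NUM_THREADS=8
  export MKL_NUM_THREADS=8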

On Thu, Oct 17, 2013 at 10:13 AM, Yundi Quan <quanyundi at gmail.com> wrote:
> Sorry that I didn't make it clear. The dayfile was for cluster B. As I said
> before, I always request one core per node and 8 nodes per job (the number
> of k points). I have 72 crystallographically non-equivalent atoms.
>
> On cluster B, I used the following R_LIBS (LAPACK+BLAS) options to compile
> WIEN2k: -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -liomp5
>
>
> Yundi
>
>
> On Thu, Oct 17, 2013 at 7:50 AM, Laurence Marks <L-marks at northwestern.edu>
> wrote:
>>
>> I assume the dayfile was for cluster A, as cpu is about 8x wall, which
>> is about right for the mkl multithreading you are presumably using.
>> You are not using mpi. You may want to compare the wall time against
>> using the following on cluster A:
>>
>> 1:node1:8
>>
>> Depending upon many factors it may be faster or slower. This does mpi
>> only over the bus within a node, not between nodes.
>>
>> Is it 72 unique atoms, or 72 total?
>>
>> My guess is that cluster A is about right. You can make it faster by
>> using iterative diagonalization (-it, or -it -noHinv) and perhaps by
>> reducing RKMAX -- you don't say what your RMTs are.
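>>
>> For example (a sketch only -- runsp_lapw because your dayfile shows
>> -up/-dn runs; the switches are the standard run(sp)_lapw ones):
>>
>>   # continue the scf cycle with iterative diagonalization
>>   runsp_lapw -p -it -noHinv
>>
>> RKMAX is the first number on the second line of case.in1; lowering it
>> shrinks the basis, which grows roughly as RKMAX^3.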
>>
>> For cluster B what blas/lapack are you using? Does it really have that
>> many cores/node or is it using hyperthreading (which does not really
>> give you much)? How is your NFS structured -- good communications or
>> just slow ethernet?
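>>
>> You can check the core/hyperthreading question on a compute node with
>> standard Linux tools, e.g.:
>>
>>   # "Thread(s) per core: 2" means hyperthreading is on, i.e. half of
>>   # the advertised cores are virtual
>>   lscpu | grep -i -e 'thread' -e 'core' -e 'socket'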
>>
>>
>> On Thu, Oct 17, 2013 at 9:33 AM, Yundi Quan <quan at ms.physics.ucdavis.edu>
>> wrote:
>> > Thanks for your reply.
>> > a). Both clusters are set up in such a way that once a node is
>> > assigned to a job, it cannot be assigned to another.
>> > b). The .machines file looks like this:
>> > 1:node1
>> > 1:node2
>> > 1:node3
>> > 1:node4
>> > 1:node5
>> > 1:node6
>> > 1:node7
>> > 1:node8
>> > granularity:1
>> > extrafine:1
>> > lapw2_vector_split:1
>> >
>> > I've been trying to avoid using mpi because sometimes mpi can slow down
>> > my calculations due to poor communication between nodes.
>> >
>> > c). The amount of memory available per core does not seem to be the
>> > problem in my case, because my job runs smoothly on cluster A (where
>> > each node has 8 GB of memory and 8 cores), yet it runs into memory
>> > problems on cluster B, where each core has much more memory available.
>> > I wonder whether there are parameters I should change in WIEN2k to
>> > reduce the memory usage.
>> >
>> > d). My dayfile for a single iteration looks like this; the wallclocks
>> > are around 500 minutes.
>> >
>> >
>> >     cycle 1 (Fri Oct 11 02:14:05 PDT 2013) (40/99 to go)
>> >
>> >>   lapw0 -p (02:14:05) starting parallel lapw0 at Fri Oct 11 02:14:06
>> >> PDT
>> >> 2013
>> > -------- .machine0 : processors
>> > running lapw0 in single mode
>> > 1431.414u 22.267s 24:14.84 99.9% 0+0k 0+0io 0pf+0w
>> >>   lapw1  -up -p    -c (02:38:20) starting parallel lapw1 at Fri Oct 11
>> >> 02:38:20 PDT 2013
>> > ->  starting parallel LAPW1 jobs at Fri Oct 11 02:38:21 PDT 2013
>> > running LAPW1 in parallel mode (using .machines)
>> > 8 number_of_parallel_jobs
>> >      c1208-ib(1) 26558.265u 17.956s 7:34:14.39 97.5% 0+0k 0+0io 0pf+0w
>> >      c1201-ib(1) 26845.212u 15.496s 7:39:59.37 97.3% 0+0k 0+0io 0pf+0w
>> >      c1180-ib(1) 25872.609u 18.143s 7:23:53.43 97.2% 0+0k 0+0io 0pf+0w
>> >      c1179-ib(1) 26040.482u 17.868s 7:26:38.66 97.2% 0+0k 0+0io 0pf+0w
>> >      c1178-ib(1) 26571.271u 17.946s 7:34:16.23 97.5% 0+0k 0+0io 0pf+0w
>> >      c1177-ib(1) 27108.070u 34.294s 8:32:55.53 88.1% 0+0k 0+0io 0pf+0w
>> >      c1171-ib(1) 26729.399u 14.175s 7:36:22.67 97.6% 0+0k 0+0io 0pf+0w
>> >      c0844-ib(1) 25883.863u 47.148s 8:12:35.54 87.7% 0+0k 0+0io 0pf+0w
>> >    Summary of lapw1para:
>> >    c1208-ib k=1 user=26558.3 wallclock=454
>> >    c1201-ib k=1 user=26845.2 wallclock=459
>> >    c1180-ib k=1 user=25872.6 wallclock=443
>> >    c1179-ib k=1 user=26040.5 wallclock=446
>> >    c1178-ib k=1 user=26571.3 wallclock=454
>> >    c1177-ib k=1 user=27108.1 wallclock=512
>> >    c1171-ib k=1 user=26729.4 wallclock=456
>> >    c0844-ib k=1 user=25883.9 wallclock=492
>> > 97.935u 34.265s 8:32:58.38 0.4% 0+0k 0+0io 0pf+0w
>> >>   lapw1  -dn -p    -c (11:11:19) starting parallel lapw1 at Fri Oct 11
>> >> 11:11:19 PDT 2013
>> > ->  starting parallel LAPW1 jobs at Fri Oct 11 11:11:19 PDT 2013
>> > running LAPW1 in parallel mode (using .machines.help)
>> > 8 number_of_parallel_jobs
>> >      c1208-ib(1) 26474.686u 16.142s 7:33:36.01 97.3% 0+0k 0+0io 0pf+0w
>> >      c1201-ib(1) 26099.149u 40.330s 8:04:42.58 89.8% 0+0k 0+0io 0pf+0w
>> >      c1180-ib(1) 26809.287u 14.724s 7:38:56.52 97.4% 0+0k 0+0io 0pf+0w
>> >      c1179-ib(1) 26007.527u 17.959s 7:26:10.62 97.2% 0+0k 0+0io 0pf+0w
>> >      c1178-ib(1) 26565.723u 17.576s 7:35:20.11 97.3% 0+0k 0+0io 0pf+0w
>> >      c1177-ib(1) 27114.619u 31.180s 8:21:28.34 90.2% 0+0k 0+0io 0pf+0w
>> >      c1171-ib(1) 26474.665u 15.309s 7:33:38.15 97.3% 0+0k 0+0io 0pf+0w
>> >      c0844-ib(1) 26586.569u 15.010s 7:35:22.88 97.3% 0+0k 0+0io 0pf+0w
>> >    Summary of lapw1para:
>> >    c1208-ib k=1 user=26474.7 wallclock=453
>> >    c1201-ib k=1 user=26099.1 wallclock=484
>> >    c1180-ib k=1 user=26809.3 wallclock=458
>> >    c1179-ib k=1 user=26007.5 wallclock=446
>> >    c1178-ib k=1 user=26565.7 wallclock=455
>> >    c1177-ib k=1 user=27114.6 wallclock=501
>> >    c1171-ib k=1 user=26474.7 wallclock=453
>> >    c0844-ib k=1 user=26586.6 wallclock=455
>> > 104.607u 18.798s 8:21:30.92 0.4% 0+0k 0+0io 0pf+0w
>> >>   lapw2 -up -p   -c (19:32:50) running LAPW2 in parallel mode
>> >       c1208-ib 1016.517u 13.674s 17:11.10 99.9% 0+0k 0+0io 0pf+0w
>> >       c1201-ib 1017.359u 13.669s 17:11.82 99.9% 0+0k 0+0io 0pf+0w
>> >       c1180-ib 1033.056u 13.283s 17:27.07 99.9% 0+0k 0+0io 0pf+0w
>> >       c1179-ib 1037.551u 13.447s 17:31.50 99.9% 0+0k 0+0io 0pf+0w
>> >       c1178-ib 1019.156u 13.729s 17:13.49 99.9% 0+0k 0+0io 0pf+0w
>> >       c1177-ib 1021.878u 13.731s 17:16.07 99.9% 0+0k 0+0io 0pf+0w
>> >       c1171-ib 1032.417u 13.681s 17:26.70 99.9% 0+0k 0+0io 0pf+0w
>> >       c0844-ib 1022.315u 13.870s 17:16.81 99.9% 0+0k 0+0io 0pf+0w
>> >    Summary of lapw2para:
>> >    c1208-ib user=1016.52 wallclock=1031.1
>> >    c1201-ib user=1017.36 wallclock=1031.82
>> >    c1180-ib user=1033.06 wallclock=1047.07
>> >    c1179-ib user=1037.55 wallclock=1051.5
>> >    c1178-ib user=1019.16 wallclock=1033.49
>> >    c1177-ib user=1021.88 wallclock=1036.07
>> >    c1171-ib user=1032.42 wallclock=1046.7
>> >    c0844-ib user=1022.32 wallclock=1036.81
>> > 31.923u 13.526s 18:20.12 4.1% 0+0k 0+0io 0pf+0w
>> >>   lapw2 -dn -p   -c (19:51:10) running LAPW2 in parallel mode
>> >       c1208-ib 947.942u 13.364s 16:01.75 99.9% 0+0k 0+0io 0pf+0w
>> >       c1201-ib 932.766u 13.640s 15:49.22 99.7% 0+0k 0+0io 0pf+0w
>> >       c1180-ib 932.474u 13.609s 15:47.76 99.8% 0+0k 0+0io 0pf+0w
>> >       c1179-ib 936.171u 13.691s 15:50.33 99.9% 0+0k 0+0io 0pf+0w
>> >       c1178-ib 947.798u 13.493s 16:04.99 99.6% 0+0k 0+0io 0pf+0w
>> >       c1177-ib 947.786u 13.350s 16:04.89 99.6% 0+0k 0+0io 0pf+0w
>> >       c1171-ib 930.971u 13.874s 15:45.22 99.9% 0+0k 0+0io 0pf+0w
>> >       c0844-ib 950.723u 13.426s 16:04.69 99.9% 0+0k 0+0io 0pf+0w
>> >    Summary of lapw2para:
>> >    c1208-ib user=947.942 wallclock=961.75
>> >    c1201-ib user=932.766 wallclock=949.22
>> >    c1180-ib user=932.474 wallclock=947.76
>> >    c1179-ib user=936.171 wallclock=950.33
>> >    c1178-ib user=947.798 wallclock=964.99
>> >    c1177-ib user=947.786 wallclock=964.89
>> >    c1171-ib user=930.971 wallclock=945.22
>> >    c0844-ib user=950.723 wallclock=964.69
>> > 31.522u 13.879s 16:53.13 4.4% 0+0k 0+0io 0pf+0w
>> >>   lcore -up (20:08:03) 2.993u 0.587s 0:03.75 95.2% 0+0k 0+0io 0pf+0w
>> >>   lcore -dn (20:08:07) 2.843u 0.687s 0:03.66 96.1% 0+0k 0+0io 0pf+0w
>> >>   mixer   (20:08:21) 23.206u 32.513s 0:56.63 98.3% 0+0k 0+0io 0pf+0w
>> > :ENERGY convergence:  0 0.00001 416.9302585700000000
>> > :CHARGE convergence:  0 0.0000 3.6278086
>> >
>> >
>> > On Thu, Oct 17, 2013 at 7:11 AM, Laurence Marks
>> > <L-marks at northwestern.edu>
>> > wrote:
>> >>
>> >> There are so many possibilities, a few:
>> >>
>> >> a) If you only request 1 core/node, most queuing systems (qsub/msub,
>> >> etc.) will allocate the other cores to other jobs. You are then very
>> >> dependent upon what those other jobs are doing. The normal practice is
>> >> to use all the cores on a given node, e.g. as sketched below.
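>> >>
>> >> For example, with PBS/Torque-style syntax (a sketch only; the exact
>> >> request depends on your scheduler):
>> >>
>> >>   # request 8 whole nodes, all 8 cores on each, not 1 core per node
>> >>   #PBS -l nodes=8:ppn=8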
>> >>
>> >> b) When you run on cluster B, in addition to a), it is inefficient
>> >> to run with mpi communications across nodes; it is much better to run
>> >> across the cores of a single node. Are you using a .machines file with
>> >> eight 1:nodeA lines (for instance), or one with a single 1:nodeA nodeB
>> >> ... line? The first does not use mpi, the second does. To use mpi
>> >> within one node you would use lines such as 1:node:8; see the sketches
>> >> below. Knowledge of your .machines file will help people assist you.
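>> >>
>> >> For example (sketches only; nodeA etc. stand for whatever hostnames
>> >> your queue assigns):
>> >>
>> >>   # eight lines like this: k-point parallel, no mpi
>> >>   1:nodeA
>> >>   # a single line like this: one mpi job spanning several nodes
>> >>   1:nodeA nodeB nodeC nodeD
>> >>   # mpi within one node, using all 8 of its cores
>> >>   1:nodeA:8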
>> >>
>> >> c) The memory on those clusters is very small; whoever bought them
>> >> was not thinking about large-scale jobs. I look for at least 4 GB/core,
>> >> and 2 GB/core is barely acceptable. You are going to have to use mpi.
>> >>
>> >> d) All mpi is equal, but some mpi is more equal than others.
>> >> Depending upon whether you have infiniband or ethernet, openmpi or
>> >> impi, and how everything was compiled, you can see enormous
>> >> differences. One thing to look at is the difference between the cpu
>> >> time and the wall time (both in case.dayfile and at the bottom of
>> >> case.output1_*). With a good mpi setup the wall time should be 5-10%
>> >> more than the cpu time; with a bad setup it can be several times the
>> >> cpu time.
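>> >>
>> >> A quick way to look at this (the exact labels vary between WIEN2k
>> >> versions, so just eyeball the timing lines):
>> >>
>> >>   # timing summaries at the end of each lapw1 output file
>> >>   grep -i time case.output1_*
>> >>   # per-node user (cpu) and wall times for the last cycles
>> >>   tail -40 case.dayfile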
>> >>
>> >> On Thu, Oct 17, 2013 at 8:44 AM, Yundi Quan <quanyundi at gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> > I have access to two clusters as a low-level user. One cluster
>> >> > (cluster A) consists of nodes with 8 cores and 8 GB of memory per
>> >> > node. The other cluster (cluster B) has 24 GB of memory per node,
>> >> > and each of its nodes has 14 cores or more. The cores on cluster A
>> >> > are Xeon CPU E5620 at 2.40GHz, while the cores on cluster B are
>> >> > Xeon CPU X5550 at 2.67GHz. From the specifications (2.40GHz + 12288
>> >> > KB cache vs 2.67GHz + 8192 KB cache), the two machines should be
>> >> > very close in performance. But that does not seem to be the case.
>> >> >
>> >> > I have a job with 72 atoms per unit cell. I initialized the job on
>> >> > cluster A and ran it for a few iterations; each iteration took 2
>> >> > hours. Then I moved the job to cluster B (14 cores per node at
>> >> > 2.67GHz). Now it takes more than 8 hours to finish one iteration. On
>> >> > both clusters, I request one core per node and 8 nodes per job (8 is
>> >> > the number of k points). I compiled WIEN2k_13 on cluster A without
>> >> > mpi. On cluster B, WIEN2k_12 was compiled by the administrator with
>> >> > mpi.
>> >> >
>> >> > What could have caused the poor performance on cluster B? Is it
>> >> > because of MPI?
>> >> >
>> >> > An unrelated question: sometimes memory runs out on cluster B,
>> >> > which has 24 GB of memory per node, even though the same job runs
>> >> > smoothly on cluster A, which has only 8 GB per node.
>> >> >
>> >> > Thanks.
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
>
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgi

