[Wien] Systematic slowing down of calculations with time

Laurence Marks L-marks at northwestern.edu
Tue Mar 19 16:47:15 CET 2013


I was very lucky: the issue is related to cached memory, and running (as root)

sync; echo 3 > /proc/sys/vm/drop_caches

solved the problem.

(see http://www.hosting.com/support/linux/clear-memory-cache-on-linux-server
& http://www.linuxinsight.com/proc_sys_vm_drop_caches.html )

I have no idea why this occurred, but evidently something (impi, mkl, ...)
is leaving some combination of clean page cache, dentries and inodes
sitting in memory, and this degrades performance.
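
For anyone who wants to check whether a node is in this state, here is a
minimal sketch (not part of the original fix) that reads the relevant
counters from /proc/meminfo. Cached is the page cache and SReclaimable is
reclaimable slab memory, which is where dentries and inodes are accounted.
The function names meminfo_kb and report_caches are made up for
illustration.

```shell
#!/bin/sh
# Print the value (in kB) of one field from a meminfo-style file.
# Usage: meminfo_kb FIELD [FILE]   (FILE defaults to /proc/meminfo)
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/meminfo}"
}

# One-line summary of the counters relevant to drop_caches:
#   MemFree       memory not used for anything
#   Cached        page cache (file data held in memory)
#   SReclaimable  reclaimable slab, which includes dentry and inode caches
# Usage: report_caches [FILE]
report_caches() {
    f="${1:-/proc/meminfo}"
    printf 'MemFree=%s kB Cached=%s kB SReclaimable=%s kB\n' \
        "$(meminfo_kb MemFree "$f")" \
        "$(meminfo_kb Cached "$f")" \
        "$(meminfo_kb SReclaimable "$f")"
}
```

Calling report_caches before and after the drop_caches command should
show how much memory was released. The optional file argument is there
only so that the parsing can be tried on a saved copy of /proc/meminfo.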

I will put an appropriate cron task in; others might want to talk to
their sysadmin if they ever see this.
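
As a rough illustration of what such a cron task could look like, here is
a sketch under stated assumptions, not the script actually installed. It
drops the caches only when Cached exceeds a given percentage of MemTotal.
The 50% default, the --run switch and the function names are all
hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper for a root cron job; the threshold is an assumption.

# Succeed (exit status 0) when Cached exceeds PERCENT of MemTotal.
# Usage: cache_above PERCENT [FILE]   (FILE defaults to /proc/meminfo)
cache_above() {
    awk -v pct="$1" '
        $1 == "MemTotal:" { total = $2 }
        $1 == "Cached:"   { cached = $2 }
        END { exit !(total > 0 && cached * 100 > total * pct) }
    ' "${2:-/proc/meminfo}"
}

# Flush dirty pages to disk, then ask the kernel to drop clean page
# cache, dentries and inodes. Writing to drop_caches requires root.
drop_caches() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# Act only when explicitly asked, so that sourcing this file is harmless.
if [ "${1:-}" = "--run" ] && cache_above "${2:-50}"; then
    drop_caches
fi
```

Saved under some path of your choosing (say /usr/local/sbin/drop_caches_if_full.sh,
a made-up name), it could be called hourly from root's crontab with a line
such as

0 * * * * /usr/local/sbin/drop_caches_if_full.sh --run 50

The sync comes first because drop_caches only releases clean objects.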

On Tue, Mar 19, 2013 at 8:11 AM, Laurence Marks
<L-marks at northwestern.edu> wrote:
> I have a reproducible slowing down of calculations which appears to be
> in lapw1 due to something (a memory leak?) which is going to be hard to
> track down so I welcome suggestions.
>
> I first noticed it when one newish E5-2660 node was systematically
> running at ~1/2 the speed of others, reproducibly. After rebooting it
> went back to running at the same speed as others.
>
> I have now reproduced a systematic slowing down of lapw1 (I cannot see
> anything in lapw2) for a long calculation (-it -noHinv, but I don't
> think this matters). It is shown in the attached plot, with iteration
> on the x axis and time in minutes on the y axis. (The image may get
> shuffled to a link by the listserver software.) Starting from ~7 minutes,
> the slowdown is approximately 8 seconds/iteration. This is a fairly big
> calculation with a matrix size of 45456 and 835m/core (virtual)
> running on 64 cores. There is no indication that this is
> communications related: the slowdown is in CPU time, and WALL time
> remains very close to it.
>
> Obviously recompiling with debug on is not going to be a viable
> approach. A scattershot debugging strategy, for instance trying to add
> calls to release memory after mkl calls, is also going to be very
> painful as we are talking about ~1 day per test. Ideally I would like
> innovative ideas for tracking down why it has gone slow.
>
> Ideas?
>
> For reference, I am using composer_xe_2013.2.146 and Intel impi. I
> don't see this on older E5410 nodes but I have not run enough
> iterations to notice.
>
> N.B., others might want to look in long recent runs to see if they
> also have evidence for this.
>
> --
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu 1-847-491-3996
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Györgyi




