[Wien] 96 atom system runs lapw0 quickly, lapw1c running for days

Wed Apr 1 16:34:16 CEST 2015

One line summary:  In my case the problem was using more memory than
existed, so the nodes were thrashing and not getting anything done.

Greetings,
  FonsPaul recently reported a 96 atom system in which lapw0 ran quickly
but lapw1c had run for days without finishing.  (This was a hybrid
calculation.)  I have had similar situations, with lapw0 running fine but
lapw1(c) getting stuck, for PBE runs of about 50 atoms with small spheres
because the materials contain hydrogen.  In my case the problem was that the
parallel run was asking for more memory on the nodes than they had.  The
easiest way for me to check for that was top.  If things are going well,
each lapw1 gets close to 100% cpu.  (That holds as long as only one thread
is running per core.  On one machine multiple threads were running per core,
which made the output confusing to interpret.)  Another way to check is
something like "vmstat 5 3".  If the last two lines show swapping (nonzero
values in the si and so columns), there is probably a problem.
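
  The vmstat check could be automated roughly as follows.  This is only a
minimal sketch in Python: the function name swapping_detected is made up
for illustration, and it assumes the usual Linux vmstat layout, with a
header row that contains "si" and "so" and a first data row that reports
averages since boot (which is why only the later rows are inspected).

```python
def swapping_detected(vmstat_output: str, skip_first_sample: bool = True) -> bool:
    """Return True if any inspected vmstat sample shows swap-in or swap-out.

    Assumes the standard Linux vmstat column header, which names the
    swap-in and swap-out columns "si" and "so".
    """
    rows = [ln.split() for ln in vmstat_output.splitlines() if ln.strip()]

    # Locate the column-name row (the banner row above it has no "si" token).
    header_idx = next(
        (i for i, cols in enumerate(rows) if "si" in cols and "so" in cols),
        None,
    )
    if header_idx is None:
        raise ValueError("no vmstat header row with 'si' and 'so' columns found")

    header = rows[header_idx]
    si_col, so_col = header.index("si"), header.index("so")

    samples = rows[header_idx + 1:]
    if skip_first_sample:
        # The first data row is the average since boot, not a live sample.
        samples = samples[1:]

    return any(int(r[si_col]) > 0 or int(r[so_col]) > 0 for r in samples)
```

  To use it on a live node, the text could come from something like
subprocess.run(["vmstat", "5", "3"], capture_output=True, text=True).stdout,
run on the node where lapw1 is executing.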

  In my case, I had quite a few k-points, so I just used fewer cores per
node.  (I requested all the cores on each node, but set up .machines to use
only some of them.)  There are other ways to reduce memory usage, though; I
am not saying this is the right one.
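
  The arithmetic behind "fewer cores per node" can be sketched as below.
Everything here is an assumption for illustration: the host names, the
memory figures, the 2 GB reserve and both function names are hypothetical,
and the "1:host" / granularity / extrafine lines reflect my understanding
of the k-point parallel .machines layout, which should be checked against
the WIEN2k user's guide for your version.

```python
def max_procs_per_node(node_mem_gb, mem_per_proc_gb, cores_per_node, reserve_gb=2.0):
    """Largest number of lapw1 processes that fit in a node's memory.

    reserve_gb is memory left for the OS and everything else; the default
    is a guess, not a measured value.
    """
    fit = int((node_mem_gb - reserve_gb) // mem_per_proc_gb)
    return max(0, min(cores_per_node, fit))


def machines_text(nodes, procs_per_node):
    """Build k-point parallel .machines content: one '1:host' line per process."""
    lines = [f"1:{host}" for host in nodes for _ in range(procs_per_node)]
    lines += ["granularity:1", "extrafine:1"]
    return "\n".join(lines) + "\n"


# Hypothetical example: 16-core nodes with 64 GB each, about 10 GB per lapw1.
procs = max_procs_per_node(node_mem_gb=64, mem_per_proc_gb=10, cores_per_node=16)
print(machines_text(["node01", "node02"], procs))
```

  The per-process memory figure has to come from observation, for example
the resident size that top reports for a running lapw1 on a small test.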

Best,
  David

David Olmsted
Assistant Research Engineer
Materials Science and Engineering
210 Hearst Memorial Mining Building
University of California
Berkeley, CA 94720-1760