[Wien] LAPW1 crash on cycle 4

Laurence Marks L-marks at northwestern.edu
Mon May 28 18:24:22 CEST 2012


N.B., nmon is probably not installed by default, so you would have to
download and install it first.

On Mon, May 28, 2012 at 11:23 AM, Laurence Marks
<L-marks at northwestern.edu> wrote:
> At least with top you can look at the %CPU column. A process showing
> ~100% (in top; hopefully htop is similar) is using 1 thread, while one
> using 8 threads shows around 800%.
>
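As a rough illustration of the rule of thumb above, the following Python sketch converts a top-style %CPU reading into an approximate count of busy threads and, on Linux, reads the thread count the kernel reports in /proc/<pid>/status. The function names are hypothetical, and this is a sketch rather than a tested diagnostic.

```python
import re
from pathlib import Path


def busy_threads_from_cpu(cpu_percent):
    """Approximate number of fully busy threads from a top-style %CPU value."""
    return max(0, round(cpu_percent / 100.0))


def parse_thread_count(status_text):
    """Extract the 'Threads:' field from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)", status_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Threads:' line found")
    return int(match.group(1))


def thread_count(pid):
    """Read the kernel's thread count for a running process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())
```

Here busy_threads_from_cpu(800) gives 8, and thread_count(pid) would report the threads owned by a running lapw1 process. The two numbers answer different questions: a process can own 8 threads while keeping only one of them busy.
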
> Do you have ganglia installed? If you do, it shows swap nicely, along
> with several other things. If not, try "man -k swap", although that is
> probably not useful. A colleague swears by nmon, which in principle can
> show you everything useful, but I've never used it. It might be the
> fastest diagnostic, and is probably easy to install. (If you use it and
> it is useful, let me and others know.)
>
> N.B., I assume there was nothing useful in the system logs.
>
> On Mon, May 28, 2012 at 9:55 AM, Marcelo Barbosa
> <marcelo.b.barbosa at gmail.com> wrote:
>> Thank you very much for your answer.
>>
>> I used htop to check the system while it was running, and the swap (2 GB
>> available according to htop) only uses ~40 MB during the calculation. The
>> memory shows ~2 GB in use. Is there any chance that a process is swapping
>> in & out of memory without htop showing it?
>>
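One way to check this on Linux, sketched under the assumption that /proc/vmstat is readable: the kernel keeps cumulative pswpin and pswpout counters (pages swapped in and out since boot), so sampling them twice shows swap traffic even when the amount of swap in use stays small. The helper names and the 5-second interval are illustrative choices.

```python
import time
from pathlib import Path


def parse_swap_counters(vmstat_text):
    """Return the pswpin/pswpout counters from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two samples."""
    return {key: after[key] - before[key] for key in ("pswpin", "pswpout")}


def sample_swap_activity(interval=5.0, path="/proc/vmstat"):
    """Sample the counters twice, `interval` seconds apart (Linux only)."""
    before = parse_swap_counters(Path(path).read_text())
    time.sleep(interval)
    after = parse_swap_counters(Path(path).read_text())
    return swap_delta(before, after)
```

Calling sample_swap_activity() repeatedly while lapw1 runs and seeing both values stay at zero would suggest the jobs are not being swapped in and out; steadily growing values would point the other way.
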
>> I'm using the MKL installed with the Intel compiler version 11.1. How can I
>> check whether each lapw1 is trying to use 8 threads?
>> I'm only using half of the available threads, so I thought I would be
>> safe...
>>
>> Cheers,
>> Marcelo Barbosa
>>
>> On May 14, 2012, at 2:19 PM, Laurence Marks wrote:
>>
>> I suspect that nobody will be able to be very specific, beyond the obvious
>> statement that you are overloading the computer. While you may only be using
>> 2 GB for the Wien2k jobs, the OS needs some memory too, so you may well be
>> running out of memory. Did you check the swap space usage, and look to see
>> whether the processes are swapping in & out of memory?
>>
>> Also, if you are using MKL then each lapw1 task may be trying to use 8
>> threads. Depending upon how new the computer is, hyperthreading may or may
>> not be efficient.
>>
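To make the oversubscription concern concrete, here is a small Python sketch of the arithmetic, plus one common way to cap threading. MKL honours the MKL_NUM_THREADS and OMP_NUM_THREADS environment variables; whether each lapw1 really starts 8 threads on this machine is an assumption, and the helper names are hypothetical.

```python
import os


def total_threads(parallel_jobs, threads_per_job):
    """Threads requested if every job starts the same number of threads."""
    return parallel_jobs * threads_per_job


def is_oversubscribed(parallel_jobs, threads_per_job, hardware_threads):
    """True when the jobs together ask for more threads than the machine has."""
    return total_threads(parallel_jobs, threads_per_job) > hardware_threads


def single_threaded_env(base_env=None):
    """Copy of an environment with MKL/OpenMP limited to one thread per process."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    env["MKL_NUM_THREADS"] = "1"
    return env
```

With the numbers from this thread, 4 jobs at 8 threads each would request 32 threads on 8 hardware threads, whereas 4 single-threaded jobs fit. The same two variables can also be exported in the shell before starting run_lapw; for jobs launched through ssh they may need to go in the shell startup file instead.
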
>> I suggest looking in the system logs, as they might have some information,
>> and using fewer tasks in parallel, e.g. 2. (And/or get more memory.)
>>
>> On Mon, May 14, 2012 at 6:46 AM, Marcelo Barbosa
>> <marcelo.b.barbosa at gmail.com> wrote:
>>>
>>> Hello to you all
>>>
>>> I'm trying to run a structure made of 16 atoms using 100 k-points
>>> (resulting in 12 k-points in the irreducible Brillouin zone) on a machine
>>> with 4 cores with hyper-threading, thus 8 threads available, and 4 GB of RAM.
>>>
>>> I tried to run "run_lapw -p -fc 1 -NI" using a .machines file:
>>>
>>> 1:localhost
>>> 1:localhost
>>> 1:localhost
>>> 1:localhost
>>> granularity:1
>>> extrafine:1
>>>
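For reference, a .machines file of the simple form shown above can be read mechanically. The sketch below is a hypothetical Python helper that only understands the two line shapes in this message (weight:host entries and name:value options); it ignores anything else a real .machines file may contain.

```python
def parse_machines(text):
    """Split a simple .machines file into k-point job entries and options."""
    jobs, options = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        # Lines starting with '#' are treated as comments here (an assumption).
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.isdigit():
            # 'weight:host' starts one parallel job on that host.
            jobs.append((int(key), value))
        else:
            # e.g. 'granularity:1' or 'extrafine:1'
            options[key] = value
    return jobs, options


EXAMPLE = """\
1:localhost
1:localhost
1:localhost
1:localhost
granularity:1
extrafine:1
"""

jobs, options = parse_machines(EXAMPLE)
print(len(jobs), "parallel jobs;", options)
```

For this file the helper finds four job entries, which is consistent with the "4 number_of_parallel_jobs" line in the dayfile output.
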
>>> accessing only four threads, but at LAPW1 in cycle 4 I get this in the
>>> *.dayfile
>>>
>>>
>>>    cycle 4     (Fri May 11 20:13:21 WEST 2012)         (37/96 to go)
>>>
>>> >   lapw0 -p    (20:13:21) starting parallel lapw0 at Fri May 11 20:13:21 WEST 2012
>>> -------- .machine0 : processors
>>> running lapw0 in single mode
>>> 37.166u 0.361s 0:37.53 99.9%    0+0k 0+11944io 0pf+0w
>>> :FORCE convergence: 0 1 0 XCO 3.91 YCO 23.9 YCO 2.38 YCO 47.7 ZCO 45.0 ZCO 24.7 YCO 24.7 ZCO 50.6 YCO 4.12 YCO 10.1 ZCO 30.2 ZCO 3.51 YCO 3.27 YCO 5.20 ZCO 8.33 ZCO
>>> >   lapw1  -c -p        (20:14:00) starting parallel lapw1 at Fri May 11 20:14:00 WEST 2012
>>> ->  starting parallel LAPW1 jobs at Fri May 11 20:14:00 WEST 2012
>>> running LAPW1 in parallel mode (using .machines)
>>> 4 number_of_parallel_jobs
>>> [1] 26970
>>> [2] 27037
>>> [3] 27103
>>> [4] 27169
>>> [1]    Done                          ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>>
>>>
>>>
>>> And the computer completely crashes and I have to reboot it.
>>>
>>> Do you have any idea of what might be happening?
>>> I thought it could be a lack of RAM, but until the end of the second
>>> cycle I was monitoring it with htop and it never used more than 2 GB of
>>> RAM, so I left it, thinking there would be no problem.
>>> I used tmux to run this in the background, as I access the machine through
>>> ssh.
>>>
>>> Cheers,
>>> Marcelo Barbosa
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Wien mailing list
>>> Wien at zeus.theochem.tuwien.ac.at
>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>>
>>
>>
>> --
>> Professor Laurence Marks
>> Department of Materials Science and Engineering
>> Northwestern University
>> www.numis.northwestern.edu 1-847-491-3996
>> "Research is to see what everybody else has seen, and to think what nobody
>> else has thought"
>> Albert Szent-Györgyi


