[Wien] LAPW1 crash on cycle 4

Peter Blaha pblaha at theochem.tuwien.ac.at
Tue May 29 07:52:25 CEST 2012


You wrote that your cell contains just 16 atoms? Which atoms?

What are your RMT values? What is your RKmax? Do you have inversion symmetry?

Usually 16-atom cells should not overload your computer, unless you are doing
something very stupid.

I'm wondering why lapw0 took 37 seconds? That seems too much for 16 atoms.

Did you try to run it non-parallel? Or with only 2 parallel tasks?
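
For such a test, something along these lines should do. This is only a sketch: it assumes a bash-like shell (in csh/tcsh use setenv instead of export) and a threaded MKL, and it simply reuses the .machines format from the original post with two of the four task lines removed.

```shell
# Sketch: write a .machines file with only 2 k-point parallel tasks
# (same format as in the original post, two "1:localhost" lines removed).
cat > .machines <<'EOF'
1:localhost
1:localhost
granularity:1
extrafine:1
EOF

# Assumption: MKL is linked in its threaded version. These two standard
# environment variables limit it to one thread per lapw1 process.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

# Quick sanity check: count the parallel task lines in the file.
grep -c '^1:localhost' .machines
```

Then start the job again with run_lapw -p -fc 1 -NI. While lapw1 runs, a single lapw1 process that goes well above 100% CPU in top would indicate that it is using several threads.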

Where are your data? On the same computer, or do you access the directory via NFS?
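
The NFS question can be checked quickly by printing the filesystem type of the case directory. This is a sketch that assumes Linux with GNU df and that it is run inside the case directory; the variable name fstype is just for illustration.

```shell
# Print the filesystem type of the current (case) directory.
# Assumption: GNU df on Linux; -P keeps each entry on one line, -T adds the type column.
fstype=$(df -PT . | awk 'NR==2 {print $2}')
echo "filesystem type: $fstype"

# "nfs" or "nfs4" means the directory is accessed over the network.
case "$fstype" in
  nfs*) echo "directory is on NFS" ;;
  *)    echo "directory is not on NFS" ;;
esac
```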


On 28.05.2012 16:55, Marcelo Barbosa wrote:
> Thank you very much for your answer.
>
> I used htop to check the system while it was running, and the swap (which shows 2 GB available) only uses ~40 MB during the calculation. The memory shows ~2 GB in use, so
> is there any chance that a process is swapping in and out of memory without htop showing it?
>
> I'm using MKL as installed with the Intel compiler, version 11.1. How can I check whether each lapw1 is trying to use 8 threads?
> I'm only using half of the available threads, so I thought I would be safe...
>
> Cheers,
> Marcelo Barbosa
>
> On May 14, 2012, at 2:19 PM, Laurence Marks wrote:
>
>> I suspect that nobody will be able to be very specific, beyond the obvious statement that you are overloading the computer. While you may only be using 2 GB for the Wien2k
>> jobs, the OS needs some too, so you may well be running out of memory. Did you check the swap space usage, and look to see whether the processes are swapping in and out of memory?
>>
>> Also, if you are using MKL then each lapw1 task may be trying to use 8 threads. Depending on how new the computer is, hyperthreading may or may not be efficient.
>>
>> I suggest looking in the system logs, as they might have some information, and using fewer tasks in parallel, e.g. 2. (And/or get more memory.)
>>
>> On Mon, May 14, 2012 at 6:46 AM, Marcelo Barbosa <marcelo.b.barbosa at gmail.com> wrote:
>>
>>     Hello to you all
>>
>>     I'm trying to run a structure made of 16 atoms using 100 k-points (resulting in 12 k-points in the irreducible Brillouin zone) on a machine with 4 cores with
>>     hyper-threading, thus 8 threads available, and 4 GB of RAM.
>>
>>     I tried to run "run_lapw -p -fc 1 -NI" using a .machines file:
>>
>>     1:localhost
>>     1:localhost
>>     1:localhost
>>     1:localhost
>>     granularity:1
>>     extrafine:1
>>
>>     accessing only four threads, but at LAPW1 in cycle 4 I get this in the *.dayfile:
>>
>>
>>     cycle 4 (Fri May 11 20:13:21 WEST 2012) (37/96 to go)
>>
>>     > lapw0 -p (20:13:21) starting parallel lapw0 at Fri May 11 20:13:21 WEST 2012
>>     -------- .machine0 : processors
>>     running lapw0 in single mode
>>     37.166u 0.361s 0:37.53 99.9% 0+0k 0+11944io 0pf+0w
>>     :FORCE convergence: 0 1 0 XCO 3.91 YCO 23.9 YCO 2.38 YCO 47.7 ZCO 45.0 ZCO 24.7 YCO 24.7 ZCO 50.6 YCO 4.12 YCO 10.1 ZCO 30.2 ZCO 3.51 YCO 3.27 YCO 5.20 ZCO 8.33 ZCO
>>     > lapw1 -c -p (20:14:00) starting parallel lapw1 at Fri May 11 20:14:00 WEST 2012
>>     -> starting parallel LAPW1 jobs at Fri May 11 20:14:00 WEST 2012
>>     running LAPW1 in parallel mode (using .machines)
>>     4 number_of_parallel_jobs
>>     [1] 26970
>>     [2] 27037
>>     [3] 27103
>>     [4] 27169
>>     [1] Done ( ( $remote $machine[$p] "cd $PWD;$t $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f
>>     .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>
>>
>>
>>     And then the computer completely crashes and I have to reboot it.
>>
>>     Do you have any idea of what might be happening?
>>     I thought it could be a lack of RAM, but until the end of the second cycle I was monitoring it with htop and it never used more than 2 GB of RAM, so I left it,
>>     thinking there would be no problem.
>>     I used tmux to run this in the background, as I access the machine through ssh.
>>
>>     Cheers,
>>     Marcelo Barbosa
>>
>>
>>
>>
>>     _______________________________________________
>>     Wien mailing list
>>     Wien at zeus.theochem.tuwien.ac.at
>>     http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>>
>>
>>
>> --
>> Professor Laurence Marks
>> Department of Materials Science and Engineering
>> Northwestern University
>> www.numis.northwestern.edu <http://www.numis.northwestern.edu/> 1-847-491-3996
>> "Research is to see what everybody else has seen, and to think what nobody else has thought"
>> Albert Szent-Gyorgi

-- 
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------

