[Wien] excessive bandwidth usage
Laurence Marks
L-marks at northwestern.edu
Thu Apr 2 23:26:04 CEST 2009
N.B., the SCRATCH variable should be set to some local disk space,
e.g. /tmp/Annes-Space
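For instance, a minimal sketch of what that could look like in the job
environment, assuming a bash shell and that /tmp on each compute node is
local disk (the directory name is only an example):

  # node-local scratch directory, created on each compute node
  mkdir -p /tmp/Annes-Space
  # lapw1/lapw2 should then write the case.vector* files here
  # rather than across the network to the shared scratch disk
  export SCRATCH=/tmp/Annes-Space
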
On Thu, Apr 2, 2009 at 4:23 PM, Laurence Marks <L-marks at northwestern.edu> wrote:
> For your two questions:
> a) Look in the userguide (the pdf will be easier to search) for the
> SCRATCH environment variable. You may be able to store the
> case.vector* files on each compute node if your system is set up so
> you can reach a temporary directory there.
> b) In terms of setting your problem up better, no idea -- we would
> need to know more about what it is, e.g. RKM, # k-points....
>
> On Thu, Apr 2, 2009 at 10:49 AM, Anne-Christine Uldry
> <anne-christine.uldry at psi.ch> wrote:
>> Dear Wien2k users,
>>
>> I am wondering if anyone could comment, or give me some pointers or
>> general information, on the way wien2k handles input/output of the
>> case.vector* files.
>>
>> I have recently tried to run (admittedly oversized) wien2k calculations
>> on our local cluster (details below). By the time my calculations reached
>> cycle 2 of the SCF, other users were complaining (and quite rightly so)
>> that I was taking up all the communication bandwidth doing I/O on the
>> several gigabytes' worth of case.vector* files.
>>
>> This particular calculation is probably not appropriate for this cluster,
>> and/or badly set up.
>>
>> My first question, however, is this:
>> Is there something one can do to limit reading and writing of the
>> vector files in some instances? In my case it looked as though the
>> vectors could be held in memory instead of being written to disk. Are
>> there variables or settings that could be adjusted to prevent this
>> checkpointing?
>>
>> My second question, if anyone can be bothered to look at the details, is:
>> Could I set up this calculation in a better way at all?
>>
>>
>> I am running wien2k version 09.1 on a fairly standard cluster. It has 24
>> compute nodes, each with 8 GB of RAM and two dual-core AMD Opteron
>> processors (2.4 GHz), and a one-gigabit interconnect. The operating
>> system is Scientific Linux 4. Maybe I should also mention that I have set
>> ulimit -v to 2000000 and ulimit -m to 2097152. OMP_NUM_THREADS is set to
>> 1, and SCRATCH points to our scratch disk (I don't know its size, but I
>> have never had a storage problem there).
>> I compiled for k-point parallelisation only, using the Intel compiler
>> 10.0 (64 bit) and the Intel MKL libraries. NMATMAX had been set to 13000,
>> NUME to 1000.
>> The system I looked at has 54 independent magnetic atoms in a 4129 au^3
>> supercell (no inversion symmetry). The matrix size was 6199 and I had 500
>> kpts. I requested 20 slots (out of a maximum of 32). The .machines file
>> looked like this:
>>
>> granularity:1
>>
>> plus 20 lines like this:
>>
>> 1:merlinXX:1
>>
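Spelled out, that file would presumably have looked something like the
following (the hostnames are taken from the dayfile below; the exact
selection and order of nodes are an assumption):

  granularity:1
  1:merlin01:1
  1:merlin03:1
  1:merlin04:1
  (... one such "1:<node>:1" line for each of the 20 requested nodes)
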
>> The command issued was "runsp_lapw -p".
>> The case.dayfile is reproduced below for the first cycle. Note that
>> during lapw1 -c the processors seem to be used at close to 100 percent,
>> while in lapw2 -c I get more like 20 percent.
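As a rough back-of-envelope check on the I/O volume behind those numbers
(a sketch only: each stored eigenvector costs one 16-byte complex
coefficient per matrix row, and NUME = 1000 is merely an upper bound on
how many eigenvectors are actually kept per k-point):

  # upper-bound estimate of one spin's vector files per lapw1 pass (bash)
  matrix=6199; bands=1000; kpts=500
  echo "$(( matrix * bands * 16 * kpts / 1000000000 )) GB at most, per spin"

Even if only a fraction of that is actually kept, tens of gigabytes per
cycle have to cross a shared one-gigabit link (roughly 125 MB/s at best)
when SCRATCH sits on a network disk, which is at least consistent with
lapw2 spending most of its wall-clock time waiting rather than computing
(the 20-25 percent CPU figures in the timings below).
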
>>
>>
>> Many thanks and best wishes,
>> Anne-Christine Uldry
>>
>>
>>
>>
>> -------------------------case.dayfile------------------------------------
>>
>>
>>
>> start (Tue Mar 17 09:59:33 CET 2009) with lapw0 (500/99 to go)
>>
>> cycle 1 (Tue Mar 17 09:59:33 CET 2009) (500/99 to go)
>>
>>> lapw0 -p (09:59:33) starting parallel lapw0 at Tue Mar 17 09:59:33
>> CET 2009
>> -------- .machine0 : processors
>> running lapw0 in single mode
>> 343.032u 3.523s 5:48.37 99.4% 0+0k 24+0io 13pf+0w
>>> lapw1 -c -up -p (10:05:22) starting parallel lapw1 at Tue Mar 17
>> 10:05:22 CET 2009
>> -> starting parallel LAPW1 jobs at Tue Mar 17 10:05:23 CET 2009
>> running LAPW1 in parallel mode (using .machines)
>> 20 number_of_parallel_jobs
>> merlin16(25) 37495.767u 154.792s 10:27:55.51 99.93% 0+0k 0+0io
>> 0pf+0w
>> merlin05(25) 40045.286u 55.558s 11:08:35.54 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin13(25) 38207.258u 55.456s 10:38:03.97 99.94% 0+0k 0+0io
>> 0pf+0w
>> merlin20(25) 37391.182u 153.745s 10:25:58.52 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin17(25) 37232.878u 147.786s 10:23:11.98 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin15(25) 37528.695u 153.319s 10:28:14.53 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin03(25) 38355.185u 53.653s 10:40:24.28 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin21(25) 37258.630u 153.113s 10:23:42.03 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin06(25) 38954.480u 49.951s 10:50:16.45 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin18(25) 38570.612u 166.023s 10:45:49.15 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin22(25) 38776.189u 182.754s 10:49:48.24 99.92% 0+0k 0+0io
>> 0pf+0w
>> merlin24(25) 37065.451u 160.089s 10:20:35.9 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin01(25) 39738.144u 77.905s 11:04:15.48 99.90% 0+0k 0+0io
>> 0pf+0w
>> merlin10(25) 39316.326u 51.954s 10:56:19.00 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin19(25) 40010.995u 155.071s 11:09:38.35 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin04(25) 38738.890u 52.391s 10:46:42.77 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin12(25) 39349.068u 57.359s 10:57:09.63 99.94% 0+0k 0+0io
>> 0pf+0w
>> merlin07(25) 38638.182u 51.192s 10:45:02.37 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin09(25) 39349.753u 120.713s 10:58:03.42 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin14(25) 39202.712u 68.919s 10:54:52.43 99.95% 0+0k 0+0io
>> 0pf+0w
>> Summary of lapw1para:
>> merlin16 k=25 user=37495.8 wallclock=627
>> merlin05 k=25 user=40045.3 wallclock=668
>> merlin13 k=25 user=38207.3 wallclock=638
>> merlin20 k=25 user=37391.2 wallclock=625
>> merlin17 k=25 user=37232.9 wallclock=623
>> merlin15 k=25 user=37528.7 wallclock=628
>> merlin03 k=25 user=38355.2 wallclock=640
>> merlin21 k=25 user=37258.6 wallclock=623
>> merlin06 k=25 user=38954.5 wallclock=650
>> merlin18 k=25 user=38570.6 wallclock=645
>> merlin22 k=25 user=38776.2 wallclock=649
>> merlin24 k=25 user=37065.5 wallclock=620
>> merlin01 k=25 user=39738.1 wallclock=664
>> merlin10 k=25 user=39316.3 wallclock=656
>> merlin19 k=25 user=40011 wallclock=669
>> merlin04 k=25 user=38738.9 wallclock=646
>> merlin12 k=25 user=39349.1 wallclock=657
>> merlin07 k=25 user=38638.2 wallclock=645
>> merlin09 k=25 user=39349.8 wallclock=658
>> merlin14 k=25 user=39202.7 wallclock=654
>> 17.098u 72.346s 11:09:57.45 0.2% 0+0k 40+320io 0pf+0w
>>> lapw1 -c -dn -p (21:15:19) starting parallel lapw1 at Tue Mar 17
>> 21:15:19 CET 2009
>> -> starting parallel LAPW1 jobs at Tue Mar 17 21:15:20 CET 2009
>> running LAPW1 in parallel mode (using .machines.help)
>> 20 number_of_parallel_jobs
>> merlin16(25) 37751.881u 155.731s 10:32:11.94 99.94% 0+0k 0+0io
>> 0pf+0w
>> merlin05(25) 39472.295u 104.662s 10:59:52.02 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin13(25) 37835.579u 55.090s 10:31:56.91 99.93% 0+0k 0+0io
>> 0pf+0w
>> merlin20(25) 37918.886u 184.326s 10:35:26.09 99.94% 0+0k 0+0io
>> 0pf+0w
>> merlin17(25) 37265.070u 403.814s 10:28:33.48 99.88% 0+0k 0+0io
>> 0pf+0w
>> merlin15(25) 37835.077u 157.843s 10:33:24.46 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin03(25) 40200.683u 76.920s 11:11:48.85 99.92% 0+0k 0+0io
>> 0pf+0w
>> merlin21(25) 38055.082u 157.867s 10:37:08.89 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin06(25) 38471.013u 59.757s 10:42:41.02 99.92% 0+0k 0+0io
>> 0pf+0w
>> merlin18(25) 39028.706u 160.356s 10:53:24.49 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin22(25) 39672.912u 157.281s 11:04:14.35 99.94% 0+0k 0+0io
>> 0pf+0w
>> merlin24(25) 37535.676u 244.072s 10:30:33.84 99.86% 0+0k 0+0io
>> 0pf+0w
>> merlin01(25) 40867.680u 87.081s 11:23:18.53 99.89% 0+0k 0+0io
>> 0pf+0w
>> merlin10(25) 38712.416u 52.325s 10:46:19.60 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin19(25) 38589.740u 161.574s 10:46:03.82 99.97% 0+0k 0+0io
>> 0pf+0w
>> merlin04(25) 38711.808u 53.267s 10:46:18.66 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin12(25) 39539.575u 55.224s 11:16.21 99.95% 0+0k 0+0io
>> 0pf+0w
>> merlin07(25) 37432.873u 48.500s 10:24:56.60 99.96% 0+0k 0+0io
>> 0pf+0w
>> merlin09(25) 39396.568u 128.451s 10:59:07.64 99.94% 0+0k 0+0io
>> 0pf+0w
>> merlin14(25) 39549.837u 79.790s 11:01:06.82 99.91% 0+0k 0+0io
>> 0pf+0w
>> Summary of lapw1para:
>> merlin16 k=25 user=37751.9 wallclock=632
>> merlin05 k=25 user=39472.3 wallclock=659
>> merlin13 k=25 user=37835.6 wallclock=631
>> merlin20 k=25 user=37918.9 wallclock=635
>> merlin17 k=25 user=37265.1 wallclock=628
>> merlin15 k=25 user=37835.1 wallclock=633
>> merlin03 k=25 user=40200.7 wallclock=671
>> merlin21 k=25 user=38055.1 wallclock=637
>> merlin06 k=25 user=38471 wallclock=642
>> merlin18 k=25 user=39028.7 wallclock=653
>> merlin22 k=25 user=39672.9 wallclock=664
>> merlin24 k=25 user=37535.7 wallclock=630
>> merlin01 k=25 user=40867.7 wallclock=683
>> merlin10 k=25 user=38712.4 wallclock=646
>> merlin19 k=25 user=38589.7 wallclock=646
>> merlin04 k=25 user=38711.8 wallclock=646
>> merlin12 k=25 user=39539.6 wallclock=676.21
>> merlin07 k=25 user=37432.9 wallclock=624
>> merlin09 k=25 user=39396.6 wallclock=659
>> merlin14 k=25 user=39549.8 wallclock=661
>> 17.925u 74.156s 11:23:35.72 0.2% 0+0k 0+320io 0pf+0w
>>> lapw2 -c -up -p (08:38:55) running LAPW2 in parallel mode
>> merlin16 2510.963u 91.613s 2:46:13.78 26.09% 0+0k 0+0io 0pf+0w
>> merlin05 2493.665u 79.846s 2:46:09.00 25.82% 0+0k 0+0io 0pf+0w
>> merlin13 2484.150u 78.296s 2:46:03.32 25.72% 0+0k 0+0io 0pf+0w
>> merlin20 2479.015u 104.637s 2:46:24.73 25.88% 0+0k 0+0io 0pf+0w
>> merlin17 2461.499u 103.356s 2:46:18.27 25.70% 0+0k 0+0io 0pf+0w
>> merlin15 2474.984u 88.321s 2:46:14.32 25.70% 0+0k 0+0io 0pf+0w
>> merlin03 2544.354u 86.533s 2:47:15.34 26.22% 0+0k 0+0io 0pf+0w
>> merlin21 2510.726u 105.354s 2:46:52.52 26.13% 0+0k 0+0io 0pf+0w
>> merlin06 2519.390u 87.796s 2:47:29.42 25.94% 0+0k 0+0io 0pf+0w
>> merlin18 2529.690u 122.496s 2:47:29.15 26.39% 0+0k 0+0io 0pf+0w
>> merlin22 2468.111u 114.877s 2:47:14.86 25.74% 0+0k 0+0io 0pf+0w
>> merlin24 2473.606u 112.375s 2:46:11.20 25.93% 0+0k 0+0io 0pf+0w
>> merlin01 2495.088u 93.868s 2:47:32.19 25.76% 0+0k 0+0io 0pf+0w
>> merlin10 2438.887u 77.412s 2:46:59.86 25.11% 0+0k 0+0io 0pf+0w
>> merlin19 2521.003u 95.387s 2:46:56.37 26.12% 0+0k 0+0io 0pf+0w
>> merlin04 2484.324u 78.449s 2:46:46.20 25.61% 0+0k 0+0io 0pf+0w
>> merlin12 2591.029u 95.105s 2:47:18.48 26.76% 0+0k 0+0io 0pf+0w
>> merlin07 2427.397u 77.359s 2:46:30.55 25.07% 0+0k 0+0io 0pf+0w
>> merlin09 2443.942u 82.523s 2:46:29.46 25.29% 0+0k 0+0io 0pf+0w
>> merlin14 2471.389u 89.183s 2:46:57.65 25.56% 0+0k 0+0io 0pf+0w
>> Summary of lapw2para:
>> merlin16 user=2510.96 wallclock=166
>> merlin05 user=2493.66 wallclock=166
>> merlin13 user=2484.15 wallclock=166
>> merlin20 user=2479.01 wallclock=166
>> merlin17 user=2461.5 wallclock=166
>> merlin15 user=2474.98 wallclock=166
>> merlin03 user=2544.35 wallclock=167
>> merlin21 user=2510.73 wallclock=166
>> merlin06 user=2519.39 wallclock=167
>> merlin18 user=2529.69 wallclock=167
>> merlin22 user=2468.11 wallclock=167
>> merlin24 user=2473.61 wallclock=166
>> merlin01 user=2495.09 wallclock=167
>> merlin10 user=2438.89 wallclock=166
>> merlin19 user=2521 wallclock=166
>> merlin04 user=2484.32 wallclock=166
>> merlin12 user=2591.03 wallclock=167
>> merlin07 user=2427.4 wallclock=166
>> merlin09 user=2443.94 wallclock=166
>> merlin14 user=2471.39 wallclock=166
>> 36.088u 7.790s 2:48:38.21 0.4% 0+0k 24+160io 8pf+0w
>>> lapw2 -c -dn -p (11:27:33) running LAPW2 in parallel mode
>> merlin16 2156.929u 95.886s 2:45:56.30 22.63% 0+0k 0+0io 0pf+0w
>> merlin05 2135.272u 84.472s 2:45:27.61 22.36% 0+0k 0+0io 0pf+0w
>> merlin13 2071.188u 78.480s 2:45:21.64 21.67% 0+0k 0+0io 0pf+0w
>> merlin20 2151.438u 104.975s 2:46:04.56 22.64% 0+0k 0+0io 0pf+0w
>> merlin17 2133.444u 97.060s 2:45:30.52 22.46% 0+0k 0+0io 0pf+0w
>> merlin15 2081.077u 81.041s 2:45:48.49 21.73% 0+0k 0+0io 0pf+0w
>> merlin03 2137.855u 86.559s 2:46:28.13 22.27% 0+0k 0+0io 0pf+0w
>> merlin21 2093.567u 101.220s 2:46:17.25 22.00% 0+0k 0+0io 0pf+0w
>> merlin06 2143.250u 97.554s 2:46:54.45 22.38% 0+0k 0+0io 0pf+0w
>> merlin18 2084.752u 105.447s 2:46:26.67 21.93% 0+0k 0+0io 0pf+0w
>> merlin22 2082.295u 99.013s 2:46:43.88 21.80% 0+0k 0+0io 0pf+0w
>> merlin24 2072.152u 95.022s 2:45:44.16 21.79% 0+0k 0+0io 0pf+0w
>> merlin01 2132.420u 91.661s 2:46:35.91 22.25% 0+0k 0+0io 0pf+0w
>> merlin10 2118.587u 94.126s 2:46:39.64 22.13% 0+0k 0+0io 0pf+0w
>> merlin19 2102.943u 92.078s 2:46:20.96 21.99% 0+0k 0+0io 0pf+0w
>> merlin04 2089.082u 85.161s 2:46:19.56 21.79% 0+0k 0+0io 0pf+0w
>> merlin12 2144.932u 87.126s 2:46:19.10 22.37% 0+0k 0+0io 0pf+0w
>> merlin07 2084.597u 83.871s 2:45:57.54 21.78% 0+0k 0+0io 0pf+0w
>> merlin09 2051.034u 75.865s 2:45:49.55 21.38% 0+0k 0+0io 0pf+0w
>> merlin14 2061.305u 87.110s 2:46:01.46 21.57% 0+0k 0+0io 0pf+0w
>> Summary of lapw2para:
>> merlin16 user=2156.93 wallclock=165
>> merlin05 user=2135.27 wallclock=165
>> merlin13 user=2071.19 wallclock=165
>> merlin20 user=2151.44 wallclock=166
>> merlin17 user=2133.44 wallclock=165
>> merlin15 user=2081.08 wallclock=165
>> merlin03 user=2137.86 wallclock=166
>> merlin21 user=2093.57 wallclock=166
>> merlin06 user=2143.25 wallclock=166
>> merlin18 user=2084.75 wallclock=166
>> merlin22 user=2082.3 wallclock=166
>> merlin24 user=2072.15 wallclock=165
>> merlin01 user=2132.42 wallclock=166
>> merlin10 user=2118.59 wallclock=166
>> merlin19 user=2102.94 wallclock=166
>> merlin04 user=2089.08 wallclock=166
>> merlin12 user=2144.93 wallclock=166
>> merlin07 user=2084.6 wallclock=165
>> merlin09 user=2051.03 wallclock=165
>> merlin14 user=2061.3 wallclock=166
>> 36.424u 6.018s 2:47:52.05 0.4% 0+0k 0+160io 0pf+0w
>>> lcore -up (14:15:25) 1.252u 0.307s 0:01.78 87.0% 0+0k 1744+0io
>> 8pf+0w
>>> lcore -dn (14:15:27) 1.259u 0.291s 0:01.75 88.0% 0+0k 8+0io 0pf+0w
>>> mixer (14:15:35) 12.846u 5.332s 0:22.49 80.7% 0+0k 433168+0io
>> 12pf+0w
>> :ENERGY convergence: 0 0 0
>> :CHARGE convergence: 0 0.00005 0
>>
>> cycle 2 (Wed Mar 18 14:15:57 CET 2009) (499/98 to go)
>>
>> -------------------------------------------------------------------------------------
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>
--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering to study the structure of matter.