[Wien] excessive bandwidth usage
Laurence Marks
L-marks at northwestern.edu
Thu Apr 2 23:23:44 CEST 2009
For your two questions:
a) Look in the userguide (the pdf will be easier to search) for the
SCRATCH environment variable. You may be able to store the
case.vector* files locally on each compute node if your system is set
up so you can reach a temporary directory there (see the sketch below).
b) In terms of setting your problem up better, no idea -- we would
need to know more about what it is, e.g. RKM, # k-points....
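As a rough sketch (untested, and the node-local path is only a guess --
check what your nodes actually provide), you would point SCRATCH at a
local disk in the job script before calling runsp_lapw, e.g. in bash:

  # hypothetical job-script fragment; /tmp is assumed to be node-local
  export SCRATCH=/tmp/$USER/wien_scratch
  mkdir -p $SCRATCH
  runsp_lapw -p
  # copy back anything you still need from $SCRATCH, then clean up
  rm -rf $SCRATCH

The case.vector* files should then be written and read locally on the
node that handles those k-points, instead of travelling over the
gigabit interconnect every cycle.
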
On Thu, Apr 2, 2009 at 10:49 AM, Anne-Christine Uldry
<anne-christine.uldry at psi.ch> wrote:
> Dear Wien2k users,
>
> I am wondering if anyone could comment, or give me some pointers or
> general information, on the way wien2k handles input/output for the
> case.vector* files.
>
> I have recently tried to run (admittedly oversized) wien2k calculations on
> our local cluster (details below). By the time my calculations reached
> cycle 2 of the SCF, other users were complaining (and quite rightly so)
> that I was taking up all the communication bandwidth doing I/O on the
> several gigabytes' worth of case.vector* files.
>
> This particular calculation is probably not appropriate for this cluster,
> and/or badly set up.
>
> My first question, however, is this:
> Is there something one can do to limit reading/writing of the
> vector files in some instances? In my case it looked as if the vectors
> could be held in memory instead of being written to disk. Are there
> variables or settings that could be adjusted to prevent this kind of
> checkpointing?
>
> My second question, if anyone can be bothered to look at the details, is:
> Could I set up this calculation in a better way at all?
>
>
> I am running wien2k version 09.1 on a fairly standard cluster. It has 24
> compute nodes with 8 GB of RAM each, carrying two dual-core AMD Opterons
> (2.4 GHz), and a one-gigabit interconnect. The operating system is
> Scientific Linux 4. Maybe I should also mention that I have set ulimit -v
> to 2000000 and ulimit -m to 2097152. OMP_NUM_THREADS is set to 1, and
> SCRATCH points to our scratch disk (I don't know its size, but I have
> never had a storage problem there).
> I compiled for k-point parallelisation only, using the Intel compiler 10.0
> (64-bit) and the Intel MKL libraries. NMATMAX had been set to 13000, and
> NUME to 1000.
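> In bash terms, each job's environment is therefore set up roughly like
> this (the scratch path below is only a placeholder for our shared
> scratch disk):
>
>   ulimit -v 2000000                 # virtual memory limit, in kB
>   ulimit -m 2097152                 # max resident set size, in kB
>   export OMP_NUM_THREADS=1          # one thread per lapw process
>   export SCRATCH=/path/to/shared/scratch   # placeholder for the shared disk
>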
> The system I looked at has 54 independent magnetic atoms in a 4129 au^3
> supercell (no inversion symmetry). The matrix size was 6199 and I had 500
> kpts. I requested 20 slots (out of a maximum of 32). The .machines file
> looked like this:
>
> granularity:1
>
> plus 20 lines like this:
>
> 1:merlinXX:1
>
> The command issued was "runsp_lapw -p".
> The case.dayfile is reproduced below for the first cycle. Note that during
> lapw1 -c the processors seem to be used at close to 100 percent, while in
> lapw2 -c I get more like 20 percent.
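> (A back-of-the-envelope estimate of the traffic: with no inversion
> symmetry the eigenvector coefficients are complex, i.e. 16 bytes each,
> so one k-point with matrix size 6199 and up to NUME=1000 bands kept is
> on the order of 6199 x 1000 x 16 bytes ~ 0.1 GB in the vector file.
> With 25 k-points and two spins per node, that is already a few GB of
> case.vector* data per node and per SCF cycle going over the network.)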
>
>
> Many thanks and best wishes,
> Anne-Christine Uldry
>
>
>
>
> -------------------------case.dayfile------------------------------------
>
>
>
> start (Tue Mar 17 09:59:33 CET 2009) with lapw0 (500/99 to go)
>
> cycle 1 (Tue Mar 17 09:59:33 CET 2009) (500/99 to go)
>
>> lapw0 -p (09:59:33) starting parallel lapw0 at Tue Mar 17 09:59:33
> CET 2009
> -------- .machine0 : processors
> running lapw0 in single mode
> 343.032u 3.523s 5:48.37 99.4% 0+0k 24+0io 13pf+0w
>> lapw1 -c -up -p (10:05:22) starting parallel lapw1 at Tue Mar 17
> 10:05:22 CET 2009
> -> starting parallel LAPW1 jobs at Tue Mar 17 10:05:23 CET 2009
> running LAPW1 in parallel mode (using .machines)
> 20 number_of_parallel_jobs
> merlin16(25) 37495.767u 154.792s 10:27:55.51 99.93% 0+0k 0+0io
> 0pf+0w
> merlin05(25) 40045.286u 55.558s 11:08:35.54 99.96% 0+0k 0+0io
> 0pf+0w
> merlin13(25) 38207.258u 55.456s 10:38:03.97 99.94% 0+0k 0+0io
> 0pf+0w
> merlin20(25) 37391.182u 153.745s 10:25:58.52 99.96% 0+0k 0+0io
> 0pf+0w
> merlin17(25) 37232.878u 147.786s 10:23:11.98 99.97% 0+0k 0+0io
> 0pf+0w
> merlin15(25) 37528.695u 153.319s 10:28:14.53 99.97% 0+0k 0+0io
> 0pf+0w
> merlin03(25) 38355.185u 53.653s 10:40:24.28 99.96% 0+0k 0+0io
> 0pf+0w
> merlin21(25) 37258.630u 153.113s 10:23:42.03 99.97% 0+0k 0+0io
> 0pf+0w
> merlin06(25) 38954.480u 49.951s 10:50:16.45 99.97% 0+0k 0+0io
> 0pf+0w
> merlin18(25) 38570.612u 166.023s 10:45:49.15 99.97% 0+0k 0+0io
> 0pf+0w
> merlin22(25) 38776.189u 182.754s 10:49:48.24 99.92% 0+0k 0+0io
> 0pf+0w
> merlin24(25) 37065.451u 160.089s 10:20:35.9 99.97% 0+0k 0+0io
> 0pf+0w
> merlin01(25) 39738.144u 77.905s 11:04:15.48 99.90% 0+0k 0+0io
> 0pf+0w
> merlin10(25) 39316.326u 51.954s 10:56:19.00 99.97% 0+0k 0+0io
> 0pf+0w
> merlin19(25) 40010.995u 155.071s 11:09:38.35 99.97% 0+0k 0+0io
> 0pf+0w
> merlin04(25) 38738.890u 52.391s 10:46:42.77 99.97% 0+0k 0+0io
> 0pf+0w
> merlin12(25) 39349.068u 57.359s 10:57:09.63 99.94% 0+0k 0+0io
> 0pf+0w
> merlin07(25) 38638.182u 51.192s 10:45:02.37 99.97% 0+0k 0+0io
> 0pf+0w
> merlin09(25) 39349.753u 120.713s 10:58:03.42 99.97% 0+0k 0+0io
> 0pf+0w
> merlin14(25) 39202.712u 68.919s 10:54:52.43 99.95% 0+0k 0+0io
> 0pf+0w
> Summary of lapw1para:
> merlin16 k=25 user=37495.8 wallclock=627
> merlin05 k=25 user=40045.3 wallclock=668
> merlin13 k=25 user=38207.3 wallclock=638
> merlin20 k=25 user=37391.2 wallclock=625
> merlin17 k=25 user=37232.9 wallclock=623
> merlin15 k=25 user=37528.7 wallclock=628
> merlin03 k=25 user=38355.2 wallclock=640
> merlin21 k=25 user=37258.6 wallclock=623
> merlin06 k=25 user=38954.5 wallclock=650
> merlin18 k=25 user=38570.6 wallclock=645
> merlin22 k=25 user=38776.2 wallclock=649
> merlin24 k=25 user=37065.5 wallclock=620
> merlin01 k=25 user=39738.1 wallclock=664
> merlin10 k=25 user=39316.3 wallclock=656
> merlin19 k=25 user=40011 wallclock=669
> merlin04 k=25 user=38738.9 wallclock=646
> merlin12 k=25 user=39349.1 wallclock=657
> merlin07 k=25 user=38638.2 wallclock=645
> merlin09 k=25 user=39349.8 wallclock=658
> merlin14 k=25 user=39202.7 wallclock=654
> 17.098u 72.346s 11:09:57.45 0.2% 0+0k 40+320io 0pf+0w
>> lapw1 -c -dn -p (21:15:19) starting parallel lapw1 at Tue Mar 17
> 21:15:19 CET 2009
> -> starting parallel LAPW1 jobs at Tue Mar 17 21:15:20 CET 2009
> running LAPW1 in parallel mode (using .machines.help)
> 20 number_of_parallel_jobs
> merlin16(25) 37751.881u 155.731s 10:32:11.94 99.94% 0+0k 0+0io
> 0pf+0w
> merlin05(25) 39472.295u 104.662s 10:59:52.02 99.96% 0+0k 0+0io
> 0pf+0w
> merlin13(25) 37835.579u 55.090s 10:31:56.91 99.93% 0+0k 0+0io
> 0pf+0w
> merlin20(25) 37918.886u 184.326s 10:35:26.09 99.94% 0+0k 0+0io
> 0pf+0w
> merlin17(25) 37265.070u 403.814s 10:28:33.48 99.88% 0+0k 0+0io
> 0pf+0w
> merlin15(25) 37835.077u 157.843s 10:33:24.46 99.97% 0+0k 0+0io
> 0pf+0w
> merlin03(25) 40200.683u 76.920s 11:11:48.85 99.92% 0+0k 0+0io
> 0pf+0w
> merlin21(25) 38055.082u 157.867s 10:37:08.89 99.96% 0+0k 0+0io
> 0pf+0w
> merlin06(25) 38471.013u 59.757s 10:42:41.02 99.92% 0+0k 0+0io
> 0pf+0w
> merlin18(25) 39028.706u 160.356s 10:53:24.49 99.96% 0+0k 0+0io
> 0pf+0w
> merlin22(25) 39672.912u 157.281s 11:04:14.35 99.94% 0+0k 0+0io
> 0pf+0w
> merlin24(25) 37535.676u 244.072s 10:30:33.84 99.86% 0+0k 0+0io
> 0pf+0w
> merlin01(25) 40867.680u 87.081s 11:23:18.53 99.89% 0+0k 0+0io
> 0pf+0w
> merlin10(25) 38712.416u 52.325s 10:46:19.60 99.96% 0+0k 0+0io
> 0pf+0w
> merlin19(25) 38589.740u 161.574s 10:46:03.82 99.97% 0+0k 0+0io
> 0pf+0w
> merlin04(25) 38711.808u 53.267s 10:46:18.66 99.96% 0+0k 0+0io
> 0pf+0w
> merlin12(25) 39539.575u 55.224s 11:16.21 99.95% 0+0k 0+0io
> 0pf+0w
> merlin07(25) 37432.873u 48.500s 10:24:56.60 99.96% 0+0k 0+0io
> 0pf+0w
> merlin09(25) 39396.568u 128.451s 10:59:07.64 99.94% 0+0k 0+0io
> 0pf+0w
> merlin14(25) 39549.837u 79.790s 11:01:06.82 99.91% 0+0k 0+0io
> 0pf+0w
> Summary of lapw1para:
> merlin16 k=25 user=37751.9 wallclock=632
> merlin05 k=25 user=39472.3 wallclock=659
> merlin13 k=25 user=37835.6 wallclock=631
> merlin20 k=25 user=37918.9 wallclock=635
> merlin17 k=25 user=37265.1 wallclock=628
> merlin15 k=25 user=37835.1 wallclock=633
> merlin03 k=25 user=40200.7 wallclock=671
> merlin21 k=25 user=38055.1 wallclock=637
> merlin06 k=25 user=38471 wallclock=642
> merlin18 k=25 user=39028.7 wallclock=653
> merlin22 k=25 user=39672.9 wallclock=664
> merlin24 k=25 user=37535.7 wallclock=630
> merlin01 k=25 user=40867.7 wallclock=683
> merlin10 k=25 user=38712.4 wallclock=646
> merlin19 k=25 user=38589.7 wallclock=646
> merlin04 k=25 user=38711.8 wallclock=646
> merlin12 k=25 user=39539.6 wallclock=676.21
> merlin07 k=25 user=37432.9 wallclock=624
> merlin09 k=25 user=39396.6 wallclock=659
> merlin14 k=25 user=39549.8 wallclock=661
> 17.925u 74.156s 11:23:35.72 0.2% 0+0k 0+320io 0pf+0w
>> lapw2 -c -up -p (08:38:55) running LAPW2 in parallel mode
> merlin16 2510.963u 91.613s 2:46:13.78 26.09% 0+0k 0+0io 0pf+0w
> merlin05 2493.665u 79.846s 2:46:09.00 25.82% 0+0k 0+0io 0pf+0w
> merlin13 2484.150u 78.296s 2:46:03.32 25.72% 0+0k 0+0io 0pf+0w
> merlin20 2479.015u 104.637s 2:46:24.73 25.88% 0+0k 0+0io 0pf+0w
> merlin17 2461.499u 103.356s 2:46:18.27 25.70% 0+0k 0+0io 0pf+0w
> merlin15 2474.984u 88.321s 2:46:14.32 25.70% 0+0k 0+0io 0pf+0w
> merlin03 2544.354u 86.533s 2:47:15.34 26.22% 0+0k 0+0io 0pf+0w
> merlin21 2510.726u 105.354s 2:46:52.52 26.13% 0+0k 0+0io 0pf+0w
> merlin06 2519.390u 87.796s 2:47:29.42 25.94% 0+0k 0+0io 0pf+0w
> merlin18 2529.690u 122.496s 2:47:29.15 26.39% 0+0k 0+0io 0pf+0w
> merlin22 2468.111u 114.877s 2:47:14.86 25.74% 0+0k 0+0io 0pf+0w
> merlin24 2473.606u 112.375s 2:46:11.20 25.93% 0+0k 0+0io 0pf+0w
> merlin01 2495.088u 93.868s 2:47:32.19 25.76% 0+0k 0+0io 0pf+0w
> merlin10 2438.887u 77.412s 2:46:59.86 25.11% 0+0k 0+0io 0pf+0w
> merlin19 2521.003u 95.387s 2:46:56.37 26.12% 0+0k 0+0io 0pf+0w
> merlin04 2484.324u 78.449s 2:46:46.20 25.61% 0+0k 0+0io 0pf+0w
> merlin12 2591.029u 95.105s 2:47:18.48 26.76% 0+0k 0+0io 0pf+0w
> merlin07 2427.397u 77.359s 2:46:30.55 25.07% 0+0k 0+0io 0pf+0w
> merlin09 2443.942u 82.523s 2:46:29.46 25.29% 0+0k 0+0io 0pf+0w
> merlin14 2471.389u 89.183s 2:46:57.65 25.56% 0+0k 0+0io 0pf+0w
> Summary of lapw2para:
> merlin16 user=2510.96 wallclock=166
> merlin05 user=2493.66 wallclock=166
> merlin13 user=2484.15 wallclock=166
> merlin20 user=2479.01 wallclock=166
> merlin17 user=2461.5 wallclock=166
> merlin15 user=2474.98 wallclock=166
> merlin03 user=2544.35 wallclock=167
> merlin21 user=2510.73 wallclock=166
> merlin06 user=2519.39 wallclock=167
> merlin18 user=2529.69 wallclock=167
> merlin22 user=2468.11 wallclock=167
> merlin24 user=2473.61 wallclock=166
> merlin01 user=2495.09 wallclock=167
> merlin10 user=2438.89 wallclock=166
> merlin19 user=2521 wallclock=166
> merlin04 user=2484.32 wallclock=166
> merlin12 user=2591.03 wallclock=167
> merlin07 user=2427.4 wallclock=166
> merlin09 user=2443.94 wallclock=166
> merlin14 user=2471.39 wallclock=166
> 36.088u 7.790s 2:48:38.21 0.4% 0+0k 24+160io 8pf+0w
>> lapw2 -c -dn -p (11:27:33) running LAPW2 in parallel mode
> merlin16 2156.929u 95.886s 2:45:56.30 22.63% 0+0k 0+0io 0pf+0w
> merlin05 2135.272u 84.472s 2:45:27.61 22.36% 0+0k 0+0io 0pf+0w
> merlin13 2071.188u 78.480s 2:45:21.64 21.67% 0+0k 0+0io 0pf+0w
> merlin20 2151.438u 104.975s 2:46:04.56 22.64% 0+0k 0+0io 0pf+0w
> merlin17 2133.444u 97.060s 2:45:30.52 22.46% 0+0k 0+0io 0pf+0w
> merlin15 2081.077u 81.041s 2:45:48.49 21.73% 0+0k 0+0io 0pf+0w
> merlin03 2137.855u 86.559s 2:46:28.13 22.27% 0+0k 0+0io 0pf+0w
> merlin21 2093.567u 101.220s 2:46:17.25 22.00% 0+0k 0+0io 0pf+0w
> merlin06 2143.250u 97.554s 2:46:54.45 22.38% 0+0k 0+0io 0pf+0w
> merlin18 2084.752u 105.447s 2:46:26.67 21.93% 0+0k 0+0io 0pf+0w
> merlin22 2082.295u 99.013s 2:46:43.88 21.80% 0+0k 0+0io 0pf+0w
> merlin24 2072.152u 95.022s 2:45:44.16 21.79% 0+0k 0+0io 0pf+0w
> merlin01 2132.420u 91.661s 2:46:35.91 22.25% 0+0k 0+0io 0pf+0w
> merlin10 2118.587u 94.126s 2:46:39.64 22.13% 0+0k 0+0io 0pf+0w
> merlin19 2102.943u 92.078s 2:46:20.96 21.99% 0+0k 0+0io 0pf+0w
> merlin04 2089.082u 85.161s 2:46:19.56 21.79% 0+0k 0+0io 0pf+0w
> merlin12 2144.932u 87.126s 2:46:19.10 22.37% 0+0k 0+0io 0pf+0w
> merlin07 2084.597u 83.871s 2:45:57.54 21.78% 0+0k 0+0io 0pf+0w
> merlin09 2051.034u 75.865s 2:45:49.55 21.38% 0+0k 0+0io 0pf+0w
> merlin14 2061.305u 87.110s 2:46:01.46 21.57% 0+0k 0+0io 0pf+0w
> Summary of lapw2para:
> merlin16 user=2156.93 wallclock=165
> merlin05 user=2135.27 wallclock=165
> merlin13 user=2071.19 wallclock=165
> merlin20 user=2151.44 wallclock=166
> merlin17 user=2133.44 wallclock=165
> merlin15 user=2081.08 wallclock=165
> merlin03 user=2137.86 wallclock=166
> merlin21 user=2093.57 wallclock=166
> merlin06 user=2143.25 wallclock=166
> merlin18 user=2084.75 wallclock=166
> merlin22 user=2082.3 wallclock=166
> merlin24 user=2072.15 wallclock=165
> merlin01 user=2132.42 wallclock=166
> merlin10 user=2118.59 wallclock=166
> merlin19 user=2102.94 wallclock=166
> merlin04 user=2089.08 wallclock=166
> merlin12 user=2144.93 wallclock=166
> merlin07 user=2084.6 wallclock=165
> merlin09 user=2051.03 wallclock=165
> merlin14 user=2061.3 wallclock=166
> 36.424u 6.018s 2:47:52.05 0.4% 0+0k 0+160io 0pf+0w
>> lcore -up (14:15:25) 1.252u 0.307s 0:01.78 87.0% 0+0k 1744+0io
> 8pf+0w
>> lcore -dn (14:15:27) 1.259u 0.291s 0:01.75 88.0% 0+0k 8+0io 0pf+0w
>> mixer (14:15:35) 12.846u 5.332s 0:22.49 80.7% 0+0k 433168+0io
> 12pf+0w
> :ENERGY convergence: 0 0 0
> :CHARGE convergence: 0 0.00005 0
>
> cycle 2 (Wed Mar 18 14:15:57 CET 2009) (499/98 to go)
>
> -------------------------------------------------------------------------------------
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering to study the structure of matter.