[Wien] excessive bandwidth usage
Anne-Christine Uldry
anne-christine.uldry at psi.ch
Thu Apr 2 17:49:57 CEST 2009
Dear Wien2k users,
I am wondering if anyone could comment, or give me some pointers or general
information, regarding the way wien2k handles input/output for the
case.vector* files.
I have recently tried to run (admittedly oversized) wien2k calculations on
our local cluster (details below). By the time my calculations reached
cycle 2 of the SCF, other users were complaining (quite rightly) that I was
taking up all the communication bandwidth doing I/O on the several
gigabytes' worth of case.vector* files.
This particular calculation is probably not appropriate for this cluster,
and/or badly set up.
My first question, however, is this:
Is there anything one can do to limit the reading and writing of the
vector files in some cases? In my run it looked as though the vectors
could be held in memory instead of being written to disk. Are there
variables or settings that could be adjusted to prevent this checkpointing?
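One workaround I could imagine (just a sketch, assuming each compute node
has a local /tmp with enough free space, which I have not verified on our
nodes) would be to point SCRATCH at node-local disk in the job script, so
that the vector files stop crossing the interconnect altogether:

```shell
# Hypothetical job-script fragment: write case.vector* to node-local
# disk instead of the shared scratch filesystem. Assumes each node
# has a local /tmp large enough for its share of the vector files.
export SCRATCH=/tmp/$USER/wien2k_scratch
mkdir -p "$SCRATCH"
# ... run the calculation here, e.g. runsp_lapw -p ...
# clean up the node-local scratch afterwards
rm -rf "$SCRATCH"
```

I do not know offhand whether, with k-point parallelisation, lapw2 would
then find each vector file on the node that wrote it, so I would welcome
comments on whether this is safe.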
My second question, for anyone willing to look at the details, is:
Could I have set up this calculation in a better way?
I am running wien2k version 09.1 on a fairly standard cluster. It has 24
compute nodes, each with 8 GB of RAM and two dual-core AMD Opterons
(2.4 GHz), connected by a one-gigabit interconnect. The operating system
is Scientific Linux 4. I should perhaps also mention that I have set
ulimit -v to 2000000 and ulimit -m to 2097152. OMP_NUM_THREADS is set to
1, and SCRATCH points to our shared scratch disk (I don't know its size,
but I have never had a storage problem there).
I compiled for k-point parallelisation only, using the Intel compiler 10.0
(64 bit) and the Intel MKL libraries. NMATMAX was set to 13000 and NUME
to 1000.
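As a sanity check on those limits, here is my own back-of-envelope
estimate (not from the WIEN2k documentation) of the memory needed for a
single N x N complex double-precision matrix:

```shell
# Back-of-envelope memory for one NxN complex matrix (16 bytes/element).
# lapw1 holds more than one such array, so real usage is higher.
for N in 6199 13000; do   # actual matrix size vs. NMATMAX
  echo "$N" | awk '{printf "N=%d: ~%.1f GB\n", $1, $1*$1*16/1e9}'
done
```

If I have this right, a single matrix at NMATMAX=13000 (~2.7 GB) would
already exceed my ~2 GB ulimit, while the actual matrix size of 6199
(~0.6 GB) fits comfortably.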
The system I looked at has 54 independent magnetic atoms in a 4129 au^3
supercell (no inversion symmetry). The matrix size was 6199 and I used
500 k-points. I requested 20 slots (out of a maximum of 32). The .machines
file looked like this:
granularity:1
followed by 20 lines of the form:
1:merlinXX:1
The command issued was "runsp_lapw -p".
The case.dayfile for the first cycle is reproduced below. Note that during
lapw1 -c the processors appear to be used at close to 100 percent, while
in lapw2 -c I get more like 20 percent.
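For reference, my own rough upper bound on the total vector file volume,
treating NUME as the number of bands written per k-point (the real files
are much smaller, since far fewer bands are actually stored):

```shell
# Upper bound on total case.vector* volume:
# matrix size x bands x k-points x spins x 16 bytes (complex, no inversion).
echo "6199 1000 500 2" | awk '{printf "upper bound: ~%.0f GB\n", $1*$2*$3*$4*16/1e9}'
```

Even at a few percent of this bound, several gigabytes get pushed over a
one-gigabit link every SCF cycle, which seems consistent with what the
other users are seeing.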
Many thanks and best wishes,
Anne-Christine Uldry
-------------------------case.dayfile------------------------------------
start (Tue Mar 17 09:59:33 CET 2009) with lapw0 (500/99 to go)
cycle 1 (Tue Mar 17 09:59:33 CET 2009) (500/99 to go)
> lapw0 -p (09:59:33) starting parallel lapw0 at Tue Mar 17 09:59:33
CET 2009
-------- .machine0 : processors
running lapw0 in single mode
343.032u 3.523s 5:48.37 99.4% 0+0k 24+0io 13pf+0w
> lapw1 -c -up -p (10:05:22) starting parallel lapw1 at Tue Mar 17
10:05:22 CET 2009
-> starting parallel LAPW1 jobs at Tue Mar 17 10:05:23 CET 2009
running LAPW1 in parallel mode (using .machines)
20 number_of_parallel_jobs
merlin16(25) 37495.767u 154.792s 10:27:55.51 99.93% 0+0k 0+0io
0pf+0w
merlin05(25) 40045.286u 55.558s 11:08:35.54 99.96% 0+0k 0+0io
0pf+0w
merlin13(25) 38207.258u 55.456s 10:38:03.97 99.94% 0+0k 0+0io
0pf+0w
merlin20(25) 37391.182u 153.745s 10:25:58.52 99.96% 0+0k 0+0io
0pf+0w
merlin17(25) 37232.878u 147.786s 10:23:11.98 99.97% 0+0k 0+0io
0pf+0w
merlin15(25) 37528.695u 153.319s 10:28:14.53 99.97% 0+0k 0+0io
0pf+0w
merlin03(25) 38355.185u 53.653s 10:40:24.28 99.96% 0+0k 0+0io
0pf+0w
merlin21(25) 37258.630u 153.113s 10:23:42.03 99.97% 0+0k 0+0io
0pf+0w
merlin06(25) 38954.480u 49.951s 10:50:16.45 99.97% 0+0k 0+0io
0pf+0w
merlin18(25) 38570.612u 166.023s 10:45:49.15 99.97% 0+0k 0+0io
0pf+0w
merlin22(25) 38776.189u 182.754s 10:49:48.24 99.92% 0+0k 0+0io
0pf+0w
merlin24(25) 37065.451u 160.089s 10:20:35.9 99.97% 0+0k 0+0io
0pf+0w
merlin01(25) 39738.144u 77.905s 11:04:15.48 99.90% 0+0k 0+0io
0pf+0w
merlin10(25) 39316.326u 51.954s 10:56:19.00 99.97% 0+0k 0+0io
0pf+0w
merlin19(25) 40010.995u 155.071s 11:09:38.35 99.97% 0+0k 0+0io
0pf+0w
merlin04(25) 38738.890u 52.391s 10:46:42.77 99.97% 0+0k 0+0io
0pf+0w
merlin12(25) 39349.068u 57.359s 10:57:09.63 99.94% 0+0k 0+0io
0pf+0w
merlin07(25) 38638.182u 51.192s 10:45:02.37 99.97% 0+0k 0+0io
0pf+0w
merlin09(25) 39349.753u 120.713s 10:58:03.42 99.97% 0+0k 0+0io
0pf+0w
merlin14(25) 39202.712u 68.919s 10:54:52.43 99.95% 0+0k 0+0io
0pf+0w
Summary of lapw1para:
merlin16 k=25 user=37495.8 wallclock=627
merlin05 k=25 user=40045.3 wallclock=668
merlin13 k=25 user=38207.3 wallclock=638
merlin20 k=25 user=37391.2 wallclock=625
merlin17 k=25 user=37232.9 wallclock=623
merlin15 k=25 user=37528.7 wallclock=628
merlin03 k=25 user=38355.2 wallclock=640
merlin21 k=25 user=37258.6 wallclock=623
merlin06 k=25 user=38954.5 wallclock=650
merlin18 k=25 user=38570.6 wallclock=645
merlin22 k=25 user=38776.2 wallclock=649
merlin24 k=25 user=37065.5 wallclock=620
merlin01 k=25 user=39738.1 wallclock=664
merlin10 k=25 user=39316.3 wallclock=656
merlin19 k=25 user=40011 wallclock=669
merlin04 k=25 user=38738.9 wallclock=646
merlin12 k=25 user=39349.1 wallclock=657
merlin07 k=25 user=38638.2 wallclock=645
merlin09 k=25 user=39349.8 wallclock=658
merlin14 k=25 user=39202.7 wallclock=654
17.098u 72.346s 11:09:57.45 0.2% 0+0k 40+320io 0pf+0w
> lapw1 -c -dn -p (21:15:19) starting parallel lapw1 at Tue Mar 17
21:15:19 CET 2009
-> starting parallel LAPW1 jobs at Tue Mar 17 21:15:20 CET 2009
running LAPW1 in parallel mode (using .machines.help)
20 number_of_parallel_jobs
merlin16(25) 37751.881u 155.731s 10:32:11.94 99.94% 0+0k 0+0io
0pf+0w
merlin05(25) 39472.295u 104.662s 10:59:52.02 99.96% 0+0k 0+0io
0pf+0w
merlin13(25) 37835.579u 55.090s 10:31:56.91 99.93% 0+0k 0+0io
0pf+0w
merlin20(25) 37918.886u 184.326s 10:35:26.09 99.94% 0+0k 0+0io
0pf+0w
merlin17(25) 37265.070u 403.814s 10:28:33.48 99.88% 0+0k 0+0io
0pf+0w
merlin15(25) 37835.077u 157.843s 10:33:24.46 99.97% 0+0k 0+0io
0pf+0w
merlin03(25) 40200.683u 76.920s 11:11:48.85 99.92% 0+0k 0+0io
0pf+0w
merlin21(25) 38055.082u 157.867s 10:37:08.89 99.96% 0+0k 0+0io
0pf+0w
merlin06(25) 38471.013u 59.757s 10:42:41.02 99.92% 0+0k 0+0io
0pf+0w
merlin18(25) 39028.706u 160.356s 10:53:24.49 99.96% 0+0k 0+0io
0pf+0w
merlin22(25) 39672.912u 157.281s 11:04:14.35 99.94% 0+0k 0+0io
0pf+0w
merlin24(25) 37535.676u 244.072s 10:30:33.84 99.86% 0+0k 0+0io
0pf+0w
merlin01(25) 40867.680u 87.081s 11:23:18.53 99.89% 0+0k 0+0io
0pf+0w
merlin10(25) 38712.416u 52.325s 10:46:19.60 99.96% 0+0k 0+0io
0pf+0w
merlin19(25) 38589.740u 161.574s 10:46:03.82 99.97% 0+0k 0+0io
0pf+0w
merlin04(25) 38711.808u 53.267s 10:46:18.66 99.96% 0+0k 0+0io
0pf+0w
merlin12(25) 39539.575u 55.224s 11:16.21 99.95% 0+0k 0+0io
0pf+0w
merlin07(25) 37432.873u 48.500s 10:24:56.60 99.96% 0+0k 0+0io
0pf+0w
merlin09(25) 39396.568u 128.451s 10:59:07.64 99.94% 0+0k 0+0io
0pf+0w
merlin14(25) 39549.837u 79.790s 11:01:06.82 99.91% 0+0k 0+0io
0pf+0w
Summary of lapw1para:
merlin16 k=25 user=37751.9 wallclock=632
merlin05 k=25 user=39472.3 wallclock=659
merlin13 k=25 user=37835.6 wallclock=631
merlin20 k=25 user=37918.9 wallclock=635
merlin17 k=25 user=37265.1 wallclock=628
merlin15 k=25 user=37835.1 wallclock=633
merlin03 k=25 user=40200.7 wallclock=671
merlin21 k=25 user=38055.1 wallclock=637
merlin06 k=25 user=38471 wallclock=642
merlin18 k=25 user=39028.7 wallclock=653
merlin22 k=25 user=39672.9 wallclock=664
merlin24 k=25 user=37535.7 wallclock=630
merlin01 k=25 user=40867.7 wallclock=683
merlin10 k=25 user=38712.4 wallclock=646
merlin19 k=25 user=38589.7 wallclock=646
merlin04 k=25 user=38711.8 wallclock=646
merlin12 k=25 user=39539.6 wallclock=676.21
merlin07 k=25 user=37432.9 wallclock=624
merlin09 k=25 user=39396.6 wallclock=659
merlin14 k=25 user=39549.8 wallclock=661
17.925u 74.156s 11:23:35.72 0.2% 0+0k 0+320io 0pf+0w
> lapw2 -c -up -p (08:38:55) running LAPW2 in parallel mode
merlin16 2510.963u 91.613s 2:46:13.78 26.09% 0+0k 0+0io 0pf+0w
merlin05 2493.665u 79.846s 2:46:09.00 25.82% 0+0k 0+0io 0pf+0w
merlin13 2484.150u 78.296s 2:46:03.32 25.72% 0+0k 0+0io 0pf+0w
merlin20 2479.015u 104.637s 2:46:24.73 25.88% 0+0k 0+0io 0pf+0w
merlin17 2461.499u 103.356s 2:46:18.27 25.70% 0+0k 0+0io 0pf+0w
merlin15 2474.984u 88.321s 2:46:14.32 25.70% 0+0k 0+0io 0pf+0w
merlin03 2544.354u 86.533s 2:47:15.34 26.22% 0+0k 0+0io 0pf+0w
merlin21 2510.726u 105.354s 2:46:52.52 26.13% 0+0k 0+0io 0pf+0w
merlin06 2519.390u 87.796s 2:47:29.42 25.94% 0+0k 0+0io 0pf+0w
merlin18 2529.690u 122.496s 2:47:29.15 26.39% 0+0k 0+0io 0pf+0w
merlin22 2468.111u 114.877s 2:47:14.86 25.74% 0+0k 0+0io 0pf+0w
merlin24 2473.606u 112.375s 2:46:11.20 25.93% 0+0k 0+0io 0pf+0w
merlin01 2495.088u 93.868s 2:47:32.19 25.76% 0+0k 0+0io 0pf+0w
merlin10 2438.887u 77.412s 2:46:59.86 25.11% 0+0k 0+0io 0pf+0w
merlin19 2521.003u 95.387s 2:46:56.37 26.12% 0+0k 0+0io 0pf+0w
merlin04 2484.324u 78.449s 2:46:46.20 25.61% 0+0k 0+0io 0pf+0w
merlin12 2591.029u 95.105s 2:47:18.48 26.76% 0+0k 0+0io 0pf+0w
merlin07 2427.397u 77.359s 2:46:30.55 25.07% 0+0k 0+0io 0pf+0w
merlin09 2443.942u 82.523s 2:46:29.46 25.29% 0+0k 0+0io 0pf+0w
merlin14 2471.389u 89.183s 2:46:57.65 25.56% 0+0k 0+0io 0pf+0w
Summary of lapw2para:
merlin16 user=2510.96 wallclock=166
merlin05 user=2493.66 wallclock=166
merlin13 user=2484.15 wallclock=166
merlin20 user=2479.01 wallclock=166
merlin17 user=2461.5 wallclock=166
merlin15 user=2474.98 wallclock=166
merlin03 user=2544.35 wallclock=167
merlin21 user=2510.73 wallclock=166
merlin06 user=2519.39 wallclock=167
merlin18 user=2529.69 wallclock=167
merlin22 user=2468.11 wallclock=167
merlin24 user=2473.61 wallclock=166
merlin01 user=2495.09 wallclock=167
merlin10 user=2438.89 wallclock=166
merlin19 user=2521 wallclock=166
merlin04 user=2484.32 wallclock=166
merlin12 user=2591.03 wallclock=167
merlin07 user=2427.4 wallclock=166
merlin09 user=2443.94 wallclock=166
merlin14 user=2471.39 wallclock=166
36.088u 7.790s 2:48:38.21 0.4% 0+0k 24+160io 8pf+0w
> lapw2 -c -dn -p (11:27:33) running LAPW2 in parallel mode
merlin16 2156.929u 95.886s 2:45:56.30 22.63% 0+0k 0+0io 0pf+0w
merlin05 2135.272u 84.472s 2:45:27.61 22.36% 0+0k 0+0io 0pf+0w
merlin13 2071.188u 78.480s 2:45:21.64 21.67% 0+0k 0+0io 0pf+0w
merlin20 2151.438u 104.975s 2:46:04.56 22.64% 0+0k 0+0io 0pf+0w
merlin17 2133.444u 97.060s 2:45:30.52 22.46% 0+0k 0+0io 0pf+0w
merlin15 2081.077u 81.041s 2:45:48.49 21.73% 0+0k 0+0io 0pf+0w
merlin03 2137.855u 86.559s 2:46:28.13 22.27% 0+0k 0+0io 0pf+0w
merlin21 2093.567u 101.220s 2:46:17.25 22.00% 0+0k 0+0io 0pf+0w
merlin06 2143.250u 97.554s 2:46:54.45 22.38% 0+0k 0+0io 0pf+0w
merlin18 2084.752u 105.447s 2:46:26.67 21.93% 0+0k 0+0io 0pf+0w
merlin22 2082.295u 99.013s 2:46:43.88 21.80% 0+0k 0+0io 0pf+0w
merlin24 2072.152u 95.022s 2:45:44.16 21.79% 0+0k 0+0io 0pf+0w
merlin01 2132.420u 91.661s 2:46:35.91 22.25% 0+0k 0+0io 0pf+0w
merlin10 2118.587u 94.126s 2:46:39.64 22.13% 0+0k 0+0io 0pf+0w
merlin19 2102.943u 92.078s 2:46:20.96 21.99% 0+0k 0+0io 0pf+0w
merlin04 2089.082u 85.161s 2:46:19.56 21.79% 0+0k 0+0io 0pf+0w
merlin12 2144.932u 87.126s 2:46:19.10 22.37% 0+0k 0+0io 0pf+0w
merlin07 2084.597u 83.871s 2:45:57.54 21.78% 0+0k 0+0io 0pf+0w
merlin09 2051.034u 75.865s 2:45:49.55 21.38% 0+0k 0+0io 0pf+0w
merlin14 2061.305u 87.110s 2:46:01.46 21.57% 0+0k 0+0io 0pf+0w
Summary of lapw2para:
merlin16 user=2156.93 wallclock=165
merlin05 user=2135.27 wallclock=165
merlin13 user=2071.19 wallclock=165
merlin20 user=2151.44 wallclock=166
merlin17 user=2133.44 wallclock=165
merlin15 user=2081.08 wallclock=165
merlin03 user=2137.86 wallclock=166
merlin21 user=2093.57 wallclock=166
merlin06 user=2143.25 wallclock=166
merlin18 user=2084.75 wallclock=166
merlin22 user=2082.3 wallclock=166
merlin24 user=2072.15 wallclock=165
merlin01 user=2132.42 wallclock=166
merlin10 user=2118.59 wallclock=166
merlin19 user=2102.94 wallclock=166
merlin04 user=2089.08 wallclock=166
merlin12 user=2144.93 wallclock=166
merlin07 user=2084.6 wallclock=165
merlin09 user=2051.03 wallclock=165
merlin14 user=2061.3 wallclock=166
36.424u 6.018s 2:47:52.05 0.4% 0+0k 0+160io 0pf+0w
> lcore -up (14:15:25) 1.252u 0.307s 0:01.78 87.0% 0+0k 1744+0io
8pf+0w
> lcore -dn (14:15:27) 1.259u 0.291s 0:01.75 88.0% 0+0k 8+0io 0pf+0w
> mixer (14:15:35) 12.846u 5.332s 0:22.49 80.7% 0+0k 433168+0io
12pf+0w
:ENERGY convergence: 0 0 0
:CHARGE convergence: 0 0.00005 0
cycle 2 (Wed Mar 18 14:15:57 CET 2009) (499/98 to go)
-------------------------------------------------------------------------------------