[Wien] excessive bandwidth usage

Laurence Marks L-marks at northwestern.edu
Thu Apr 2 23:23:44 CEST 2009


For your two questions:
a) Look in the user guide (the PDF is easier to search) for the
SCRATCH environment variable. You may be able to store the
case.vector* files locally on each compute node, if your system is set
up so that each node can reach a local temporary directory.
b) In terms of setting your problem up better, no idea -- we would
need to know more about what it is, e.g. RKmax, number of k-points, ...
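
For (a), a minimal sketch of the idea, assuming each compute node has a
local /tmp (the path is only illustrative -- use whatever local disk your
nodes provide, and make sure the directory exists on every node):

   # csh/tcsh syntax; in bash use "export SCRATCH=..."
   setenv SCRATCH /tmp/$USER      # node-local directory, not on NFS
   mkdir -p $SCRATCH

With SCRATCH on local disk, each lapw1 process writes its case.vector*
files locally and the lapw2 step on the same node reads them back
without going over the shared network filesystem.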

On Thu, Apr 2, 2009 at 10:49 AM, Anne-Christine Uldry
<anne-christine.uldry at psi.ch> wrote:
> Dear Wien2k users,
>
> I am wondering if anyone could comment, or give me some pointers or
> general information, on the way wien2k handles input/output for the
> case.vector* files.
>
> I have recently tried to run -- admittedly oversized -- wien2k calculations
> on our local cluster (details below). By the time my calculations reached
> cycle 2 of the SCF, other users were complaining (quite rightly) that I was
> taking up all the communication bandwidth doing I/O on the several
> gigabytes' worth of case.vector* files.
>
> This particular calculation is probably not appropriate for this cluster,
> and/or badly set up.
>
> My first question, however, is this:
> Is there anything one can do to limit reading/writing of the vector
> files in some instances? In my case it looked as though the vectors
> could be held in memory instead of being written to disk. Are there
> variables or settings that could be adjusted to prevent this
> checkpointing?
>
> My second question, if anyone can be bothered to look at the details, is:
> Could I set up this calculation in a better way at all?
>
>
> I am running wien2k version 09.1 on a fairly standard cluster. It has 24
> compute nodes, each with 8 GB of RAM and two dual-core AMD Opteron
> processors (2.4 GHz), on a one-gigabit interconnect. The operating system
> is Scientific Linux 4. Maybe I should also mention that I have set ulimit
> -v to 2000000 and ulimit -m to 2097152. OMP_NUM_THREADS is set to 1, and
> SCRATCH points to our scratch disk (I don't know its size, but I have
> never had a storage problem there).
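>
> For reference, these are plain bash limits set in the shell beforehand,
> both in kilobytes:
>
>     ulimit -v 2000000     # cap virtual memory at roughly 2 GB
>     ulimit -m 2097152     # cap resident set size at 2 GB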
> I compiled for k-point parallelisation only, using the 64-bit Intel
> compiler 10.0 and the Intel MKL libraries. NMATMAX was set to 13000 and
> NUME to 1000.
> The system I looked at has 54 independent magnetic atoms in a 4129 au^3
> supercell (no inversion symmetry). The matrix size was 6199 and I had 500
> k-points. I requested 20 slots (out of a maximum of 32). The .machines
> file looked like this:
>
> granularity:1
>
> plus 20 lines like this:
>
> 1:merlinXX:1
>
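> Spelled out (a sketch; the node names below are the ones the queueing
> system actually assigned), the whole file was of the form:
>
>     granularity:1
>     1:merlin01:1
>     1:merlin03:1
>     ...
>     1:merlin24:1
>
> i.e. one line per requested slot, all with equal weight.
>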
> The command issued was "runsp_lapw -p".
> The case.dayfile for the first cycle is reproduced below. Note that
> during lapw1 -c the processors appear to run at close to 100 percent
> utilization, while in lapw2 -c I see more like 20 percent.
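>
> A rough estimate of the volume involved, assuming each stored
> eigenvector costs matrix-size complex double-precision words: 6199 x
> 1000 (NUME) x 16 bytes is about 0.1 GB per k-point and spin, so the 25
> k-points on each node come to roughly 2.5 GB of case.vector* data per
> spin per cycle -- on the order of 100 GB over 20 nodes and both spins.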
>
>
> Many thanks and best wishes,
> Anne-Christine Uldry
>
>
>
>
> -------------------------case.dayfile------------------------------------
>
>
>
>     start      (Tue Mar 17 09:59:33 CET 2009) with lapw0 (500/99 to go)
>
>     cycle 1    (Tue Mar 17 09:59:33 CET 2009)  (500/99 to go)
>
>>   lapw0 -p    (09:59:33) starting parallel lapw0 at Tue Mar 17 09:59:33 CET 2009
> -------- .machine0 : processors
> running lapw0 in single mode
> 343.032u 3.523s 5:48.37 99.4%   0+0k 24+0io 13pf+0w
>>   lapw1  -c -up -p    (10:05:22) starting parallel lapw1 at Tue Mar 17 10:05:22 CET 2009
> ->  starting parallel LAPW1 jobs at Tue Mar 17 10:05:23 CET 2009
> running LAPW1 in parallel mode (using .machines)
> 20 number_of_parallel_jobs
>      merlin16(25) 37495.767u 154.792s 10:27:55.51 99.93%      0+0k 0+0io 0pf+0w
>      merlin05(25) 40045.286u 55.558s 11:08:35.54 99.96%      0+0k 0+0io 0pf+0w
>      merlin13(25) 38207.258u 55.456s 10:38:03.97 99.94%      0+0k 0+0io 0pf+0w
>      merlin20(25) 37391.182u 153.745s 10:25:58.52 99.96%      0+0k 0+0io 0pf+0w
>      merlin17(25) 37232.878u 147.786s 10:23:11.98 99.97%      0+0k 0+0io 0pf+0w
>      merlin15(25) 37528.695u 153.319s 10:28:14.53 99.97%      0+0k 0+0io 0pf+0w
>      merlin03(25) 38355.185u 53.653s 10:40:24.28 99.96%      0+0k 0+0io 0pf+0w
>      merlin21(25) 37258.630u 153.113s 10:23:42.03 99.97%      0+0k 0+0io 0pf+0w
>      merlin06(25) 38954.480u 49.951s 10:50:16.45 99.97%      0+0k 0+0io 0pf+0w
>      merlin18(25) 38570.612u 166.023s 10:45:49.15 99.97%      0+0k 0+0io 0pf+0w
>      merlin22(25) 38776.189u 182.754s 10:49:48.24 99.92%      0+0k 0+0io 0pf+0w
>      merlin24(25) 37065.451u 160.089s 10:20:35.9 99.97%      0+0k 0+0io 0pf+0w
>      merlin01(25) 39738.144u 77.905s 11:04:15.48 99.90%      0+0k 0+0io 0pf+0w
>      merlin10(25) 39316.326u 51.954s 10:56:19.00 99.97%      0+0k 0+0io 0pf+0w
>      merlin19(25) 40010.995u 155.071s 11:09:38.35 99.97%      0+0k 0+0io 0pf+0w
>      merlin04(25) 38738.890u 52.391s 10:46:42.77 99.97%      0+0k 0+0io 0pf+0w
>      merlin12(25) 39349.068u 57.359s 10:57:09.63 99.94%      0+0k 0+0io 0pf+0w
>      merlin07(25) 38638.182u 51.192s 10:45:02.37 99.97%      0+0k 0+0io 0pf+0w
>      merlin09(25) 39349.753u 120.713s 10:58:03.42 99.97%      0+0k 0+0io 0pf+0w
>      merlin14(25) 39202.712u 68.919s 10:54:52.43 99.95%      0+0k 0+0io 0pf+0w
>    Summary of lapw1para:
>    merlin16     k=25    user=37495.8    wallclock=627
>    merlin05     k=25    user=40045.3    wallclock=668
>    merlin13     k=25    user=38207.3    wallclock=638
>    merlin20     k=25    user=37391.2    wallclock=625
>    merlin17     k=25    user=37232.9    wallclock=623
>    merlin15     k=25    user=37528.7    wallclock=628
>    merlin03     k=25    user=38355.2    wallclock=640
>    merlin21     k=25    user=37258.6    wallclock=623
>    merlin06     k=25    user=38954.5    wallclock=650
>    merlin18     k=25    user=38570.6    wallclock=645
>    merlin22     k=25    user=38776.2    wallclock=649
>    merlin24     k=25    user=37065.5    wallclock=620
>    merlin01     k=25    user=39738.1    wallclock=664
>    merlin10     k=25    user=39316.3    wallclock=656
>    merlin19     k=25    user=40011      wallclock=669
>    merlin04     k=25    user=38738.9    wallclock=646
>    merlin12     k=25    user=39349.1    wallclock=657
>    merlin07     k=25    user=38638.2    wallclock=645
>    merlin09     k=25    user=39349.8    wallclock=658
>    merlin14     k=25    user=39202.7    wallclock=654
> 17.098u 72.346s 11:09:57.45 0.2%        0+0k 40+320io 0pf+0w
>>   lapw1  -c -dn -p    (21:15:19) starting parallel lapw1 at Tue Mar 17 21:15:19 CET 2009
> ->  starting parallel LAPW1 jobs at Tue Mar 17 21:15:20 CET 2009
> running LAPW1 in parallel mode (using .machines.help)
> 20 number_of_parallel_jobs
>      merlin16(25) 37751.881u 155.731s 10:32:11.94 99.94%      0+0k 0+0io 0pf+0w
>      merlin05(25) 39472.295u 104.662s 10:59:52.02 99.96%      0+0k 0+0io 0pf+0w
>      merlin13(25) 37835.579u 55.090s 10:31:56.91 99.93%      0+0k 0+0io 0pf+0w
>      merlin20(25) 37918.886u 184.326s 10:35:26.09 99.94%      0+0k 0+0io 0pf+0w
>      merlin17(25) 37265.070u 403.814s 10:28:33.48 99.88%      0+0k 0+0io 0pf+0w
>      merlin15(25) 37835.077u 157.843s 10:33:24.46 99.97%      0+0k 0+0io 0pf+0w
>      merlin03(25) 40200.683u 76.920s 11:11:48.85 99.92%      0+0k 0+0io 0pf+0w
>      merlin21(25) 38055.082u 157.867s 10:37:08.89 99.96%      0+0k 0+0io 0pf+0w
>      merlin06(25) 38471.013u 59.757s 10:42:41.02 99.92%      0+0k 0+0io 0pf+0w
>      merlin18(25) 39028.706u 160.356s 10:53:24.49 99.96%      0+0k 0+0io 0pf+0w
>      merlin22(25) 39672.912u 157.281s 11:04:14.35 99.94%      0+0k 0+0io 0pf+0w
>      merlin24(25) 37535.676u 244.072s 10:30:33.84 99.86%      0+0k 0+0io 0pf+0w
>      merlin01(25) 40867.680u 87.081s 11:23:18.53 99.89%      0+0k 0+0io 0pf+0w
>      merlin10(25) 38712.416u 52.325s 10:46:19.60 99.96%      0+0k 0+0io 0pf+0w
>      merlin19(25) 38589.740u 161.574s 10:46:03.82 99.97%      0+0k 0+0io 0pf+0w
>      merlin04(25) 38711.808u 53.267s 10:46:18.66 99.96%      0+0k 0+0io 0pf+0w
>      merlin12(25) 39539.575u 55.224s 11:16.21 99.95%      0+0k 0+0io 0pf+0w
>      merlin07(25) 37432.873u 48.500s 10:24:56.60 99.96%      0+0k 0+0io 0pf+0w
>      merlin09(25) 39396.568u 128.451s 10:59:07.64 99.94%      0+0k 0+0io 0pf+0w
>      merlin14(25) 39549.837u 79.790s 11:01:06.82 99.91%      0+0k 0+0io 0pf+0w
>    Summary of lapw1para:
>    merlin16     k=25    user=37751.9    wallclock=632
>    merlin05     k=25    user=39472.3    wallclock=659
>    merlin13     k=25    user=37835.6    wallclock=631
>    merlin20     k=25    user=37918.9    wallclock=635
>    merlin17     k=25    user=37265.1    wallclock=628
>    merlin15     k=25    user=37835.1    wallclock=633
>    merlin03     k=25    user=40200.7    wallclock=671
>    merlin21     k=25    user=38055.1    wallclock=637
>    merlin06     k=25    user=38471      wallclock=642
>    merlin18     k=25    user=39028.7    wallclock=653
>    merlin22     k=25    user=39672.9    wallclock=664
>    merlin24     k=25    user=37535.7    wallclock=630
>    merlin01     k=25    user=40867.7    wallclock=683
>    merlin10     k=25    user=38712.4    wallclock=646
>    merlin19     k=25    user=38589.7    wallclock=646
>    merlin04     k=25    user=38711.8    wallclock=646
>    merlin12     k=25    user=39539.6    wallclock=676.21
>    merlin07     k=25    user=37432.9    wallclock=624
>    merlin09     k=25    user=39396.6    wallclock=659
>    merlin14     k=25    user=39549.8    wallclock=661
> 17.925u 74.156s 11:23:35.72 0.2%        0+0k 0+320io 0pf+0w
>>   lapw2 -c -up  -p    (08:38:55) running LAPW2 in parallel mode
>       merlin16 2510.963u 91.613s 2:46:13.78 26.09% 0+0k 0+0io 0pf+0w
>       merlin05 2493.665u 79.846s 2:46:09.00 25.82% 0+0k 0+0io 0pf+0w
>       merlin13 2484.150u 78.296s 2:46:03.32 25.72% 0+0k 0+0io 0pf+0w
>       merlin20 2479.015u 104.637s 2:46:24.73 25.88% 0+0k 0+0io 0pf+0w
>       merlin17 2461.499u 103.356s 2:46:18.27 25.70% 0+0k 0+0io 0pf+0w
>       merlin15 2474.984u 88.321s 2:46:14.32 25.70% 0+0k 0+0io 0pf+0w
>       merlin03 2544.354u 86.533s 2:47:15.34 26.22% 0+0k 0+0io 0pf+0w
>       merlin21 2510.726u 105.354s 2:46:52.52 26.13% 0+0k 0+0io 0pf+0w
>       merlin06 2519.390u 87.796s 2:47:29.42 25.94% 0+0k 0+0io 0pf+0w
>       merlin18 2529.690u 122.496s 2:47:29.15 26.39% 0+0k 0+0io 0pf+0w
>       merlin22 2468.111u 114.877s 2:47:14.86 25.74% 0+0k 0+0io 0pf+0w
>       merlin24 2473.606u 112.375s 2:46:11.20 25.93% 0+0k 0+0io 0pf+0w
>       merlin01 2495.088u 93.868s 2:47:32.19 25.76% 0+0k 0+0io 0pf+0w
>       merlin10 2438.887u 77.412s 2:46:59.86 25.11% 0+0k 0+0io 0pf+0w
>       merlin19 2521.003u 95.387s 2:46:56.37 26.12% 0+0k 0+0io 0pf+0w
>       merlin04 2484.324u 78.449s 2:46:46.20 25.61% 0+0k 0+0io 0pf+0w
>       merlin12 2591.029u 95.105s 2:47:18.48 26.76% 0+0k 0+0io 0pf+0w
>       merlin07 2427.397u 77.359s 2:46:30.55 25.07% 0+0k 0+0io 0pf+0w
>       merlin09 2443.942u 82.523s 2:46:29.46 25.29% 0+0k 0+0io 0pf+0w
>       merlin14 2471.389u 89.183s 2:46:57.65 25.56% 0+0k 0+0io 0pf+0w
>    Summary of lapw2para:
>    merlin16     user=2510.96    wallclock=166
>    merlin05     user=2493.66    wallclock=166
>    merlin13     user=2484.15    wallclock=166
>    merlin20     user=2479.01    wallclock=166
>    merlin17     user=2461.5     wallclock=166
>    merlin15     user=2474.98    wallclock=166
>    merlin03     user=2544.35    wallclock=167
>    merlin21     user=2510.73    wallclock=166
>    merlin06     user=2519.39    wallclock=167
>    merlin18     user=2529.69    wallclock=167
>    merlin22     user=2468.11    wallclock=167
>    merlin24     user=2473.61    wallclock=166
>    merlin01     user=2495.09    wallclock=167
>    merlin10     user=2438.89    wallclock=166
>    merlin19     user=2521       wallclock=166
>    merlin04     user=2484.32    wallclock=166
>    merlin12     user=2591.03    wallclock=167
>    merlin07     user=2427.4     wallclock=166
>    merlin09     user=2443.94    wallclock=166
>    merlin14     user=2471.39    wallclock=166
> 36.088u 7.790s 2:48:38.21 0.4%  0+0k 24+160io 8pf+0w
>>   lapw2 -c -dn  -p    (11:27:33) running LAPW2 in parallel mode
>       merlin16 2156.929u 95.886s 2:45:56.30 22.63% 0+0k 0+0io 0pf+0w
>       merlin05 2135.272u 84.472s 2:45:27.61 22.36% 0+0k 0+0io 0pf+0w
>       merlin13 2071.188u 78.480s 2:45:21.64 21.67% 0+0k 0+0io 0pf+0w
>       merlin20 2151.438u 104.975s 2:46:04.56 22.64% 0+0k 0+0io 0pf+0w
>       merlin17 2133.444u 97.060s 2:45:30.52 22.46% 0+0k 0+0io 0pf+0w
>       merlin15 2081.077u 81.041s 2:45:48.49 21.73% 0+0k 0+0io 0pf+0w
>       merlin03 2137.855u 86.559s 2:46:28.13 22.27% 0+0k 0+0io 0pf+0w
>       merlin21 2093.567u 101.220s 2:46:17.25 22.00% 0+0k 0+0io 0pf+0w
>       merlin06 2143.250u 97.554s 2:46:54.45 22.38% 0+0k 0+0io 0pf+0w
>       merlin18 2084.752u 105.447s 2:46:26.67 21.93% 0+0k 0+0io 0pf+0w
>       merlin22 2082.295u 99.013s 2:46:43.88 21.80% 0+0k 0+0io 0pf+0w
>       merlin24 2072.152u 95.022s 2:45:44.16 21.79% 0+0k 0+0io 0pf+0w
>       merlin01 2132.420u 91.661s 2:46:35.91 22.25% 0+0k 0+0io 0pf+0w
>       merlin10 2118.587u 94.126s 2:46:39.64 22.13% 0+0k 0+0io 0pf+0w
>       merlin19 2102.943u 92.078s 2:46:20.96 21.99% 0+0k 0+0io 0pf+0w
>       merlin04 2089.082u 85.161s 2:46:19.56 21.79% 0+0k 0+0io 0pf+0w
>       merlin12 2144.932u 87.126s 2:46:19.10 22.37% 0+0k 0+0io 0pf+0w
>       merlin07 2084.597u 83.871s 2:45:57.54 21.78% 0+0k 0+0io 0pf+0w
>       merlin09 2051.034u 75.865s 2:45:49.55 21.38% 0+0k 0+0io 0pf+0w
>       merlin14 2061.305u 87.110s 2:46:01.46 21.57% 0+0k 0+0io 0pf+0w
>    Summary of lapw2para:
>    merlin16     user=2156.93    wallclock=165
>    merlin05     user=2135.27    wallclock=165
>    merlin13     user=2071.19    wallclock=165
>    merlin20     user=2151.44    wallclock=166
>    merlin17     user=2133.44    wallclock=165
>    merlin15     user=2081.08    wallclock=165
>    merlin03     user=2137.86    wallclock=166
>    merlin21     user=2093.57    wallclock=166
>    merlin06     user=2143.25    wallclock=166
>    merlin18     user=2084.75    wallclock=166
>    merlin22     user=2082.3     wallclock=166
>    merlin24     user=2072.15    wallclock=165
>    merlin01     user=2132.42    wallclock=166
>    merlin10     user=2118.59    wallclock=166
>    merlin19     user=2102.94    wallclock=166
>    merlin04     user=2089.08    wallclock=166
>    merlin12     user=2144.93    wallclock=166
>    merlin07     user=2084.6     wallclock=165
>    merlin09     user=2051.03    wallclock=165
>    merlin14     user=2061.3     wallclock=166
> 36.424u 6.018s 2:47:52.05 0.4%  0+0k 0+160io 0pf+0w
>>   lcore -up   (14:15:25) 1.252u 0.307s 0:01.78 87.0%  0+0k 1744+0io 8pf+0w
>>   lcore -dn   (14:15:27) 1.259u 0.291s 0:01.75 88.0%  0+0k 8+0io 0pf+0w
>>   mixer       (14:15:35) 12.846u 5.332s 0:22.49 80.7% 0+0k 433168+0io 12pf+0w
> :ENERGY convergence:  0 0 0
> :CHARGE convergence:  0 0.00005 0
>
>     cycle 2    (Wed Mar 18 14:15:57 CET 2009)  (499/98 to go)
>
> -------------------------------------------------------------------------------------



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering to study the structure of matter.

