[Wien] excessive bandwidth usage

Anne-Christine Uldry anne-christine.uldry at psi.ch
Thu Apr 2 17:49:57 CEST 2009


Dear Wien2k users,

I am wondering if anyone could comment, or give me some pointers or general 
information, on the way wien2k handles input/output for the 
case.vector* files.

I have recently tried to run (admittedly oversized) wien2k calculations on 
our local cluster (details below). By the time my calculations reached 
cycle 2 of the SCF, other users were complaining (and quite rightly so) 
that I was taking up all the communication bandwidth doing I/O on the 
several gigabytes' worth of case.vector* files.

This particular calculation is probably not appropriate for this cluster, 
and/or badly set up.

My first question is, however, this:
Is there anything one can do to limit reading and writing of the 
vector files in some cases? In my case it looked as though the vectors 
could have been held in memory instead of being written to disk. Are there 
variables or settings that could be adjusted to prevent this checkpointing?

My second question, if anyone can be bothered to look at the details, is:
Could I have set this calculation up in a better way at all?


I am running wien2k version 09.1 on a fairly standard cluster. It has 24 
compute nodes, each with 8 GB of RAM and two dual-core AMD Opterons (2.4 
GHz), and a one-gigabit interconnect. The operating system is Scientific 
Linux 4. Maybe I should also mention that I have set ulimit -v to 
2000000 and ulimit -m to 2097152. OMP_NUM_THREADS is set to 1, and SCRATCH 
points to our scratch disk (I don't know its size, but I have never had a 
storage problem there).
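
In case it is relevant, those settings amount to something like the 
following in the job script (bash syntax; the scratch path below is just a 
placeholder for our actual scratch disk):

    ulimit -v 2000000               # virtual memory limit, in kB
    ulimit -m 2097152               # resident memory limit, in kB
    export OMP_NUM_THREADS=1        # no threading inside MKL/lapw1
    export SCRATCH=/scratch/$USER   # vector files etc. go to the scratch disk
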
I compiled for k-point parallelisation only, using the Intel compiler 10.0 
(64 bit) and the Intel MKL libraries. NMATMAX was set to 13000 and NUME to 
1000.
The system I looked at has 54 independent magnetic atoms in a 4129 au^3 
supercell (no inversion symmetry). The matrix size was 6199 and I had 500 
k-points. I requested 20 slots (out of a maximum of 32). The .machines 
file looked like this:

granularity:1

plus 20 lines like this:

1:merlinXX:1

The command issued was "runsp_lapw -p".
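
In case the exact layout matters, here is a sketch of how the .machines 
file was put together (the file "hosts.list" just stands for whatever list 
of 20 node names the queueing system hands out):

    # k-point parallelisation only: one line per slot, one k-point group each
    echo "granularity:1" > .machines
    for host in $(cat hosts.list); do
        echo "1:${host}:1" >> .machines
    done
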
The case.dayfile for the first cycle is reproduced below. Note that during 
lapw1 -c the processors seem to be used at close to 100 percent, while 
during lapw2 -c I get more like 20 percent.
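
For what it's worth, a rough estimate of the vector file traffic with these 
parameters, assuming the eigenvectors are stored as complex double 
precision (16 bytes per element) and taking NUME as an upper bound on the 
number of bands written per k-point:

    6199 (matrix size) x 1000 (NUME) x 16 bytes   ~ 0.1 GB per k-point
    x 25 k-points per node                        ~ 2.5 GB per node and spin

so writing and then reading back both spin channels would indeed move 
several gigabytes per node in each cycle.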


Many thanks and best wishes,
Anne-Christine Uldry




-------------------------case.dayfile------------------------------------



     start 	(Tue Mar 17 09:59:33 CET 2009) with lapw0 (500/99 to go)

     cycle 1 	(Tue Mar 17 09:59:33 CET 2009) 	(500/99 to go)

>   lapw0 -p	(09:59:33) starting parallel lapw0 at Tue Mar 17 09:59:33 
CET 2009
-------- .machine0 : processors
running lapw0 in single mode
343.032u 3.523s 5:48.37 99.4%	0+0k 24+0io 13pf+0w
>   lapw1  -c -up -p  	(10:05:22) starting parallel lapw1 at Tue Mar 17 
10:05:22 CET 2009
->  starting parallel LAPW1 jobs at Tue Mar 17 10:05:23 CET 2009
running LAPW1 in parallel mode (using .machines)
20 number_of_parallel_jobs
      merlin16(25) 37495.767u 154.792s 10:27:55.51 99.93%      0+0k 0+0io 
0pf+0w
      merlin05(25) 40045.286u 55.558s 11:08:35.54 99.96%      0+0k 0+0io 
0pf+0w
      merlin13(25) 38207.258u 55.456s 10:38:03.97 99.94%      0+0k 0+0io 
0pf+0w
      merlin20(25) 37391.182u 153.745s 10:25:58.52 99.96%      0+0k 0+0io 
0pf+0w
      merlin17(25) 37232.878u 147.786s 10:23:11.98 99.97%      0+0k 0+0io 
0pf+0w
      merlin15(25) 37528.695u 153.319s 10:28:14.53 99.97%      0+0k 0+0io 
0pf+0w
      merlin03(25) 38355.185u 53.653s 10:40:24.28 99.96%      0+0k 0+0io 
0pf+0w
      merlin21(25) 37258.630u 153.113s 10:23:42.03 99.97%      0+0k 0+0io 
0pf+0w
      merlin06(25) 38954.480u 49.951s 10:50:16.45 99.97%      0+0k 0+0io 
0pf+0w
      merlin18(25) 38570.612u 166.023s 10:45:49.15 99.97%      0+0k 0+0io 
0pf+0w
      merlin22(25) 38776.189u 182.754s 10:49:48.24 99.92%      0+0k 0+0io 
0pf+0w
      merlin24(25) 37065.451u 160.089s 10:20:35.9 99.97%      0+0k 0+0io 
0pf+0w
      merlin01(25) 39738.144u 77.905s 11:04:15.48 99.90%      0+0k 0+0io 
0pf+0w
      merlin10(25) 39316.326u 51.954s 10:56:19.00 99.97%      0+0k 0+0io 
0pf+0w
      merlin19(25) 40010.995u 155.071s 11:09:38.35 99.97%      0+0k 0+0io 
0pf+0w
      merlin04(25) 38738.890u 52.391s 10:46:42.77 99.97%      0+0k 0+0io 
0pf+0w
      merlin12(25) 39349.068u 57.359s 10:57:09.63 99.94%      0+0k 0+0io 
0pf+0w
      merlin07(25) 38638.182u 51.192s 10:45:02.37 99.97%      0+0k 0+0io 
0pf+0w
      merlin09(25) 39349.753u 120.713s 10:58:03.42 99.97%      0+0k 0+0io 
0pf+0w
      merlin14(25) 39202.712u 68.919s 10:54:52.43 99.95%      0+0k 0+0io 
0pf+0w
    Summary of lapw1para:
    merlin16	 k=25	 user=37495.8	 wallclock=627
    merlin05	 k=25	 user=40045.3	 wallclock=668
    merlin13	 k=25	 user=38207.3	 wallclock=638
    merlin20	 k=25	 user=37391.2	 wallclock=625
    merlin17	 k=25	 user=37232.9	 wallclock=623
    merlin15	 k=25	 user=37528.7	 wallclock=628
    merlin03	 k=25	 user=38355.2	 wallclock=640
    merlin21	 k=25	 user=37258.6	 wallclock=623
    merlin06	 k=25	 user=38954.5	 wallclock=650
    merlin18	 k=25	 user=38570.6	 wallclock=645
    merlin22	 k=25	 user=38776.2	 wallclock=649
    merlin24	 k=25	 user=37065.5	 wallclock=620
    merlin01	 k=25	 user=39738.1	 wallclock=664
    merlin10	 k=25	 user=39316.3	 wallclock=656
    merlin19	 k=25	 user=40011	 wallclock=669
    merlin04	 k=25	 user=38738.9	 wallclock=646
    merlin12	 k=25	 user=39349.1	 wallclock=657
    merlin07	 k=25	 user=38638.2	 wallclock=645
    merlin09	 k=25	 user=39349.8	 wallclock=658
    merlin14	 k=25	 user=39202.7	 wallclock=654
17.098u 72.346s 11:09:57.45 0.2%	0+0k 40+320io 0pf+0w
>   lapw1  -c -dn -p  	(21:15:19) starting parallel lapw1 at Tue Mar 17 
21:15:19 CET 2009
->  starting parallel LAPW1 jobs at Tue Mar 17 21:15:20 CET 2009
running LAPW1 in parallel mode (using .machines.help)
20 number_of_parallel_jobs
      merlin16(25) 37751.881u 155.731s 10:32:11.94 99.94%      0+0k 0+0io 
0pf+0w
      merlin05(25) 39472.295u 104.662s 10:59:52.02 99.96%      0+0k 0+0io 
0pf+0w
      merlin13(25) 37835.579u 55.090s 10:31:56.91 99.93%      0+0k 0+0io 
0pf+0w
      merlin20(25) 37918.886u 184.326s 10:35:26.09 99.94%      0+0k 0+0io 
0pf+0w
      merlin17(25) 37265.070u 403.814s 10:28:33.48 99.88%      0+0k 0+0io 
0pf+0w
      merlin15(25) 37835.077u 157.843s 10:33:24.46 99.97%      0+0k 0+0io 
0pf+0w
      merlin03(25) 40200.683u 76.920s 11:11:48.85 99.92%      0+0k 0+0io 
0pf+0w
      merlin21(25) 38055.082u 157.867s 10:37:08.89 99.96%      0+0k 0+0io 
0pf+0w
      merlin06(25) 38471.013u 59.757s 10:42:41.02 99.92%      0+0k 0+0io 
0pf+0w
      merlin18(25) 39028.706u 160.356s 10:53:24.49 99.96%      0+0k 0+0io 
0pf+0w
      merlin22(25) 39672.912u 157.281s 11:04:14.35 99.94%      0+0k 0+0io 
0pf+0w
      merlin24(25) 37535.676u 244.072s 10:30:33.84 99.86%      0+0k 0+0io 
0pf+0w
      merlin01(25) 40867.680u 87.081s 11:23:18.53 99.89%      0+0k 0+0io 
0pf+0w
      merlin10(25) 38712.416u 52.325s 10:46:19.60 99.96%      0+0k 0+0io 
0pf+0w
      merlin19(25) 38589.740u 161.574s 10:46:03.82 99.97%      0+0k 0+0io 
0pf+0w
      merlin04(25) 38711.808u 53.267s 10:46:18.66 99.96%      0+0k 0+0io 
0pf+0w
      merlin12(25) 39539.575u 55.224s 11:16.21 99.95%      0+0k 0+0io 
0pf+0w
      merlin07(25) 37432.873u 48.500s 10:24:56.60 99.96%      0+0k 0+0io 
0pf+0w
      merlin09(25) 39396.568u 128.451s 10:59:07.64 99.94%      0+0k 0+0io 
0pf+0w
      merlin14(25) 39549.837u 79.790s 11:01:06.82 99.91%      0+0k 0+0io 
0pf+0w
    Summary of lapw1para:
    merlin16	 k=25	 user=37751.9	 wallclock=632
    merlin05	 k=25	 user=39472.3	 wallclock=659
    merlin13	 k=25	 user=37835.6	 wallclock=631
    merlin20	 k=25	 user=37918.9	 wallclock=635
    merlin17	 k=25	 user=37265.1	 wallclock=628
    merlin15	 k=25	 user=37835.1	 wallclock=633
    merlin03	 k=25	 user=40200.7	 wallclock=671
    merlin21	 k=25	 user=38055.1	 wallclock=637
    merlin06	 k=25	 user=38471	 wallclock=642
    merlin18	 k=25	 user=39028.7	 wallclock=653
    merlin22	 k=25	 user=39672.9	 wallclock=664
    merlin24	 k=25	 user=37535.7	 wallclock=630
    merlin01	 k=25	 user=40867.7	 wallclock=683
    merlin10	 k=25	 user=38712.4	 wallclock=646
    merlin19	 k=25	 user=38589.7	 wallclock=646
    merlin04	 k=25	 user=38711.8	 wallclock=646
    merlin12	 k=25	 user=39539.6	 wallclock=676.21
    merlin07	 k=25	 user=37432.9	 wallclock=624
    merlin09	 k=25	 user=39396.6	 wallclock=659
    merlin14	 k=25	 user=39549.8	 wallclock=661
17.925u 74.156s 11:23:35.72 0.2%	0+0k 0+320io 0pf+0w
>   lapw2 -c -up  -p 	(08:38:55) running LAPW2 in parallel mode
       merlin16 2510.963u 91.613s 2:46:13.78 26.09% 0+0k 0+0io 0pf+0w
       merlin05 2493.665u 79.846s 2:46:09.00 25.82% 0+0k 0+0io 0pf+0w
       merlin13 2484.150u 78.296s 2:46:03.32 25.72% 0+0k 0+0io 0pf+0w
       merlin20 2479.015u 104.637s 2:46:24.73 25.88% 0+0k 0+0io 0pf+0w
       merlin17 2461.499u 103.356s 2:46:18.27 25.70% 0+0k 0+0io 0pf+0w
       merlin15 2474.984u 88.321s 2:46:14.32 25.70% 0+0k 0+0io 0pf+0w
       merlin03 2544.354u 86.533s 2:47:15.34 26.22% 0+0k 0+0io 0pf+0w
       merlin21 2510.726u 105.354s 2:46:52.52 26.13% 0+0k 0+0io 0pf+0w
       merlin06 2519.390u 87.796s 2:47:29.42 25.94% 0+0k 0+0io 0pf+0w
       merlin18 2529.690u 122.496s 2:47:29.15 26.39% 0+0k 0+0io 0pf+0w
       merlin22 2468.111u 114.877s 2:47:14.86 25.74% 0+0k 0+0io 0pf+0w
       merlin24 2473.606u 112.375s 2:46:11.20 25.93% 0+0k 0+0io 0pf+0w
       merlin01 2495.088u 93.868s 2:47:32.19 25.76% 0+0k 0+0io 0pf+0w
       merlin10 2438.887u 77.412s 2:46:59.86 25.11% 0+0k 0+0io 0pf+0w
       merlin19 2521.003u 95.387s 2:46:56.37 26.12% 0+0k 0+0io 0pf+0w
       merlin04 2484.324u 78.449s 2:46:46.20 25.61% 0+0k 0+0io 0pf+0w
       merlin12 2591.029u 95.105s 2:47:18.48 26.76% 0+0k 0+0io 0pf+0w
       merlin07 2427.397u 77.359s 2:46:30.55 25.07% 0+0k 0+0io 0pf+0w
       merlin09 2443.942u 82.523s 2:46:29.46 25.29% 0+0k 0+0io 0pf+0w
       merlin14 2471.389u 89.183s 2:46:57.65 25.56% 0+0k 0+0io 0pf+0w
    Summary of lapw2para:
    merlin16	 user=2510.96	 wallclock=166
    merlin05	 user=2493.66	 wallclock=166
    merlin13	 user=2484.15	 wallclock=166
    merlin20	 user=2479.01	 wallclock=166
    merlin17	 user=2461.5	 wallclock=166
    merlin15	 user=2474.98	 wallclock=166
    merlin03	 user=2544.35	 wallclock=167
    merlin21	 user=2510.73	 wallclock=166
    merlin06	 user=2519.39	 wallclock=167
    merlin18	 user=2529.69	 wallclock=167
    merlin22	 user=2468.11	 wallclock=167
    merlin24	 user=2473.61	 wallclock=166
    merlin01	 user=2495.09	 wallclock=167
    merlin10	 user=2438.89	 wallclock=166
    merlin19	 user=2521	 wallclock=166
    merlin04	 user=2484.32	 wallclock=166
    merlin12	 user=2591.03	 wallclock=167
    merlin07	 user=2427.4	 wallclock=166
    merlin09	 user=2443.94	 wallclock=166
    merlin14	 user=2471.39	 wallclock=166
36.088u 7.790s 2:48:38.21 0.4%	0+0k 24+160io 8pf+0w
>   lapw2 -c -dn  -p 	(11:27:33) running LAPW2 in parallel mode
       merlin16 2156.929u 95.886s 2:45:56.30 22.63% 0+0k 0+0io 0pf+0w
       merlin05 2135.272u 84.472s 2:45:27.61 22.36% 0+0k 0+0io 0pf+0w
       merlin13 2071.188u 78.480s 2:45:21.64 21.67% 0+0k 0+0io 0pf+0w
       merlin20 2151.438u 104.975s 2:46:04.56 22.64% 0+0k 0+0io 0pf+0w
       merlin17 2133.444u 97.060s 2:45:30.52 22.46% 0+0k 0+0io 0pf+0w
       merlin15 2081.077u 81.041s 2:45:48.49 21.73% 0+0k 0+0io 0pf+0w
       merlin03 2137.855u 86.559s 2:46:28.13 22.27% 0+0k 0+0io 0pf+0w
       merlin21 2093.567u 101.220s 2:46:17.25 22.00% 0+0k 0+0io 0pf+0w
       merlin06 2143.250u 97.554s 2:46:54.45 22.38% 0+0k 0+0io 0pf+0w
       merlin18 2084.752u 105.447s 2:46:26.67 21.93% 0+0k 0+0io 0pf+0w
       merlin22 2082.295u 99.013s 2:46:43.88 21.80% 0+0k 0+0io 0pf+0w
       merlin24 2072.152u 95.022s 2:45:44.16 21.79% 0+0k 0+0io 0pf+0w
       merlin01 2132.420u 91.661s 2:46:35.91 22.25% 0+0k 0+0io 0pf+0w
       merlin10 2118.587u 94.126s 2:46:39.64 22.13% 0+0k 0+0io 0pf+0w
       merlin19 2102.943u 92.078s 2:46:20.96 21.99% 0+0k 0+0io 0pf+0w
       merlin04 2089.082u 85.161s 2:46:19.56 21.79% 0+0k 0+0io 0pf+0w
       merlin12 2144.932u 87.126s 2:46:19.10 22.37% 0+0k 0+0io 0pf+0w
       merlin07 2084.597u 83.871s 2:45:57.54 21.78% 0+0k 0+0io 0pf+0w
       merlin09 2051.034u 75.865s 2:45:49.55 21.38% 0+0k 0+0io 0pf+0w
       merlin14 2061.305u 87.110s 2:46:01.46 21.57% 0+0k 0+0io 0pf+0w
    Summary of lapw2para:
    merlin16	 user=2156.93	 wallclock=165
    merlin05	 user=2135.27	 wallclock=165
    merlin13	 user=2071.19	 wallclock=165
    merlin20	 user=2151.44	 wallclock=166
    merlin17	 user=2133.44	 wallclock=165
    merlin15	 user=2081.08	 wallclock=165
    merlin03	 user=2137.86	 wallclock=166
    merlin21	 user=2093.57	 wallclock=166
    merlin06	 user=2143.25	 wallclock=166
    merlin18	 user=2084.75	 wallclock=166
    merlin22	 user=2082.3	 wallclock=166
    merlin24	 user=2072.15	 wallclock=165
    merlin01	 user=2132.42	 wallclock=166
    merlin10	 user=2118.59	 wallclock=166
    merlin19	 user=2102.94	 wallclock=166
    merlin04	 user=2089.08	 wallclock=166
    merlin12	 user=2144.93	 wallclock=166
    merlin07	 user=2084.6	 wallclock=165
    merlin09	 user=2051.03	 wallclock=165
    merlin14	 user=2061.3	 wallclock=166
36.424u 6.018s 2:47:52.05 0.4%	0+0k 0+160io 0pf+0w
>   lcore -up	(14:15:25) 1.252u 0.307s 0:01.78 87.0%	0+0k 1744+0io 
8pf+0w
>   lcore -dn	(14:15:27) 1.259u 0.291s 0:01.75 88.0%	0+0k 8+0io 0pf+0w
>   mixer 	(14:15:35) 12.846u 5.332s 0:22.49 80.7%	0+0k 433168+0io 
12pf+0w
:ENERGY convergence:  0 0 0
:CHARGE convergence:  0 0.00005 0

     cycle 2 	(Wed Mar 18 14:15:57 CET 2009) 	(499/98 to go)

-------------------------------------------------------------------------------------

