[Wien] Low cpu usage on open-mosix Linux cluster
EB Lombardi
lombaeb at science.unisa.ac.za
Tue Oct 19 12:31:58 CEST 2004
Dear Dr Andersen
Thank you for your e-mail.
Up to now I have been using the NFS mounted "case" directory as working
directory - so what you wrote about NFS mounted scratch directories also
applies here.
To check, I ran a test with Wien running only on the home node (i.e., no
slow networks involved), which resulted in both lapw1 and lapw2 running
at 99%.
I have a question regarding local scratch partitions. Suppose lapw1 has
run on dual-processor nodes 1 and 2 (i.e., k-point parallel over 4
processors), leaving case.vector_1 and case.vector_2 on the scratch
partition of node 1, and case.vector_3 and case.vector_4 on that of
node 2. When lapw2 runs, it cannot be guaranteed that the lapw2
processes will be distributed among the nodes in the same order as the
lapw1 processes were. Hence the "first" lapw2 job may well run on node 2,
but will not find case.vector_1 there. I assume this would cause lapw2
to crash? If so, is there any way to work around this?
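
To make the concern concrete, here is a minimal sketch in Python. It is
purely illustrative and not part of Wien2k: the node names, the job
placements and the helper missing_vectors are all hypothetical. It only
lists which k-point jobs would look for their vector file on a node
other than the one where lapw1 left it.

```python
def missing_vectors(lapw1_nodes, lapw2_nodes):
    """Return the 1-based job indices whose lapw2 node differs from
    the node on which lapw1 wrote the matching vector file.

    Both arguments are lists of (hypothetical) node names, one entry
    per k-point parallel job, in job order.
    """
    if len(lapw1_nodes) != len(lapw2_nodes):
        raise ValueError("lapw1 and lapw2 must have the same number of jobs")
    return [
        job
        for job, (wrote_on, reads_on) in enumerate(
            zip(lapw1_nodes, lapw2_nodes), start=1
        )
        if wrote_on != reads_on
    ]


# Assumed placement: lapw1 left vector_1 and vector_2 on node1,
# vector_3 and vector_4 on node2.
lapw1_nodes = ["node1", "node1", "node2", "node2"]
# Assumed (different) placement of the lapw2 jobs.
lapw2_nodes = ["node2", "node1", "node1", "node2"]

print(missing_vectors(lapw1_nodes, lapw2_nodes))  # [1, 3]
```

In this assumed placement, jobs 1 and 3 would not find their vector
files on the local scratch partition.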
About the configuration of the machine: it is a group of dual processor
PIII, PIV and Xeon machines, grouped together as a mosix cluster using
Linux version 2.4.22-openmosix2smp. Each node has 4GB RAM, with an NFS
mounted file system.
Thank you.
Best regards
Enrico
Torsten Andersen wrote:
> Dear Mr. Lombardi,
>
> well, at least for lapw2, a fast file system (15k-RPM local disks with
> huge caches and hardware-based RAID-0) is essential to utilizing more
> than 1% of the CPU time... and if more than one process wants to
> access the same file system at the same time (e.g., parallel lapw2),
> this requirement becomes even more essential.
>
> If you have problems getting lapw1 to run at 100% CPU-time, the system
> seems to be seriously misconfigured. I can think of two (there might
> be more) problems in the setup:
>
> 1. The scratch partition is NFS-mounted instead of local (and despite
> many manufacturers' claims to the contrary, networked file systems are
> still VERY SLOW compared to local disks).
>
> 2. The system memory bandwidth is too slow, e.g., using DDR-266 with
> Xeons, or the memory is only connected to one CPU on Opterons.
>
> In order to "diagnose" a little better we need to know the
> configuration in detail:-)
>
> Best regards,
> Torsten Andersen.
>
> EB Lombardi wrote:
>
>> Dear Wien users
>>
>> When I run Wien2k on a Linux-openMosix cluster, lapw1 and lapw2
>> (k-point parallel) processes mostly use a low percentage of the
>> available CPU time. Typically only 10-50% of each processor is used,
>> with values below 10% and above 90% also occurring. On the other hand,
>> single processes, such as lapw0, etc, typically use 99.9% processor
>> power of one processor. On each node, (number of jobs) = (number of
>> processors).
>>
>> This low CPU utilization does not occur on a dual processor linux
>> machine, where cpu utilization is mostly 99.9%.
>>
>> Any suggestions on improving the CPU utilisation of lapw1c and lapw2
>> on mosix clusters would be appreciated.
>>
>> Regards
>>
>> Enrico Lombardi