[Wien] Low cpu usage on open-mosix Linux cluster

Jorissen Kevin Kevin.Jorissen at ua.ac.be
Tue Oct 19 13:48:21 CEST 2004


There are two ways around it:
 
- Like Stefaan does: specify the weights in .machines explicitly so that all the k-points are distributed as you intend and lapw1/2/so/para has no choice anymore.
- Like I do: reprogram lapw1/2/so/para so that they read from .processes exactly which nodes were used in the previous job, and then force the current job to use the same ones.
 
Both approaches are less flexible than the NFS solution, but network communication is reduced drastically, and the hard disks that are present in the nodes anyway get some exercise.
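For illustration, Stefaan's approach amounts to a .machines file along these lines (node1/node2 are placeholder hostnames, and this is only a sketch - check the WIEN2k userguide for the exact syntax of your version):

```
# k-point parallel section: one "weight:hostname" line per parallel job.
# With equal weights and granularity:1, the k-points are split into
# exactly one batch per line, so every run distributes them the same
# way - lapw2 then looks for its case.vector_* files on the same node
# where lapw1 wrote them.
1:node1
1:node1
1:node2
1:node2
granularity:1
```

The key point is that with fixed weights and granularity 1, the scheduler has no freedom left, so the job-to-node mapping is reproducible between lapw1 and lapw2.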
 
If you want to try out my scripts, just let me know.
 
 
 
Kevin Jorissen
 
EMAT - Electron Microscopy for Materials Science   (http://webhost.ua.ac.be/emat/)
Dept. of Physics
 
UA - Universiteit Antwerpen
Groenenborgerlaan 171
B-2020 Antwerpen
Belgium
 
tel  +32 3 2653249
fax +32 3 2653257
e-mail kevin.jorissen at ua.ac.be
 

________________________________

From: wien-admin at zeus.theochem.tuwien.ac.at on behalf of EB Lombardi
Sent: Tue 19-10-2004 12:31
To: wien at zeus.theochem.tuwien.ac.at
Subject: Re: [Wien] Low cpu usage on open-mosix Linux cluster



Dear Dr Andersen

Thank you for your e-mail.

Up to now I have been using the NFS-mounted "case" directory as the
working directory - so what you wrote about NFS-mounted scratch
directories also applies here.
To check, I ran a test with Wien running only on the home node (i.e. no
slow networks involved), which resulted in both lapw1 and lapw2 running
at 99%.

I have a question regarding local scratch partitions. Suppose lapw1 has
run on dual-processor nodes 1 and 2 (i.e. k-point parallel over 4
processors), leaving case.vector_1 and case.vector_2 on the scratch
partition of node 1, while case.vector_3 and case.vector_4 are left on
node 2. When lapw2 runs, it cannot be guaranteed that the lapw2
processes will be distributed among the nodes in the same order as the
lapw1 processes were. Hence the "first" lapw2 job may well run on node
2, where it will not find case.vector_1. I assume this would lead lapw2
to crash? If so, is there any way to work around this?

About the configuration of the machine: it is a group of dual-processor
PIII, PIV and Xeon machines, grouped together as a mosix cluster running
Linux 2.4.22-openmosix2smp. Each node has 4 GB RAM, with an NFS-mounted
file system.

Thank you.

Best regards

Enrico

Torsten Andersen wrote:

> Dear Mr. Lombardi,
>
> well, at least for lapw2, a fast file system (15k-RPM local disks with
> huge caches and hardware-based RAID-0) is essential to utilizing more
> than 1% of the CPU time... and if more than one process wants to
> access the same file system at the same time (e.g., parallel lapw2),
> this requirement becomes even more essential.
>
> If you have problems getting lapw1 to run at 100% CPU time, the system
> seems to be seriously misconfigured. I can think of two (there might
> be more) problems in the setup:
>
> 1. The scratch partition is NFS-mounted instead of local (and despite
> many manufacturers' claims to the contrary, networked file systems are
> still VERY SLOW compared to local disks).
>
> 2. The system memory bandwidth is too slow, e.g., using DDR-266 with
> Xeons, or the memory is only connected to one CPU on Opterons.
>
> In order to "diagnose" a little better we need to know the
> configuration in detail:-)
>
> Best regards,
> Torsten Andersen.
>
> EB Lombardi wrote:
>
>> Dear Wien users
>>
>> When I run Wien2k on a Linux-openMosix cluster, lapw1 and lapw2
>> (k-point parallel) processes mostly use a low percentage of the
>> available CPU time. Typically only 10-50% of each processor is used,
>> with values below 10% and above 90% also occurring. On the other hand,
>> single processes, such as lapw0, etc, typically use 99.9% processor
>> power of one processor. On each node, (number of jobs) = (number of
>> processors).
>>
>> This low CPU utilization does not occur on a dual-processor Linux
>> machine, where CPU utilization is mostly 99.9%.
>>
>> Any suggestions on improving the CPU utilization of lapw1c and lapw2
>> on mosix clusters would be appreciated.
>>
>> Regards
>>
>> Enrico Lombardi
>
>
>

_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien



