[Wien] Low cpu usage on open-mosix Linux cluster
EB Lombardi
lombaeb at science.unisa.ac.za
Wed Oct 20 12:02:39 CEST 2004
Dear Torsten and Kevin,
Thank you for your suggestions.
So it seems that, as long as the number of jobs <= the number of processors
(per node), Wien should be able to run unmodified with local scratch
partitions. (PS: I mostly run spin-polarized cases.)
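For concreteness, a minimal sketch of the setup I have in mind (paths
and hostnames are placeholders for our nodes): point the scratch
variable at a local partition and give .machines one machine line per
processor, e.g.

    # in the job script (csh), on every node:
    setenv SCRATCH /scratch

    # .machines: one k-point job per processor on two dual nodes
    granularity:1
    1:node1
    1:node1
    1:node2
    1:node2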
To Kevin: I'd appreciate it if I could try your scripts.
Thank you
Enrico
Jorissen Kevin wrote:
>There are two ways around it:
>
>- Like Stefaan does: fix everything by specifying the weights in .machines so that all the k-points are distributed at once and lapw1/2/so/para has no choice anymore (see the sketch after this list)
>- Like I do: reprogram lapw1/2/so/para so that they read from .processes exactly which nodes were used in the previous job, and then force the current job to use the same ones.
>
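>A sketch of the first approach (assuming, as an example, 8 k-points on
>two dual-processor nodes; hostnames are placeholders):
>
>    granularity:1
>    2:node1
>    2:node1
>    2:node2
>    2:node2
>
>With granularity:1 the k-list is split into exactly one chunk per
>machine line (2 k-points each here), so every run assigns chunk N to
>line N and case.vector_N is written and read on the same node.
>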
>Both approaches are less flexible than the NFS solution, but the network communication is reduced drastically and the hard disks that are present in the nodes anyway get some exercise.
>
>If you want to try out my scripts, just let me know.
>
>
>
>Kevin Jorissen
>
>EMAT - Electron Microscopy for Materials Science (http://webhost.ua.ac.be/emat/)
>Dept. of Physics
>
>UA - Universiteit Antwerpen
>Groenenborgerlaan 171
>B-2020 Antwerpen
>Belgium
>
>tel +32 3 2653249
>fax +32 3 2653257
>e-mail kevin.jorissen at ua.ac.be
>
>
>________________________________
>
>From: wien-admin at zeus.theochem.tuwien.ac.at on behalf of EB Lombardi
>Sent: Tue 19-10-2004 12:31
>To: wien at zeus.theochem.tuwien.ac.at
>Subject: Re: [Wien] Low cpu usage on open-mosix Linux cluster
>
>
>
>Dear Dr Andersen
>
>Thank you for your e-mail
>
>Up to now I have been using the NFS-mounted "case" directory as the
>working directory - so what you wrote about NFS-mounted scratch
>directories also applies here.
>To check, I ran a test with Wien running only on the home node (i.e. no
>slow network involved), which resulted in both lapw1 and lapw2 running
>at 99% CPU.
>
>I have a question regarding local scratch partitions: suppose lapw1 has
>run on dual-processor nodes 1 and 2 (i.e. k-point parallel over 4
>processors), leaving case.vector_1 and case.vector_2 on the scratch
>partition of node 1, while case.vector_3 and case.vector_4 are left on
>node 2. When lapw2 runs, it cannot be guaranteed that the lapw2
>processes will be distributed among the nodes in the same order as the
>lapw1 processes were. Hence the "first" lapw2 job may well run on node
>2, but will not find case.vector_1 there. I assume this would cause
>lapw2 to crash? If so, is there any way to work around this?
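>
>To make the scenario concrete (hostnames as examples):
>
>    lapw1:  node1 writes case.vector_1, case.vector_2 to its local scratch
>            node2 writes case.vector_3, case.vector_4 to its local scratch
>    lapw2:  the job for the first k-point group happens to start on node2
>            and looks for case.vector_1 on node2's scratch -> not found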
>
>About the configuration of the machine: it is a group of dual-processor
>PIII, PIV and Xeon machines, grouped together as a mosix cluster using
>Linux version 2.4.22-openmosix2smp. Each node has 4 GB of RAM, with an
>NFS-mounted file system.
>
>Thank you.
>
>Best regards
>
>Enrico
>
>Torsten Andersen wrote:
>
>
>
>>Dear Mr. Lombardi,
>>
>>well, at least for lapw2, a fast file system (15k-RPM local disks with
>>huge caches and hardware-based RAID-0) is essential to utilizing more
>>than 1% of the CPU time... and if more than one process wants to
>>access the same file system at the same time (e.g., parallel lapw2),
>>this requirement becomes even more critical.
>>
>>If you have problems getting lapw1 to run at 100% CPU time, the system
>>seems to be seriously misconfigured. I can think of two (there might
>>be more) problems in the setup:
>>
>>1. The scratch partition is NFS-mounted instead of local (and despite
>>many manufacturers' claims to the contrary, networked file systems are
>>still VERY SLOW compared to local disks; a quick check is sketched
>>after this list).
>>
>>2. The system memory bandwidth is too low, e.g., using DDR-266 with
>>Xeons, or the memory being connected to only one CPU on Opterons.
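>>
>>A quick check for point 1 (paths are placeholders): time a large
>>sequential write on the local scratch disk and on the NFS mount and
>>compare, e.g.
>>
>>    time dd if=/dev/zero of=/scratch/ddtest bs=1M count=1024
>>    time dd if=/dev/zero of=/home/ddtest bs=1M count=1024
>>
>>On 100 Mbit Ethernet, for example, the NFS write cannot exceed about
>>12 MB/s, while even a single local IDE disk should manage several
>>times that.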
>>
>>In order to "diagnose" a little better, we need to know the
>>configuration in detail :-)
>>
>>Best regards,
>>Torsten Andersen.
>>
>>EB Lombardi wrote:
>>
>>
>>
>>>Dear Wien users
>>>
>>>When I run Wien2k on a Linux-openMosix cluster, the lapw1 and lapw2
>>>(k-point parallel) processes mostly use a low percentage of the
>>>available CPU time. Typically only 10-50% of each processor is used,
>>>with values below 10% and above 90% also occurring. On the other
>>>hand, single processes, such as lapw0, typically use 99.9% of one
>>>processor. On each node, (number of jobs) = (number of processors).
>>>
>>>This low CPU utilization does not occur on a dual-processor Linux
>>>machine, where CPU utilization is mostly 99.9%.
>>>
>>>Any suggestions on improving the CPU utilization of lapw1c and lapw2
>>>on mosix clusters would be appreciated.
>>>
>>>Regards
>>>
>>>Enrico Lombardi
>>>
>>>
>>
>>
>
>_______________________________________________
>Wien mailing list
>Wien at zeus.theochem.tuwien.ac.at
>http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien