[Wien] Low cpu usage on open-mosix Linux cluster

Torsten Andersen thor at physik.uni-kl.de
Tue Oct 19 15:18:49 CEST 2004


Dear Kevin,

I mean, it usually worked for me... however, I must admit that I am now 
using only one machine per case, because (1) "they" couldn't make the NFS 
and SGE stable enough to support 8 threads across several machines, and 
(2) there are always enough calculations to do.

Anyway, I don't claim to be an expert in this matter, and you probably 
have more experience than I have.

Therefore, to Mr. Lombardi: you should probably follow Kevin's approach...

Best regards,
Torsten Andersen.

Jorissen Kevin wrote:
> Dear Torsten,
> what do you mean by "synchronized"?
> Look at the main loop of lapw2para:
> 
>   while ($loop < $maxproc)
>     set p = 1
>     while ($p <= $#machine)
>       if ($loop < $maxproc) then
>         # take the first machine whose lock file is absent
>         if !(-e .lock_$lockfile[$p]) then
>           @ loop ++
>           echo "${loop}:${maxproc}" >.processes2
>           touch .lock_$lockfile[$p]
>           # run job $loop there; the remote command removes the lock when done
>           ($remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;rm -f .lock_$lockfile[$p]") >>.time2_$loop &
>         endif
>       endif
>       @ p ++
>     end
>   end
> 
>  
> Basically, job number $loop is submitted to the first machine $p in the
> list (of machines used by lapw1para) that becomes available. As soon as
> the number of jobs (i.e., maxproc) becomes larger than the number of
> nodes (i.e., $#machine), there is no guarantee that the correct machine
> will be used. The same remark applies to spin-polarized and spin-orbit
> calculations. The first $#machine jobs will always land on the right
> node (which is why Stefaan's approach works), but for the remaining
> ones, one cannot be sure.
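> 
> Just as an illustration (untested, and assuming lapw1para handed out
> its jobs round-robin over the machine list), one could pin job number
> $loop to a fixed node instead of taking the first free one, reusing the
> same variables as above, so that job N always runs where case.vector_N
> was written:
> 
>   @ loop = 0
>   while ($loop < $maxproc)
>     @ loop ++
>     # job N -> machine ((N-1) mod $#machine) + 1, its round-robin slot
>     @ p = ($loop - 1) % $#machine
>     @ p ++
>     # wait until *this* node is free instead of grabbing any free node
>     while (-e .lock_$lockfile[$p])
>       sleep 1
>     end
>     touch .lock_$lockfile[$p]
>     ($remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;rm -f .lock_$lockfile[$p]") >>.time2_$loop &
>   end
>   wait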
>  
>  
> If I am mistaken somehow, please correct me.
>  
>  
> Kevin Jorissen
>  
> EMAT - Electron Microscopy for Materials Science   (http://webhost.ua.ac.be/emat/)
> Dept. of Physics
>  
> UA - Universiteit Antwerpen
> Groenenborgerlaan 171
> B-2020 Antwerpen
> Belgium
>  
> tel +32 3 2653249
> fax +32 3 2653257
> e-mail kevin.jorissen at ua.ac.be
>  
> 
> ________________________________
> 
> From: wien-admin at zeus.theochem.tuwien.ac.at on behalf of Torsten Andersen
> Sent: Tue 19-10-2004 13:18
> To: wien at zeus.theochem.tuwien.ac.at
> Subject: Re: [Wien] Low cpu usage on open-mosix Linux cluster
> 
> 
> 
> Dear Mr. Lombardi,
> 
> EB Lombardi wrote:
> 
>>Dear Dr Andersen
>>
>>Thank you for your e-mail
>>
>>Up to now I have been using the NFS-mounted "case" directory as the
>>working directory - so what you wrote about NFS-mounted scratch
>>directories also applies here.
>>To check, I ran a test with Wien running only on the home node (i.e.,
>>no slow networks involved), which resulted in both lapw1 and lapw2
>>running at 99%.
> 
> 
> So the NFS is the problem...
> 
> 
>>I have a question regarding local scratch partitions: suppose lapw1 has
>>run on dual-processor nodes 1 and 2 (i.e., k-point parallel over 4
>>processors), leaving case.vector_1 and case.vector_2 on the scratch
>>partition of node 1, while case.vector_3 and case.vector_4 are left on
>>node 2. When lapw2 runs, there is no guarantee that the lapw2 processes
>>will be distributed among the nodes in the same order as the lapw1
>>processes were. Hence the "first" lapw2 job may well run on node 2, but
>>will not find case.vector_1 there. I assume this would cause lapw2 to
>>crash? If so, is there any way to work around this?
> 
> 
> In newer versions of Wien2k (at least Wien2k_02 and up), lapw1,
> lapwso, and lapw2 are "synchronized" with respect to the machines. See
> the .machine* files in the case directory. The only constraint is that
> the scratch directory is in the same location on all your machines
> (since your .cshrc, .login, and .profile are the same on all nodes - of
> course this can be tuned individually, but...).
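> 
> For example (the path is only an illustration, adapt it to your setup):
> a single line in the ~/.cshrc that is shared by all nodes points
> $SCRATCH to the same path everywhere, while the partition behind that
> path is physically local on each node:
> 
>   # same path on every node, but a local partition on each of them
>   setenv SCRATCH /scratch/$user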
> 
> 
>>About the configuration of the machine: it is a group of dual-processor
>>PIII, PIV, and Xeon machines, grouped together as an openMosix cluster
>>using Linux version 2.4.22-openmosix2smp. Each node has 4 GB RAM, with
>>an NFS-mounted file system.
> 
> 
> If you change the scratch partitions to local, everything should be ok.
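> 
> For instance, for your example of two dual-processor nodes, a .machines
> file along these lines (the hostnames are placeholders) would start two
> k-point parallel jobs per node:
> 
>   granularity:1
>   1:node1
>   1:node1
>   1:node2
>   1:node2
> 
> With local scratch partitions and the synchronization described above,
> each lapw2 job should then find its case.vector_* files.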
> 
> Best regards,
> Torsten Andersen.
> 
> 
>>Thank you.
>>
>>Best regards
>>
>>Enrico
>>
>>Torsten Andersen wrote:
>>
>>
>>>Dear Mr. Lombardi,
>>>
>>>well, at least for lapw2, a fast file system (15k-RPM local disks with
>>>huge caches and hardware-based RAID-0) is essential for utilizing more
>>>than 1% of the CPU time... and if more than one process accesses the
>>>same file system at the same time (e.g., parallel lapw2), this
>>>requirement becomes even more critical.
>>>
>>>If you have trouble getting lapw1 to run at 100% CPU time, the system
>>>seems to be seriously misconfigured. I can think of two possible
>>>problems in the setup (there might be more):
>>>
>>>1. The scratch partition is NFS-mounted instead of local (and despite
>>>many manufacturers' claims to the contrary, networked file systems are
>>>still VERY SLOW compared to local disks; see the quick test sketched
>>>after point 2).
>>>
>>>2. The system memory bandwidth is too low, e.g., when using DDR-266
>>>with Xeons, or when the memory is connected to only one CPU on
>>>Opterons.
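>>>
>>>A quick way to check point 1 is to time a large sequential write on
>>>both file systems (the paths here are just examples):
>>>
>>>  time dd if=/dev/zero of=/scratch/ddtest bs=1024k count=1024    # local
>>>  time dd if=/dev/zero of=$HOME/case/ddtest bs=1024k count=1024  # NFS
>>>  rm -f /scratch/ddtest $HOME/case/ddtest
>>>
>>>If the NFS number is an order of magnitude lower, that alone can
>>>explain the idle CPUs.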
>>>
>>>In order to "diagnose" this a little better, we need to know the
>>>configuration in detail. :-)
>>>
>>>Best regards,
>>>Torsten Andersen.
>>>
>>>EB Lombardi wrote:
>>>
>>>
>>>>Dear Wien users
>>>>
>>>>When I run Wien2k on a Linux-openMosix cluster, the (k-point
>>>>parallel) lapw1 and lapw2 processes mostly use a low percentage of
>>>>the available CPU time. Typically only 10-50% of each processor is
>>>>used, with values below 10% and above 90% also occurring. On the
>>>>other hand, single processes such as lapw0 typically use 99.9% of one
>>>>processor. On each node, (number of jobs) = (number of processors).
>>>>
>>>>This low CPU utilization does not occur on a dual-processor Linux
>>>>machine, where CPU utilization is mostly 99.9%.
>>>>
>>>>Any suggestions on improving the CPU utilization of lapw1c and lapw2
>>>>on openMosix clusters would be appreciated.
>>>>
>>>>Regards
>>>>
>>>>Enrico Lombardi

-- 
Dr. Torsten Andersen        TA-web: http://deep.at/myspace/
AG Hübner, Department of Physics, Kaiserslautern University
http://cmt.physik.uni-kl.de    http://www.physik.uni-kl.de/



