[Wien] Low cpu usage on open-mosix Linux cluster
Jorissen Kevin
Kevin.Jorissen at ua.ac.be
Tue Oct 19 14:34:48 CEST 2004
Dear Torsten,
what do you mean by "synchronized"?
Look at the main loop of lapw2para:

while ($loop < $maxproc)
    set p = 1
    # scan the machine list; grab the first node whose lock file is absent
    while ($p <= $#machine)
        if ($loop < $maxproc) then
            if !(-e .lock_$lockfile[$p]) then
                @ loop ++
                echo "${loop}:${maxproc}" >.processes2
                touch .lock_$lockfile[$p]
                # run job $loop on machine $p; the remote shell removes the lock on exit
                ($remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;rm -f .lock_$lockfile[$p]") >>.time2_$loop &
            endif
        endif
        @ p ++
    end
end
Basically, job number $loop is submitted to the first machine $p in the list
(the same list of machines used by lapw1para) that becomes available.
As soon as the number of jobs (i.e., maxproc) exceeds the number of nodes
(i.e., $#machine), there is no guarantee that the correct machine will be
used. The same remark applies to spin-polarized and spin-orbit calculations.
The first $#machine jobs will always land on the right node (which is why
Stefaan's approach works), but for the remaining jobs there is no such
guarantee.
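For illustration, a node-pinned variant could look like the sketch below. This
is only my own sketch, not the shipped script: the round-robin indexing and
the busy-wait on the lock file are assumptions, and it trades load balancing
for a fixed job-to-node mapping.

# hypothetical variant: pin job $loop to machine ((loop - 1) mod $#machine) + 1,
# so lapw2 job i runs on the same node that produced case.vector_i
@ loop = 0
while ($loop < $maxproc)
    @ loop ++
    @ p = ($loop - 1) % $#machine
    @ p ++
    # wait for the pinned node instead of taking whichever node is free
    while (-e .lock_$lockfile[$p])
        sleep 1
    end
    touch .lock_$lockfile[$p]
    ($remote $machine[$p] "cd $PWD;$t $exe ${def}_${loop}.def $loop;rm -f .lock_$lockfile[$p]") >>.time2_$loop &
end

The obvious cost is that a job now waits for "its" node even if another node
is already idle.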
If I am mistaken somehow, please correct me.
Kevin Jorissen
EMAT - Electron Microscopy for Materials Science (http://webhost.ua.ac.be/emat/)
Dept. of Physics
UA - Universiteit Antwerpen
Groenenborgerlaan 171
B-2020 Antwerpen
Belgium
tel +32 3 2653249
fax +32 3 2653257
e-mail kevin.jorissen at ua.ac.be
________________________________
From: wien-admin at zeus.theochem.tuwien.ac.at on behalf of Torsten Andersen
Sent: Tue 19-10-2004 13:18
To: wien at zeus.theochem.tuwien.ac.at
Subject: Re: [Wien] Low cpu usage on open-mosix Linux cluster
Dear Mr. Lombardi,
EB Lombardi wrote:
> Dear Dr Andersen
>
> Thank you for your e-mail
>
> Up to now I have been using the NFS-mounted "case" directory as the
> working directory - so what you wrote about NFS-mounted scratch
> directories also applies here.
> To check, I ran a test with Wien running only on the home node (i.e., no
> slow networks involved), which resulted in both lapw1 and lapw2 running
> at 99%.
So the NFS is the problem...
>
> I have a question regarding local scratch partitions: suppose lapw1 has
> run on dual-processor nodes 1 and 2 (i.e., k-point parallel over 4
> processors), leaving case.vector_1 and vector_2 on the scratch partition
> of node 1, while vector_3 and vector_4 are left on node 2. When lapw2
> runs, it cannot be guaranteed that the lapw2 processes will be
> distributed among the nodes in the same order as the lapw1 processes
> were. Hence the "first" lapw2 job may well run on node 2, where it will
> not find case.vector_1. I assume this would lead lapw2 to crash? If so,
> is there any way to work around this?
In newer versions of Wien2k (at least in Wien2k_02 and up), lapw1,
lapwso, and lapw2 are "synchronized" with respect to the machines. See
the .machine* files in the case directory. The only constraint is that
the scratch directory is in the same location on all your machines
(assuming your .cshrc, .login, and .profile are the same on all nodes -
of course this can be tuned individually, but...).
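For example, a uniform setting in every node's .cshrc satisfies this
constraint (a minimal sketch; /scratch is only an illustrative path and
must point at a local partition on each node):

# identical line in the .cshrc on every node
setenv SCRATCH /scratch/$USER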
>
> About the configuration of the machine: it is a set of dual-processor
> PIII, PIV and Xeon machines, combined into a mosix cluster using
> Linux version 2.4.22-openmosix2smp. Each node has 4GB RAM, with an
> NFS-mounted file system.
If you change the scratch partitions to local, everything should be ok.
Best regards,
Torsten Andersen.
>
> Thank you.
>
> Best regards
>
> Enrico
>
> Torsten Andersen wrote:
>
>> Dear Mr. Lombardi,
>>
>> well, at least for lapw2, a fast file system (15k-RPM local disks with
>> huge caches and hardware-based RAID-0) is essential to utilizing more
>> than 1% of the CPU time... and if more than one process wants to
>> access the same file system at the same time (e.g., parallel lapw2),
>> this requirement becomes even more critical.
>>
>> If you have problems getting lapw1 to run at 100% CPU time, the system
>> seems to be seriously misconfigured. I can think of two possible
>> problems in the setup (there might be more):
>>
>> 1. The scratch partition is NFS-mounted instead of local (and despite
>> many manufacturers' claims to the contrary, networked file systems are
>> still VERY SLOW compared to local disks - see the quick comparison below).
>>
>> 2. The system memory bandwidth is too low, e.g., when using DDR-266 with
>> Xeons, or when the memory is connected to only one CPU on Opterons.
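>>
>> To see the first problem for yourself, a rough timing comparison is
>> enough (my own illustration; the two paths are placeholders for a local
>> scratch partition and an NFS mount):
>>
>> # write ~1 GB to the local disk, then to the NFS mount, and compare times
>> time dd if=/dev/zero of=/scratch/ddtest bs=1M count=1000
>> time dd if=/dev/zero of=/nfs/home/ddtest bs=1M count=1000
>> rm -f /scratch/ddtest /nfs/home/ddtest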
>>
>> In order to "diagnose" a little better, we need to know the
>> configuration in detail :-)
>>
>> Best regards,
>> Torsten Andersen.
>>
>> EB Lombardi wrote:
>>
>>> Dear Wien users
>>>
>>> When I run Wien2k on a Linux-openMosix cluster, lapw1 and lapw2
>>> (k-point parallel) processes mostly use a low percentage of the
>>> available CPU time. Typically only 10-50% of each processor is used,
>>> with values below 10% and above 90% also occurring. On the other hand,
>>> single processes, such as lapw0, typically use 99.9% of
>>> one processor. On each node, (number of jobs) = (number of
>>> processors).
>>>
>>> This low CPU utilization does not occur on a dual-processor Linux
>>> machine, where CPU utilization is mostly 99.9%.
>>>
>>> Any suggestions on improving the CPU utilization of lapw1c and lapw2
>>> on mosix clusters would be appreciated.
>>>
>>> Regards
>>>
>>> Enrico Lombardi
--
Dr. Torsten Andersen TA-web: http://deep.at/myspace/
AG Hübner, Department of Physics, Kaiserslautern University
http://cmt.physik.uni-kl.de http://www.physik.uni-kl.de/
_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien