[Wien] Problem in running k-point parallel jobs
Torsten Andersen
thor at physik.uni-kl.de
Tue Aug 15 14:46:07 CEST 2006
Hello,
Well, this would be very inefficient! Forget MPI if you have enough
k-points to do k-point parallelization - the k-point parallel mode is
driven entirely by the .machines file and needs no mpiexec.
Somewhere in the queue system there has to be a list of the machines
you have been allocated (or your sysadmin has to create it for you).
With this list and a bit of scripting you can then create a suitable
.machines file.
On one of the clusters I use, the list of allocated CPUs is in
$TMPDIR/machines, and I build a .machines file in the submitted
job script like this:
<----
# Build the Wien2k ".machines" file - this should be rebuilt every time
# The allocated CPUs are listed in $TMPDIR/machines
if (-e .machines) rm -f .machines
echo "granularity:1" > .machines
echo "extrafine:1" >> .machines
# prepend the weight "1:" to every host name (works here because all hosts start with "aix")
sed 's/aix/1:aix/g' $TMPDIR/machines >> .machines
<----
where $TMPDIR/machines could look like this (example) at runtime -
it should be cleared at the end of the job, of course, and should not
exist before the job begins...
<----
aixhp7
aixhp7
aixhp8
aixhp8
aixhp1
aixhp1
aixhp1
aixhp9
<----
for an 8-CPU job. But it all depends on how your queue system is configured...
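For reference, with the example host list above, the script would produce a
.machines file roughly like this (one weight-1 entry per allocated CPU; the
exact result of course depends on the sed pattern matching your host names):
<----
granularity:1
extrafine:1
1:aixhp7
1:aixhp7
1:aixhp8
1:aixhp8
1:aixhp1
1:aixhp1
1:aixhp1
1:aixhp9
<----
If your cluster runs Grid Engine (the qstat output quoted below suggests it
might), a similar sketch - purely an assumption on my part, so check the
variable name with your sysadmin - could read the allocation from
$PE_HOSTFILE, which lists one "hostname slots ..." line per node:
<----
# hedged sketch, assuming Grid Engine exports $PE_HOSTFILE inside the job
if (-e .machines) rm -f .machines
echo "granularity:1" > .machines
echo "extrafine:1" >> .machines
# write one "1:hostname" line per allocated slot on each node
awk '{for (i = 0; i < $2; i++) print "1:" $1}' $PE_HOSTFILE >> .machines
<----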
Best regards,
Torsten Andersen.
Ravindran Ponniah wrote:
> Hello,
>
> I am trying to set up k-point parallel jobs on a Linux cluster
> here. If we ask for 8 CPUs (for an 8 k-point job), the queuing system correctly
> allots 8 CPUs, but the jobs run only on the master node (i.e. on 2 CPUs)
> and the remaining 6 CPUs are idle. We never had such a problem on shared-memory
> systems. I am enclosing herewith the message I have received from
> the system expert. Please tell me where we should look to solve this
> problem.
>
> Best regards
> Ravi
> ###### communication from system expert
> Yes, it was run in parallel, but only on one node. If you don't use mpiexec,
> your executables don't start on all nodes. So your 8 processes were running
> on one node (that is, 2 CPUs), while the other 6 processors were idle.
>
> Please look at the load of the nodes you are currently using on
> http://master.titan.uio.no/ganglia/:
>
> -bash-3.00$ qstat -g t | grep ravi
> 41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-0.local  SLAVE
> 42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-1-13.local SLAVE
> 41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-15.local SLAVE
> 41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-26.local SLAVE
> 41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-33.local SLAVE
> 41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-8.local  MASTER
> 42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-2-0.local  SLAVE
> 42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-2-11.local MASTER
>
> While your two master nodes, 1-8 and 2-11, have loads of about 8 and 5 (8 and
> 5 processes) respectively, your slave nodes have a load of 0. You can also see
> this by logging into a master and a slave node and running:
>
> ps -ef | grep ravi
>
> We need to figure out a way to invoke mpiexec somewhere in order for this to
> run properly in parallel (at least beyond using 2 CPUs).
>
> best
> Torgeir
>
>
> On Tue, 15 Aug 2006, Ravindran Ponniah wrote:
>
>
>>On Tue, 15 Aug 2006, Torgeir Andersen Ruden wrote:
>>
>>
>>>It doesn't seem that you invoke mpiexec anywhere. You need to do this in
>>>order for parallel execution on clusters to work. Which part is supposed to be parallel?
>>
>>In the WIEN2k code there are two ways jobs can be parallelized: one is
>>k-point parallelization and the other is called fine-grained parallelization.
>>We are using k-point parallelization. It splits the k-points according to
>>the number of CPUs used and runs them on different nodes. See our dayfile:
>>
>>###
>>LAPW0 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPW1 END
>>LAPWSO END
>>LAPWSO END
>>LAPWSO END
>>LAPWSO END
>>LAPWSO END
>>LAPWSO END
>>LAPWSO END
>>LAPWSO END
>>LAPW2 - FERMI; weighs written
>>LAPW2 END
>>LAPW2 END
>>LAPW2 END
>>LAPW2 END
>>LAPW2 END
>>LAPW2 END
>>LAPW2 END
>>LAPW2 END
>>SUMPARA END
>>SUMPARA END
>>LAPW2 - FERMI; weighs written
>>###########
>>
>>We used 8 CPUs for the above calculation, and hence lapw1, lapw2, and lapwso
>>ran on 8 CPUs. So, even though we have not executed mpiexec, the job was
>>running in parallel.
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
--
Dr. Torsten Andersen TA-web: http://deep.at/myspace/
AG Hübner, Department of Physics, Kaiserslautern University
http://cmt.physik.uni-kl.de http://www.physik.uni-kl.de/