[Wien] Problem in running k-point parallel jobs
Ravindran Ponniah
ravindran.ponniah at kjemi.uio.no
Tue Aug 15 14:29:05 CEST 2006
Hello,
I am trying to set up k-point parallel jobs on a Linux cluster
here. If we ask for 8 CPUs (for an 8 k-point job), the queuing system correctly
allots 8 CPUs. However, the jobs run only on the master node (i.e. on 2 CPUs)
and the remaining 6 CPUs are idle. We never had such a problem on shared-memory
systems. I am enclosing the message I received from the system expert.
Please let us know where we should look to solve this problem.
Best regards
Ravi
###### communication from system expert
Yes, it was run in parallel, but only on one node. If you don't use mpiexec,
your executables don't start on all nodes. So your 8 processes were running
on one node (that is, 2 CPUs), while the other 6 CPUs were idle.
Please look at the load of the nodes you are currently using at
http://master.titan.uio.no/ganglia/:
-bash-3.00$ qstat -g t | grep ravi
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-0.local  SLAVE
42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-1-13.local SLAVE
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-15.local SLAVE
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-26.local SLAVE
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-33.local SLAVE
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-8.local  MASTER
42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-2-0.local  SLAVE
42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-2-11.local MASTER
While your two master nodes, 1-8 and 2-11, have loads of about 8 and 5 (8 and
5 processes) respectively, your slave nodes have a load of 0. You can also see
this by logging into a master and a slave node and running:
ps -ef | grep ravi
We need to figure out a way to invoke mpiexec somewhere in order for this to
run properly in parallel (i.e. on more than 2 CPUs).
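For comparison, here is a rough sketch of the kind of job-script fragment I have
in mind, written for an ordinary MPI program (the parallel-environment name, the
slot count and the program name are only placeholders for this illustration):
### sketch of a batch job script that launches through mpiexec
#!/bin/bash
#$ -pe mpi 8            # ask the queuing system for 8 slots (PE name is site-specific)
#$ -cwd
# mpiexec typically picks up the granted host list from the queuing system and
# starts one process per slot, so the work is spread over all allocated nodes
# instead of running only on the master node.
mpiexec -np 8 ./my_mpi_program
###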
best
Torgeir
On Tue, 15 Aug 2006, Ravindran Ponniah wrote:
> On Tue, 15 Aug 2006, Torgeir Andersen Ruden wrote:
>
>>
>> It doesn't seem that you invoke mpiexec anywhere. You need to do this in
>> order for parallel runs on clusters to work. Which part is supposed to be parallel?
>
> In the WIEN2k code there are two ways in which jobs can be parallelized. One is
> k-point parallelization and the other is called fine-grained parallelization. We
> are using k-point parallelization. It splits the k-points according to the
> number of CPUs used and runs them on different nodes. See our dayfile:
>
> ###
> LAPW0 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPW2 - FERMI; weighs written
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> SUMPARA END
> SUMPARA END
> LAPW2 - FERMI; weighs written
> ###########
>
> We used 8 CPUs for the above calculation, and hence lapw1, lapw2 and lapwso
> were run on 8 CPUs. So, even though we did not execute mpiexec, the job was
> running in parallel.
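>
> As far as I understand, for k-point parallelization WIEN2k does not call mpiexec
> at all: it reads a .machines file in the case directory, and the run_lapw -p
> scripts then start one lapw1/lapw2/lapwso process per listed entry, normally via
> rsh/ssh. As a rough sketch (not our actual file), using some of the node names
> from the qstat listing above, such a file might look like:
>
> ### .machines (sketch; hostnames are examples, two entries per 2-CPU node)
> granularity:1
> 1:compute-1-8
> 1:compute-1-8
> 1:compute-1-0
> 1:compute-1-0
> 1:compute-1-15
> 1:compute-1-15
> 1:compute-1-26
> 1:compute-1-26
> extrafine:1
> ###
>
> Presumably this file would have to be generated inside the job script from the
> host list the queuing system grants (e.g. from $PE_HOSTFILE under SGE), so that
> the k-point jobs land on all allocated nodes rather than only on the master.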