[Wien] Problem in running k-point parallel jobs
Ravindran Ponniah
ravindran.ponniah at kjemi.uio.no
Tue Aug 15 14:29:05 CEST 2006
Hello,
I am trying to set up k-point parallel jobs on a Linux cluster
here. If we ask for 8 CPUs (for an 8 k-point job), the queuing system correctly
allots 8 CPUs. However, the jobs run only on the master node (i.e. on 2 CPUs)
and the remaining 6 CPUs are idle. We never had such a problem on shared-memory
systems. I am enclosing the message I received from the system expert.
Please let us know where we should look to solve this problem.
Best regards
Ravi
###### communication from system expert
Yes, it was run in parallel, but only on one node. If you don't use mpiexec,
your executables don't start on all nodes. So your 8 processes were running
on one node (that is, 2 CPUs), while the other 6 CPUs were idle.
Please look at the load of the nodes you are currently using at
http://master.titan.uio.no/ganglia/:
-bash-3.00$ qstat -g t | grep ravi
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-0.local  SLAVE
42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-1-13.local SLAVE
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-15.local SLAVE
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-26.local SLAVE
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-33.local SLAVE
41798 0.25746 YBC5SO ravi r 08/14/2006 12:31:28 kjemi@compute-1-8.local  MASTER
42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-2-0.local  SLAVE
42404 0.25656 YBM6U  ravi r 08/15/2006 10:48:09 kjemi@compute-2-11.local MASTER
While your two master nodes, 1-8 and 2-11, have loads of about 8 and 5 (8 and
5 processes) respectively, your slave nodes have a load of 0. You can also see
this by logging into a master and a slave node and running:
ps -ef | grep ravi
We need to figure out a way to invoke mpiexec somewhere in order for this to
run properly in parallel (i.e. on more than 2 CPUs).
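For comparison, here is a rough sketch of the kind of job-script fragment I have
in mind, written for an ordinary MPI program (the parallel-environment name, the
slot count and the program name are only placeholders for this illustration):
### sketch of a batch job script that launches through mpiexec
#!/bin/bash
#$ -pe mpi 8            # ask the queuing system for 8 slots (PE name is site-specific)
#$ -cwd
# mpiexec typically picks up the granted host list from the queuing system and
# starts one process per slot, so the work is spread over all allocated nodes
# instead of running only on the master node.
mpiexec -np 8 ./my_mpi_program
###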
best
Torgeir
On Tue, 15 Aug 2006, Ravindran Ponniah wrote:
> On Tue, 15 Aug 2006, Torgeir Andersen Ruden wrote:
>
>>
>> It doesn't seem that you invoke mpiexec anywhere. You need to do this in
>> order for parallel runs on clusters to work. Which part is supposed to be parallel?
>
> In the WIEN2k code there are two ways in which jobs can be parallelized. One is
> k-point parallelization and the other is called fine-grained parallelization. We
> are using k-point parallelization. It splits the k-points according to the
> number of CPUs used and runs them on different nodes. See our dayfile:
>
> ###
> LAPW0 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPWSO END
> LAPW2 - FERMI; weighs written
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> LAPW2 END
> SUMPARA END
> SUMPARA END
> LAPW2 - FERMI; weighs written
> ###########
>
> We used 8 CPUs for the above calculation, and hence lapw1, lapw2 and lapwso
> were run on 8 CPUs. So, even though we did not execute mpiexec, the job was
> running in parallel.
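>
> As far as I understand, for k-point parallelization WIEN2k does not call mpiexec
> at all: it reads a .machines file in the case directory, and the run_lapw -p
> scripts then start one lapw1/lapw2/lapwso process per listed entry, normally via
> rsh/ssh. As a rough sketch (not our actual file), using some of the node names
> from the qstat listing above, such a file might look like:
>
> ### .machines (sketch; hostnames are examples, two entries per 2-CPU node)
> granularity:1
> 1:compute-1-8
> 1:compute-1-8
> 1:compute-1-0
> 1:compute-1-0
> 1:compute-1-15
> 1:compute-1-15
> 1:compute-1-26
> 1:compute-1-26
> extrafine:1
> ###
>
> Presumably this file would have to be generated inside the job script from the
> host list the queuing system grants (e.g. from $PE_HOSTFILE under SGE), so that
> the k-point jobs land on all allocated nodes rather than only on the master.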