[Wien] k-point parallel job in distributed file system

XU ZUO xzuo at nankai.edu.cn
Thu Aug 17 14:33:30 CEST 2006


Unfortunately, I am suffering from instability of the k-point
parallelization. I understand that this problem is caused by poor NFS
performance (read/write latency and synchronization problems) and that
adjusting $delay and $sleepy may solve it. However, as the cluster load
and traffic are dynamic, it would be better to have adaptive code that
handles this problem dynamically.
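
For those who only want to try the static workaround first: $delay and
$sleepy are plain csh variables set inside the lapwXpara scripts (e.g.
$WIENROOT/lapw1para). A minimal sketch of the kind of change meant above --
the exact location may differ between WIEN2k versions, and the values are
purely illustrative:

    # in $WIENROOT/lapw1para (csh); values in seconds, defaults are small
    set delay  = 5    # pause between launching jobs on remote nodes,
                      # giving a slow NFS time to propagate new files
    set sleepy = 5    # interval between polls for finished k-point jobs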

-----Original Message-----
From: wien-bounces at zeus.theochem.tuwien.ac.at
[mailto:wien-bounces at zeus.theochem.tuwien.ac.at] On Behalf Of Stefaan
Cottenier
Sent: Thursday, August 17, 2006 7:57 PM
To: wien at zeus.theochem.tuwien.ac.at
Subject: Re: [Wien] k-point parallel job in distributed file system

You say you were able to do a k-point parallel run, but it was slow.  
This means that all your nodes can access a common place (where your
case.struct etc. are). Your problem probably is that you have put $SCRATCH
also in that same directory, which indeed causes a lot of network traffic.
The solution is easy: either you assign to $SCRATCH a directory that exists
on all your nodes (often this is the case for /tmp), or -- if that is not
possible -- you assign on the fly the correct workspace directory for the
node(s) you are submitting to (like in the PBS script from the other reply).
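
Since the PBS script from that reply is not reproduced here, a minimal
sketch of the second option, assuming csh and PBS -- all names and paths
are illustrative:

    #!/bin/csh
    #PBS -N wien2k-kpar
    #PBS -l nodes=10

    # node-local scratch, created on every node assigned to the job
    setenv SCRATCH /tmp/$USER.$PBS_JOBID
    foreach host (`sort -u $PBS_NODEFILE`)
        ssh $host mkdir -p $SCRATCH
    end

    cd $PBS_O_WORKDIR
    run_lapw -p            # the usual parallel SCF cycle

    # remove the node-local scratch directories afterwards
    foreach host (`sort -u $PBS_NODEFILE`)
        ssh $host rm -rf $SCRATCH
    end

This keeps only the small case files on the shared directory, while the
large intermediate files (case.vector etc.) stay on the local disks.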

Stefaan


> Hello,
>
>  	We are trying to run k-point parallel wien2k jobs on a Linux cluster
> which has a distributed file system. Though we are able to do k-point
> parallel calculations, we have a problem in assigning a common work
> space ($SCRATCH) for reading/writing all input/output files. This means
> that, for example, if we do a 10 k-point calculation on 10 nodes, all
> 10 nodes have to communicate with the common working area through ssh
> to read/write files. This slows down the performance and loads the network.
> So far we have done k-point parallel calculations on supercomputers
> with shared memory, and hence we never had such a problem. Is it
> possible to do k-point parallel calculations on a distributed file
> system without any common working area?
>
> I have received the following from the system expert here.
>
> ###
> Hmm, I've been looking through the jungle of scipts which constitutes 
> wien2k, and it is clear to me that this way of paralellizing isn't 
> meant for distributed filesystems (local disks on nodes). Unless the 
> wien2k people have a solution, I don't think we will get around this 
> without some major reprogramming. At least it seems so to me, but I 
> must admit that I don't have the complete overview of todo tasks.
>
> Also a quick google of the proble, did not provide a solution.
> This is very efficient for SMP types of machines, but is a bit ad-hoc 
> for cluster type computers.
> On the bright side, it doesn't seem taht the program does a lot of 
> disk read/write in the long run. Only 10-20 min bursts of 10 megs/sek.
> ####
>
> Looking forward to your responses on how to do the computation more
> efficiently.
>
> Best regards
> Ravi
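
For completeness: which node works on which k-points is steered by the
.machines file in the case directory. A minimal sketch for the 10-k-point,
10-node example above, with purely illustrative hostnames:

    # .machines -- one "weight:host" line per parallel lapw1/lapw2 job
    1:node01
    1:node02
    1:node03
    1:node04
    1:node05
    1:node06
    1:node07
    1:node08
    1:node09
    1:node10
    granularity:1

Each job still reads the common case files over the network; only the
$SCRATCH traffic can be kept node-local, which is why the answers above
concentrate on $SCRATCH.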
