[Wien] k-point parallel job in distributed file system

XU ZUO xzuo at nankai.edu.cn
Fri Aug 18 02:18:16 CEST 2006


Thank you for your help. 

I am reading the Linux NFS HOWTO
(http://www.tldp.org/HOWTO/NFS-HOWTO/index.html). In Chapter 5, this doc
does mention the NIC driver problem. I hope that I can find the problem.

Xu Zuo

-----Original Message-----
From: wien-bounces at zeus.theochem.tuwien.ac.at
[mailto:wien-bounces at zeus.theochem.tuwien.ac.at] On Behalf Of B. Yanchitsky
Sent: Friday, August 18, 2006 12:18 AM
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] k-point parallel job in distributed file system

Below is the problematic Ethernet controller:

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.: Unknown device 8168 (rev 01)

and the driver:

r1000-8111b(102).zip

The driver is provided by Realtek.

Bogdan



B. Yanchitsky wrote:
> I had a problem with NFS: large files (~500Mb) were not copied correctly
> and the transfer rate was low (something like 100Kb/sec instead of
> 10Mb/sec). I tried various Linux distributions and spent much time
> trying to locate the problem. After a month it turned out that the
> problem was due either to the on-board Ethernet card or, most probably,
> to a broken driver for that card.
> With another driver and Ethernet card I don't have problems.
> At the very least, check that large files are copied correctly and that
> the transfer rate is good.
> 
> Regards,
> Bogdan
> 
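The check suggested above (large files arrive intact, transfer rate is sane) could be scripted roughly as follows. This is only a sketch in Python, which is not used anywhere in this thread; the names sha256_of and check_copy, the test file name and the 500 MB default are assumptions made for illustration.

```python
import hashlib
import os
import shutil
import tempfile
import time


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copy(target_dir, size_mb=500):
    """Copy a random test file into target_dir; return (intact, MB per second)."""
    # Write size_mb MiB of random data to a local temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as source:
        for _ in range(size_mb):
            source.write(os.urandom(1 << 20))
        source_path = source.name
    copy_path = os.path.join(target_dir, "nfs_copy_test.bin")  # hypothetical name
    try:
        start = time.perf_counter()
        shutil.copyfile(source_path, copy_path)
        elapsed = time.perf_counter() - start
        # Compare checksums of the source and of the copy.
        intact = sha256_of(source_path) == sha256_of(copy_path)
        return intact, size_mb / max(elapsed, 1e-9)
    finally:
        # Remove both test files again.
        os.remove(source_path)
        if os.path.exists(copy_path):
            os.remove(copy_path)
```

Calling check_copy on a directory that lives on the NFS mount returns whether the checksums match and an approximate rate. Two caveats: the copy may be read back from the client's cache, so comparing the checksum from a second node is a stricter test, and the timing covers only the copy call, which may return before all data has reached the server.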
> XU ZUO wrote:
> 
>>Unfortunately, I am suffering from instability of the k-point
>>parallelization. I understand that this problem is caused by poor
>>NFS performance (problems with read/write latency and synchronization)
>>and that adjusting $delay and $sleepy may solve the problem. However,
>>as the cluster load and traffic are dynamic, it would be better to design
>>adaptive code that can handle this problem dynamically.
>>
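To make the idea of adaptive waiting concrete: instead of a fixed pause, a script could poll for an expected file and lengthen the pause only while the file system is lagging. The sketch below is an assumption about what such code might look like, not a description of how the WIEN2k scripts work; the function name wait_until_stable and all parameter values are hypothetical.

```python
import os
import time


def wait_until_stable(path, timeout=300.0, first_delay=0.5, max_delay=30.0):
    """Wait until path exists and its size is unchanged between two polls.

    Returns True when the file looks complete, False on timeout.
    """
    deadline = time.monotonic() + timeout
    delay = first_delay
    last_size = None
    while True:
        try:
            size = os.stat(path).st_size
        except FileNotFoundError:
            size = None  # file not visible on this node yet
        if size is not None and size == last_size:
            return True  # same size on two consecutive polls
        last_size = size
        if time.monotonic() >= deadline:
            return False
        # Sleep, but never past the deadline; then back off exponentially.
        time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
        delay = min(delay * 2, max_delay)
```

On a quiet network this returns after roughly one short pause, while under load the pauses grow up to max_delay. An unchanged size is only a heuristic for "finished writing", and NFS attribute caching can delay what a client sees, so the values would still need tuning per cluster.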
>>-----Original Message-----
>>From: wien-bounces at zeus.theochem.tuwien.ac.at
>>[mailto:wien-bounces at zeus.theochem.tuwien.ac.at] On Behalf Of Stefaan 
>>Cottenier
>>Sent: Thursday, August 17, 2006 7:57 PM
>>To: wien at zeus.theochem.tuwien.ac.at
>>Subject: Re: [Wien] k-point parallel job in distributed file system
>>
>>You say you were able to do a k-point parallel run, but it was slow.
>>This means that all your nodes can access a common place (where your
>>case.struct etc. are). Your problem probably is that you have put
>>$SCRATCH also in that same directory, which indeed causes a lot of
>>network traffic.
>>The solution is easy: either you assign to $SCRATCH a directory that
>>exists on all your nodes (often this is the case for /tmp), or -- if
>>that is not possible -- you assign on the fly the correct workspace
>>directory for the node(s) you are submitting to (as in the PBS script
>>from the other reply).
>>
>>Stefaan
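As an illustration of the second option (choosing the workspace on the fly), the selection of a node-local directory might look like the sketch below. It is written in Python purely for illustration and is not the PBS script mentioned above; the function names, the candidate paths and the "wien_" prefix are assumptions, and the code does not verify that a candidate is really on a local disk rather than on NFS.

```python
import os
import tempfile


def pick_local_scratch(candidates=("/scratch", "/tmp"), job_id=None):
    """Create and return a per-job directory under the first writable candidate."""
    # Fall back from an explicit id to the PBS job id, then to the process id.
    job_id = job_id or os.environ.get("PBS_JOBID") or str(os.getpid())
    for base in candidates:
        if os.path.isdir(base) and os.access(base, os.W_OK):
            return tempfile.mkdtemp(prefix=f"wien_{job_id}_", dir=base)
    raise RuntimeError(
        "no writable node-local directory among: " + ", ".join(candidates)
    )


def environment_with_scratch(scratch_dir):
    """Return a copy of the current environment with SCRATCH set to scratch_dir."""
    env = dict(os.environ)
    env["SCRATCH"] = scratch_dir
    return env
```

The returned environment can be handed to a child process started on that node. Processes launched on other nodes would need the same assignment made there, which is why the directory has to exist on every node.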
>>
>>
>>
>>
>>>Hello,
>>>
>>>We are trying to run a k-point parallel WIEN2k job on a Linux cluster
>>>that has a distributed file system. Though we are able to do a k-point
>>>parallel calculation, we have a problem in assigning a common work
>>>space ($SCRATCH) to read/write all input/output files. This means
>>>that, for example, if we do a 10 k-point calculation on 10 nodes, all
>>>10 nodes have to communicate with the common working area through
>>>ssh to read/write files. This slows down the performance and also the
>>>network.
>>>So far we have done k-point parallel calculations on supercomputers
>>>with shared memory and hence we never had such a problem. Is it
>>>possible to do k-point parallel calculations on a distributed file
>>>system without any common working area?
>>>
>>>I have received the following from the system expert here.
>>>
>>>###
>>>Hmm, I've been looking through the jungle of scripts which constitutes
>>>wien2k, and it is clear to me that this way of parallelizing isn't
>>>meant for distributed filesystems (local disks on nodes). Unless the
>>>wien2k people have a solution, I don't think we will get around this
>>>without some major reprogramming. At least it seems so to me, but I
>>>must admit that I don't have the complete overview of todo tasks.
>>>
>>>Also, a quick google of the problem did not provide a solution.
>>>This is very efficient for SMP types of machines, but is a bit ad hoc
>>>for cluster type computers.
>>>On the bright side, it doesn't seem that the program does a lot of
>>>disk read/write in the long run. Only 10-20 min bursts of 10 megs/sec.
>>>####
>>>
>>>Looking forward to your responses on how to do the computation more efficiently.
>>>
>>>Best regards
>>>Ravi
>>
>>
>>
>>_______________________________________________
>>Wien mailing list
>>Wien at zeus.theochem.tuwien.ac.at
>>http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>
>>
> 
> 
