[Wien] error in k-point parallel execution

C J Kenneth Tan -- OptimaNumerics cjtan at OptimaNumerics.com
Wed Jul 28 14:02:24 CEST 2004


Dear Mahmoud,

Assuming that the network for your cluster is a private network,
separated from the rest of the network, you might want to just use rsh
instead of ssh (rsh has much lower overhead than ssh, and security
would not be a concern if the machines are on a private network).  

NFS: The nodes on the cluster need to be able to read the same files.
So you need to configure NFS and mount the shared file store on every
machine (maintain the same path on all of them!).
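
A minimal sketch, in Python, of one way to confirm from the master node
that a directory is visible at the same path on every host listed in
.machines. It assumes passwordless rsh/ssh already works and that only
simple "weight:host" lines are used, as in the file quoted below. The
function names and the DIR_OK marker are made up for illustration; they
are not part of WIEN2k.

```python
import shlex
import subprocess


def parse_machines(text):
    """Return host names from simple "weight:host" lines of a .machines file."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        # "1:condmat2" starts with a numeric weight; "granularity:1" and
        # "extrafine:1" are options, not hosts, so they are skipped.
        if key.isdigit() and value.strip():
            hosts.append(value.strip())
    return hosts


def check_command(remote, host, path):
    """Build the argv that asks `host` whether directory `path` exists there."""
    # Print a marker instead of relying on the exit status, since rsh
    # does not generally pass the remote exit status back to the caller.
    return [remote, host, "test -d " + shlex.quote(path) + " && echo DIR_OK"]


def hosts_missing_dir(hosts, path, remote="ssh", timeout=30):
    """Return the hosts on which `path` could not be confirmed as a directory."""
    missing = []
    for host in hosts:
        try:
            result = subprocess.run(
                check_command(remote, host, path),
                capture_output=True, text=True, timeout=timeout,
            )
            found = "DIR_OK" in result.stdout
        except (OSError, subprocess.TimeoutExpired):
            # Remote shell not installed, or the login hung past the timeout.
            found = False
        if not found:
            missing.append(host)
    return missing
```

For example, hosts_missing_dir(parse_machines(open(".machines").read()),
"/home/user/case", remote="rsh") would list the nodes where the
(hypothetical) case directory /home/user/case is not found. Note that
this only checks that the path exists on each node, not that it is the
same NFS export everywhere.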


Kenneth Tan
-----------------------------------------------------------------------
C. J. Kenneth Tan, Ph.D.
OptimaNumerics Ltd.
E-mail: cjtan at OptimaNumerics.com      Telephone: +44 798 941 7838
Web: http://www.OptimaNumerics.com    Facsimile: +44 289 066 3015
-----------------------------------------------------------------------




On 2004-07-28 16:03 +0430 Mahmoud Payami (mpayami at aeoi.org.ir) wrote:

> Date: Wed, 28 Jul 2004 16:03:53 +0430
> From: Mahmoud Payami <mpayami at aeoi.org.ir>
> Reply-To: wien at zeus.theochem.tuwien.ac.at
> To: wien at zeus.theochem.tuwien.ac.at
> Subject: Re: [Wien] error in k-point parallel execution
> 
> Dear Dr. Andersen,
> 
> >
> > I would need to see your *lapw1*.error files, unless they just say
> > "error in lapw1", but I would suspect they contain more information
> > since the one on your initial host finished.
> 
> 
> Here is the "lapw1.error" content:
> --------------
> **  Error in Parallel LAPW1
> **  LAPW1 STOPPED at Wed Jul 28 14:06:03 EDT 2004
> **  check ERROR FILES!
> --------------------------
> Only one other file exists, "lapw1_1.error", which is empty; there are no
> other "lapw1_x.error" files (x=2,3,4,5).
> 
> >
> > Do you have a common nfs-mounted home directory?
> >
> I am a novice in Linux and do not understand what this means. But I have
> installed the "nfs-utils" package on all PCs. Could you please let me know
> what I should do in order to meet this condition?
> 
> > Is the scratch directory in the same location on all machines?
> 
> I have installed wien on all nodes with the same specifications.
> 
> >
> > Have you configured $remote in wien2k to ssh?
> 
> Yes. I have chosen "ssh" in the "siteconfig_lapw" step.
> 
> > Can we see your .machines file?
> 
> Here is the ".machines" file content:
> -----------
> 1:localhost
> 1:condmat2
> 1:condmat3
> 1:condmat4
> 1:condmat5
> granularity:1
> extrafine:1
> 
> -----------------
> 
> Kind regards,
> Mahmoud Payami
> 
> 
> 
> 
> 
> 
> 
> > Best regards,
> > Torsten Andersen.
> >
> > Mahmoud Payami wrote:
> > > Dear Dr. Torsten Andersen,
> > >
> > > Thank you very much for your comment. I have reconfigured the hosts and
> > > passwordless ssh is possible from master to nodes and vice versa.
> > > I receive more or less the same error:
> > >
> > > ----------------------------
> > >  LAPW0 END
> > >  LAPW1 END
> > > 0.39user 0.04system 0:00.43elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> > > 0inputs+0outputs (0major+5918minor)pagefaults 0swaps
> > > LAPW1 - Error
> > > 0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
> > > 0inputs+0outputs (0major+204minor)pagefaults 0swaps
> > > LAPW1 - Error
> > > 0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
> > > 0inputs+0outputs (0major+202minor)pagefaults 0swaps
> > > LAPW1 - Error
> > > 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> > > 0inputs+0outputs (0major+202minor)pagefaults 0swaps
> > > LAPW1 - Error
> > > 0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
> > > 0inputs+0outputs (0major+204minor)pagefaults 0swaps
> > >
> > > ---------------------------------------------
> > > My own analysis based on your comment is that only the part dedicated to
> > > localhost is performed without any problem but the remote hosts did not
> > > contribute.
> > > I checked, and a passwordless ssh login to a remote host takes about
> > > 10 seconds. Could it be some "timeout" error? If so, how can it be
> > > fixed?
> > >
> > > Thank you in advance.
> > >
> > > Kind regards,
> > > Mahmoud Payami
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >>Dear Mr. Payami,
> > >>
> > >>you can only use nodes for which a key exists in the list of known hosts.
> > >>Otherwise it will exit at the password prompt.
> > >>
> > >>Mahmoud Payami wrote:
> > >>
> > >>>Dear Wien Users & Developers,
> > >>>
> > >>>I noticed that in naming the nodes, one should not use the symbol "_".
> > >>>However, when I changed the names and did not use that symbol, I
> > >>>encountered the following new error in running scf:
> > >>>
> > >>>-------------------------------
> > >>> 0inputs+0outputs (8major+195minor)pagefaults 0swaps
> > >>>0.00user 0.00system 0:00.10elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
> > >>>LAPW1 - Error
> > >>>0inputs+0outputs (11major+192minor)pagefaults 0swaps
> > >>>0.00user 0.00system 0:00.10elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
> > >>>LAPW1 - Error
> > >>>0inputs+0outputs (0major+11620minor)pagefaults 0swaps
> > >>>1.21user 0.20system 0:01.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> > >>> LAPW1 END
> > >>>0inputs+0outputs (8major+196minor)pagefaults 0swaps
> > >>>0.00user 0.00system 0:00.08elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
> > >>>LAPW1 - Error
> > >>>Warning: Permanently added 'condmat1' (RSA) to the list of known hosts.
> > >>
> > >>Here! This Warning tells me that you have not initiated your hosts properly.
> > >>
> > >>> LAPW0 END
> > >>>---------------------------------------------------------------------------
> > >>>
> > >>>I would be grateful for any comment.
> > >>>
> > >>>Kindest regards,
> > >>>
> > >>>M. Payami
> > >>>
> > >>
> > >>Best regards,
> > >>Torsten Andersen.
> > >>
> > >>-- 
> > >>Dr. Torsten Andersen        TA-web: http://deep.at/myspace/
> > >>AG Hübner, Department of Physics, Kaiserslautern University
> > >>http://cmt.physik.uni-kl.de    http://www.physik.uni-kl.de/
> > >>
> > >>_______________________________________________
> > >>Wien mailing list
> > >>Wien at zeus.theochem.tuwien.ac.at
> > >>http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > >>
> > >>

