[Wien] mpd invalid port info, MPI problem

Peter Blaha pblaha at theochem.tuwien.ac.at
Fri Sep 26 09:10:39 CEST 2008


I'm not sure my analysis will be 100% correct, since this is difficult without
doing things myself and I don't understand all of your output/input.

Anyway:

With this .machines file you are trying to do
a) k-point parallelization on two nodes, named "master" and "node".
For this to work, you must be able to run "ssh node" and "ssh master"
without a password, and you mentioned that this works fine.

b) In addition, with this .machines file you request that on "master" you run
a parallel mpi-job with 2 processors, while on "node" you run an mpi-job with 4 processors.
Please note that this means you must be able to start an mpi-job on "node",
not just on "master" (you may have tested the latter, but not the former).
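
As a rough aid for checking both requirements, the hosts and processor counts
requested in a .machines file can be listed with a few lines of shell. This is
only a sketch: the function name list_machines is made up for illustration, and
it handles only the simple "weight:host:nprocs" lines quoted in this thread.

```shell
# Sketch: list the hosts and MPI process counts requested in a .machines file.
# Only lines that start with a numeric weight are considered; keyword lines
# such as "granularity:1" and "extrafine:1" are skipped.
list_machines() {
    awk -F: '/^[0-9]+:/ { n = ($3 == "" ? 1 : $3); print $2, n }' "${1:-.machines}"
}

# Possible use: confirm that every listed host accepts ssh without a password.
#   list_machines .machines | while read -r host n; do
#       ssh "$host" hostname </dev/null
#   done
```

The same loop could run a small mpi test on each host instead of "hostname",
using whatever mpi-starter is configured for WIEN2k.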

> I installed intel MPI on a cluster and now I am trying to run Wien on the master and one node. k-parallelisation works. For fine grain, my machines file looks something like
> 
> 1:master:2
> 1:node:4
> granularity:1
> extrafine:1
> 
> lapw0 is done, of course, and in lapw1 I get the following dayfile:
> running LAPW1 in parallel mode (using .machines)
> 2 number_of_parallel_jobs
>      snode7 snode7 snode7 snode7(1) mpdboot_snode7 (handle_mpd_output 589): from mpd on iacgu1, invalid port info:

I don't quite understand how lapw1para produces "snode7" with your .machines file.
But since "snode7" is listed 4 times, maybe your .machines file has "snode7:4"?

> /bin/sh: rsh: command not found
This message is most likely the most important one: something is trying to invoke "rsh".
This could be lapw1para (if you specified rsh and not ssh in siteconfig_lapw) or,
more likely, your mpi-starter. Make sure that "mpdrun" (or whatever you use) is configured
correctly.
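
One way to narrow this down is to search the WIEN2k scripts for "rsh". The
sketch below assumes that lapw1para and parallel_options are both found in
$WIENROOT; the helper name find_rsh is made up for illustration.

```shell
# Sketch: print every line that mentions the word "rsh" in the given files.
# /dev/null is appended so that grep always prefixes matches with the file name.
find_rsh() {
    grep -n -w 'rsh' "$@" /dev/null
}

# Possible use (paths are assumptions about a typical installation):
#   find_rsh "$WIENROOT/parallel_options" "$WIENROOT/lapw1para"
#   command -v rsh ssh    # shows which remote shells are actually installed
```

If nothing in WIEN2k refers to rsh, the mpi-starter is the likely culprit. Many
mpdboot versions accept "-r ssh" (or "--rsh=ssh") to select the remote shell,
but please check "mpdboot --help" for your installation.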

> 
>      iacgu1 iacgu1(1) Using    1 processors
> scalapack processors array (row,col):   1   1
> Using    1 processors
> scalapack processors array (row,col):   1   1

This I also don't quite understand. "iacgu1"? Only ONE node? (Maybe it always starts a
job on the local node; this can be avoided by some switch.)

> I started the mpdaemon on master and node and successfully executed commands via mpdrun -l -n 2 command
> But this only works with mpd & and ssh node mpd -h master -p port -d, when I try mpdboot -n 2 -f ~/mpd.hosts (which Wien seems to invoke) I get
> mpdboot_master (handle_mpd_output 589): from mpd on node.domain.name, invalid port info:
> node.domain.name: Connection refused
Wien invokes the mpi command you specified during siteconfig_lapw
(see the file parallel_options in $WIENROOT).
You can define only ONE WIEN_MPIRUN command, i.e. it must be the same command
on master and on node.

I think you first have to start mpd "manually" (or in some script) on all required nodes,
but the WIEN_MPIRUN command should be the mpdrun ... command. (Do you really want the -l
switch?)
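
Starting the daemons "in some script" could look roughly like the sketch below.
It only prints the commands you quoted (mpd & on the master, then
ssh node mpd -h master -p port -d) and does not run them. The function name
mpd_ring_cmds and the port number 12345 are placeholders; the real port is the
one your master mpd reports.

```shell
# Sketch: print (not run) the commands that bring up an mpd ring by hand.
# Arguments: master host name, mpd port on the master, then the other nodes.
mpd_ring_cmds() {
    master=$1
    port=$2
    shift 2
    echo "mpd &"
    for node in "$@"; do
        echo "ssh $node mpd -h $master -p $port -d"
    done
}

# Example with a placeholder port:
#   mpd_ring_cmds master 12345 node
```

Once the ring is up, WIEN_MPIRUN only needs to contain the mpdrun call itself.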

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------

