[Wien] dstart_mpi error

karima Physique physique.karima at gmail.com
Thu Jul 19 15:03:51 CEST 2018


 *Dear Dr. Gavin Abo,*
The problem was actually solved by adding the hostname to the hosts file
on all the nodes, not only on the master node.
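
A minimal sketch of how this fix could be checked, assuming the hosts file
follows the usual "IP-address name [alias ...]" layout. The helper name
missing_hosts is hypothetical (it is not part of WIEN2k or of any MPI
package); it only prints the hostnames that have no entry in a given
hosts-format file.

```shell
#!/bin/sh
# Sketch: print every hostname from the argument list that has no entry
# in a hosts-format file. Matching is exact and case-sensitive.
# Usage: missing_hosts FILE NAME...
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        # Drop comments, then search the name/alias fields (field 2 onward).
        if ! awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file"; then
            echo "$name"
        fi
    done
}

# Possible use, run once per node (node names taken from the .machines file):
#   ssh node1 cat /etc/hosts > hosts.node1
#   missing_hosts hosts.node1 calcul.local master node1 node2
```

If the function prints nothing for the copy taken from every node, each hosts
file lists all the names. This only inspects the file itself; on systems that
also use DNS, running "getent hosts calcul.local" on each node shows what the
resolver actually returns.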

The calculation now works very well, but at each execution of LAPW0 in the
SCF cycle I get this error, which does not affect the calculations:
*"calcul.23539PSM2 no hfi units are available (err=23)"*

I would be grateful if you could help me solve this problem, even though it
does not affect the calculations.

On Thu, 19 Jul 2018 at 14:06, Gavin Abo <gsabo at crimson.ua.edu> wrote:

> A response off the mailing list:
>
> I currently do not know.  Your .machines file seems fine.
>
> I don't know for sure, but the calcul.local appears to be coming from your
> mpi program and maybe not WIEN2k.  You do not say what mpi program package
> (openmpi, intelmpi, MPICH, or other) you are using.  Though, from the error
> message, it looks like you may be using MPICH.  If you are using MPICH, I
> have limited experience with it.  So, you may have to ask the MPICH experts
> about the "unable to get host address" and "unable to connect to server"
> errors [ https://www.mpich.org/support/ ].
>
> Since you did not mention it, I assume you are using one of the latest WIEN2k
> versions (WIEN2k 18.1 or 18.2).  There may have been some WIEN2k mpi bugs
> in previous versions.  So, if you are using an older version, you may want
> to try the latest WIEN2k 18.2 version to see if it resolves the
> problem.
>
> You might try resolving the hostnames and checking the ip addresses.
>
> Check whether the ip address set in the hosts file for calcul.local is
> the same as or different from those of master, node1, and node2.
>
> For example, I think you can resolve the hostnames to ip addresses by
> running these terminal commands on the cluster:
> ping -c 1 calcul.local
> ping -c 1 master
> ping -c 1 node1
> ping -c 1 node2
>
> After doing the above ping commands on the master node, you may want to do
> the above ping commands while on each of the subnodes like node1 after
> first using for example:
>
> ssh node1
>
> For example, maybe the master can resolve the ip address of calcul.local.
> However, if you log in to node2 (ssh node2), maybe node2 cannot resolve
> the ip address of calcul.local.  That may be another possible cause of
> the problem.
>
> Unfortunately, since I don't have access to a system showing that exact
> error, it is hard to see why those errors are happening, as there seem
> to be many possible causes of that problem and not a single one.
>
> Kind Regards,
>
> Gavin
>
> On 7/19/2018 5:05 AM, karima Physique wrote:
>
> *Thank you, Dr. Gavin Abo.*
> *I checked the /etc/hosts file and it is OK,*
> *but why does lapw1_mpi work fine on all the nodes while dstart_mpi and
> lapw0_mpi do not work on the nodes?*
>
> On Thu, 19 Jul 2018 at 04:23, Gavin Abo <gsabo at crimson.ua.edu> wrote:
>
>> As the error message says, one possible cause is the connection being
>> blocked by a firewall.
>>
>> Another possible cause is an ssh passwordless access problem:
>>
>>
>> https://stackoverflow.com/questions/19565795/unable-to-execute-mpich2-on-multiple-machines-on-ubuntu-12-04-hydu-sock-connect
>>
>> Yet another possible cause is a problem resolving the DNS hostname:
>> https://forums.suse.com/archive/index.php/t-6057.html
>>
>> https://www.slothparadise.com/running-mpi-common-mpi-troubleshooting-problems/
>>
>> Since /etc/hosts usually cannot be edited by a user, the cluster
>> administrator would have to fix the hosts file if that happens to be the
>> source of the problem.
>>
>> On 7/18/2018 6:07 PM, karima Physique wrote:
>>
>> Dear wien2k users:
>>
>> Using the following .machines file:
>> lapw0:master:12
>> dstart:master:12
>> 1:master:12
>> 1:node1:12
>> 1:node2:12
>> ......
>> the calculation works very well, but using the following .machines file:
>> lapw0:master:12  node1:12  node2:12
>> dstart:master:12  node1:12  node2:12
>> 1:master:12
>> 1:node1:12
>> 1:node2:12
>> .......
>> I got the following error:
>>
>> unable to get host address calcul.local for (1)
>> unable to connect to server calcul.local at port 44295 (check for
>> firewalls!)
>>
>> Note that calcul.local is the hostname used to connect to w2web.
>> I would appreciate any suggestions for solving this problem.
>>
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> SEARCH the MAILING-LIST at:
>> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>>