[Wien] MPI execution without any SSH access?

Tue Aug 30 16:21:12 CEST 2016

OK. If possible, try to write a robust patch, i.e. one which is portable.
If you get one working, please send it.

For reference, that call is important in most cases for larger
problems. It is related to the stack size set for the user (ulimit). If
this is too small, large Fortran (and other) programs can crash.
Unfortunately the default value is sometimes set too small by admins,
and it can be hard to set it via shell arguments on remote nodes
(particularly with openmpi). That call sets it to the largest allowed
value.
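
To see what a given node actually allows, a small check along the
following lines may help. This is only a sketch and not part of Wien2k:
the helper name get_stack_limits is made up for illustration, and it
assumes a POSIX system providing getrlimit().

```c
#include <sys/resource.h>

/* Hypothetical helper, not from W2kutils.c: read the current (soft) and
 * maximum (hard) stack limits in bytes. A value equal to RLIM_INFINITY
 * means "unlimited". Returns 0 on success, -1 if getrlimit() fails. */
int get_stack_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit limit;

    if (getrlimit(RLIMIT_STACK, &limit) != 0)
        return -1;
    *soft = limit.rlim_cur;   /* the limit currently enforced */
    *hard = limit.rlim_max;   /* ceiling the soft limit may be raised to */
    return 0;
}
```

If the soft value comes back smaller than the hard one on a compute
node, there is room for the program to raise it at startup.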

Unfortunately the C interface is not fully standard across systems, so
the exact parameter to use can vary. I know it does with the Apple
version of unix. It might be as simple as adding an "#ifdef ..." clause
that checks for whatever OS you have and sets the value accordingly,
e.g. change the assignment along these lines:

#ifdef __APPLE__
    limit.rlim_cur = limit.rlim_max ; /* RLIM_INFINITY */
#else
    limit.rlim_cur = RLIM_INFINITY ;
#endif
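
A possibly more portable variant, given only as a sketch: read the
limits first with getrlimit() and then raise the soft limit to the hard
limit, which POSIX allows for an unprivileged process. That would avoid
the per-OS #ifdef. The function name raise_stack_limit is hypothetical,
and this is not the actual code in W2kutils.c; it would need testing on
the systems in question.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical replacement, not the real W2kutils.c routine: raise the
 * soft stack limit to the hard limit. Returns 0 on success and -1 on
 * failure, printing a warning but leaving the limits unchanged. */
int raise_stack_limit(void)
{
    struct rlimit limit;

    if (getrlimit(RLIMIT_STACK, &limit) != 0) {
        perror("getrlimit(): WARNING: Cannot read stack limit, continuing");
        return -1;
    }

    /* rlim_max is the largest value this process may request, whether
     * it is a finite number or RLIM_INFINITY. */
    limit.rlim_cur = limit.rlim_max;

    if (setrlimit(RLIMIT_STACK, &limit) != 0) {
        perror("setrlimit(): WARNING: Cannot raise stack limit, continuing");
        return -1;
    }
    return 0;
}
```

Because the new soft limit is taken from what the system itself reports,
the request should not be rejected as an invalid argument, although the
resulting stack may still be smaller than a large calculation needs if
the admins have set a low hard limit.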

On Tue, Aug 30, 2016 at 9:04 AM, Jan Oliver Oelerich
<jan.oliver.oelerich at physik.uni-marburg.de> wrote:
> Hi,
>
> Thank you for your quick reply. I am going to investigate the
> parallel_options together with the admins of our cluster.
>
> As for your questions:
>
> a) I am able to correctly generate the .machines file, so at least I
> know the nodes on which the calculation takes place.
>
> b) I will experiment with setrlimit() and see if I can patch W2kutils.c.
>
> Cheers
>
> On 30.08.2016 15:57, Laurence Marks wrote:
>> This is not so easy, and also this is probably not the only issue you
>> have. A few key points:
>>
>> 1) The default mechanism to connect is ssh, as this is the most
>> common. It is set up when you run configure, but can be changed later.
>> The expectation if you use ssh is that keyless login is set up (e.g.
>> http://www.linuxproblem.org/art_9.html ). Sometimes ssh is considered
>> to be a security issue by admins, which can lead to many issues. Other
>> commands such as rsh can be used -- but I have no idea if this will
>> work on your system.
>>
>> 2) There is a subsidiary file "parallel_options" in $WIENROOT which can
>> be used to override this (and other) parallel options.
>>
>> 3) Many large cluster admins believe that users will just want to run
>> a single mpi job. Wien2k is much smarter than this, and exploits both
>> mpi and k-point parallelization, useful as k-point parallelization is
>> essentially 100% efficient (which mpi is not). You are going to have
>> to read carefully the documentation on how your particular system is
>> configured, and pay attention to any local customization.
>>
>> 4) Unfortunately mpi host formats vary with different systems, so you
>> will need to do some work to find out what you have and edit
>> parallel_options as needed. Peter has some scripts in the examples page,
>> although I think the unsupported set of utilities SRC_mpiutil are
>> better. They support a prior version of SGE, although your version may
>> be different.
>>
>> There are two "red flags" in your output which you will need to understand:
>>
>> a) "PSI: Found batch system of GridEngine flavour. Ignoring any choices of
>> nodes or hosts." You will need to know what the system is doing in
>> terms of nodes/hosts.
>>
>> b) "setrlimit(): WARNING: Cannot raise stack limit, continuing:
>> Invalid argument". This may be as simple as some format change being
>> needed in W2kutils.c but could be a more serious issue. What compiler
>> did you use?
>>
>> On Tue, Aug 30, 2016 at 8:22 AM, Jan Oliver Oelerich
>> <jan.oliver.oelerich at physik.uni-marburg.de> wrote:
>>> Dear Wien2k users,
>>>
>>> I am trying to set up Wien2k on a (mid-size) computation cluster running
>>> an SGE queueing system. Now, I am a bit confused as to how Wien2k spawns
>>> processes for MPI execution. I am used to the scheme where mpirun takes
>>> care of spawning its processes across the nodes assigned to the job and
>>> automatically handles communication. In the Wien2k documentation,
>>> however, it sounds as if the master process connects via SSH (or
>>> similar) to the other nodes and starts something.
>>>
>>> I think I managed to compile and link everything correctly, but I am
>>> unable to run fine-grained parallel jobs. In the stderr (see below) I
>>> find, among other stuff I can't make any sense of, the following lines:
>>> "Host key verification failed.", which sounds like some SSH is failing.
>>>
>>> Could you help me understand how MPI parallelization is handled in
>>> Wien2k and how I could debug my calls? Is SSH really necessary?
>>>
>>> Best regards and thank you,
>>> Jan Oliver Oelerich
>>>
>>>
>>> =================== stderr =========================
>>>
>>> PairHess - Error. Check file pairhess.error.
>>> 0.003u 0.026s 0:00.51 3.9%      0+0k 123208+48io 8pf+0w
>>> cp: cannot stat `.minpair': No such file or directory
>>> cp: cannot stat `.minpair': No such file or directory
>>> PSI: Found batch system of GridEngine flavour. Ignoring any choices of
>>> nodes or hosts.
>>>   LAPW0 END
>>> 0.023u 0.042s 0:04.54 1.3%      0+0k 200+88io 1pf+0w
>>> [1]  + 15021 Running                       ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> [1]  + 15021 Running                       ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> [2]  - 15039 Running                       ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> [1]  + 15021 Running                       ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> [2]  - 15039 Running                       ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> [3]    15067 Running                       ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> Host key verification failed.
>>> Host key verification failed.
>>> [2]  - Done                          ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> [1]  - Done                          ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> Host key verification failed.
>>> [3]  + 15067 Running                       ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> [4]  + 15090 Running                       ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> [3]    Done                          ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> Host key verification failed.
>>> [4]  + Done                          ( ( $remote $machine[$p] "cd
>>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>>> GaAs-Jan.scf1_1: No such file or directory.
>>> 0.132u 0.501s 0:02.52 25.0%     0+0k 1304+1048io 7pf+0w
>>> grep: *scf1*: No such file or directory
>>> setrlimit(): WARNING: Cannot raise stack limit, continuing: Invalid argument
>>> FERMI - Error
>>> cp: cannot stat `.in.tmp': No such file or directory
>>> 0.047u 0.086s 0:00.19 63.1%     0+0k 4488+200io 1pf+0w
>>>
>>>
>>> --
>>> Dr. Jan Oliver Oelerich
>>> Faculty of Physics and Material Sciences Center
>>> Philipps-Universität Marburg
>>>
>>> Addr.: Room 02D35, Hans-Meerwein-Straße 6, 35032 Marburg, Germany
>>> Phone: +49 6421 2822260
>>> Mail : jan.oliver.oelerich at physik.uni-marburg.de
>>> Web  : http://academics.oelerich.org
>>> _______________________________________________
>>> Wien mailing list
>>> Wien at zeus.theochem.tuwien.ac.at
>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>>> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>>
>>
>>
>

-- 
Professor Laurence Marks
"Research is to see what everybody else has seen, and to think what
nobody else has thought", Albert Szent-Gyorgi
www.numis.northwestern.edu ; Corrosion in 4D: MURI4D.numis.northwestern.edu
Partner of the CFW 100% program for gender equity, www.cfw.org/100-percent
Co-Editor, Acta Cryst A