[Wien] MPI execution without any SSH access?

Jan Oliver Oelerich jan.oliver.oelerich at physik.uni-marburg.de
Tue Aug 30 16:04:42 CEST 2016


Hi,

Thank you for your quick reply. I am going to investigate the 
parallel_options together with the admins of our cluster.

As for your questions:

a) I am able to correctly generate the .machines file, so at least I 
know the nodes on which the calculation takes place.

b) I will experiment with setrlimit() and see if I can patch W2kutils.c.
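
For the setrlimit() experiment, a minimal sketch of what I plan to try
is below. It is not the actual W2kutils.c code; the function name
raise_stack_soft_limit() is made up for illustration. It assumes that
one possible cause of "Invalid argument" (EINVAL) is a request for a
soft stack limit above the hard limit, so it first queries the hard
limit with getrlimit() and then raises the soft limit only as far as
that.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (NOT taken from W2kutils.c): raise the soft stack
 * limit to the current hard limit instead of requesting a fixed value.
 * Returns 0 on success, -1 on failure.
 */
int raise_stack_soft_limit(void)
{
    struct rlimit rl;

    /* Read the current soft (rlim_cur) and hard (rlim_max) limits. */
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        fprintf(stderr, "getrlimit(RLIMIT_STACK) failed: %s\n",
                strerror(errno));
        return -1;
    }

    /*
     * setrlimit() rejects rlim_cur > rlim_max with EINVAL, so never ask
     * for more than the hard limit. rlim_max may be RLIM_INFINITY.
     */
    rl.rlim_cur = rl.rlim_max;

    if (setrlimit(RLIMIT_STACK, &rl) != 0) {
        fprintf(stderr, "setrlimit(RLIMIT_STACK) failed: %s\n",
                strerror(errno));
        return -1;
    }
    return 0;
}
```

Calling raise_stack_soft_limit() once at program start and checking the
return value should show whether the hard limit imposed by the batch
system is the problem, or whether something else is going on.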

Cheers

On 30.08.2016 15:57, Laurence Marks wrote:
> This is not so easy, and also this is probably not the only issue you
> have. A few key points:
>
> 1) The default mechanism to connect is ssh, as this is the most
> common. It is set up when you run configure, but can be changed later.
> The expectation if you use ssh is that passwordless (key-based) login
> is set up (e.g. http://www.linuxproblem.org/art_9.html). Sometimes ssh
> is considered to be a security issue by admins, which can lead to many
> issues. Other commands such as rsh can be used -- but I have no idea
> if this will work on your system.
>
> 2) There is a subsidiary file "parallel_options" in $WIENROOT which can
> be used to override this (and other) parallel options.
>
> 3) Many large cluster admins believe that users will just want to run
> a single mpi job. Wien2k is much smarter than this, and exploits both
> mpi and k-point parallelization, which is useful because k-point
> parallelization is essentially 100% efficient (which mpi is not). You
> are going to have to read the documentation carefully to see how your
> particular system is configured, and pay attention to any local
> customization.
>
> 4) Unfortunately mpi host formats vary between systems, so you
> will need to do some work to find out what you have and edit
> parallel_options as needed. Peter has some scripts on the examples
> page, although I think the unsupported set of utilities SRC_mpiutil
> is better. They support a prior version of SGE, although your version
> may be different.
>
> There are two "red flags" in your output which you will need to understand:
>
> a) "PSI: Found batch system of GridEngine flavour. Ignoring any choices of
> nodes or hosts." You will need to know what the system is doing in
> terms of nodes/hosts.
>
> b) "setrlimit(): WARNING: Cannot raise stack limit, continuing:
> Invalid argument". This may be as simple as some format change being
> needed in W2kutils.c but could be a more serious issue. What compiler
> did you use?
>
> On Tue, Aug 30, 2016 at 8:22 AM, Jan Oliver Oelerich
> <jan.oliver.oelerich at physik.uni-marburg.de> wrote:
>> Dear Wien2k users,
>>
>> I am trying to set up Wien2k on a (mid-size) computation cluster running
>> an SGE queueing system. Now, I am a bit confused as to how Wien2k spawns
>> processes for MPI execution. I am used to the scheme, where mpirun takes
>> care of spawning its processes across the nodes assigned to the job and
>> automatically handles communication. In the Wien2k documentation,
>> however, it sounds as if the master process connects via SSH (or
>> similar) to the other nodes and starts something.
>>
>> I think I managed to compile and link everything correctly, but I am
>> unable to run fine-grained parallel jobs. In the stderr (see below) I
>> find, among other stuff I can't make any sense of, the following lines:
>> "Host key verification failed.", which sounds like some SSH is failing.
>>
>> Could you help me understand how MPI parallelization is handled in
>> Wien2k and how I could debug my calls? Is SSH really necessary?
>>
>> Best regards and thank you,
>> Jan Oliver Oelerich
>>
>>
>> =================== stderr =========================
>>
>> PairHess - Error. Check file pairhess.error.
>> 0.003u 0.026s 0:00.51 3.9%      0+0k 123208+48io 8pf+0w
>> cp: cannot stat `.minpair': No such file or directory
>> cp: cannot stat `.minpair': No such file or directory
>> PSI: Found batch system of GridEngine flavour. Ignoring any choices of
>> nodes or hosts.
>>   LAPW0 END
>> 0.023u 0.042s 0:04.54 1.3%      0+0k 200+88io 1pf+0w
>> [1]  + 15021 Running                       ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> [1]  + 15021 Running                       ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> [2]  - 15039 Running                       ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> [1]  + 15021 Running                       ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> [2]  - 15039 Running                       ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> [3]    15067 Running                       ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> Host key verification failed.
>> Host key verification failed.
>> [2]  - Done                          ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> [1]  - Done                          ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> Host key verification failed.
>> [3]  + 15067 Running                       ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> [4]  + 15090 Running                       ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> [3]    Done                          ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> Host key verification failed.
>> [4]  + Done                          ( ( $remote $machine[$p] "cd
>> $PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm
>> -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop )
>> bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop
>>  >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
>> GaAs-Jan.scf1_1: No such file or directory.
>> 0.132u 0.501s 0:02.52 25.0%     0+0k 1304+1048io 7pf+0w
>> grep: *scf1*: No such file or directory
>> setrlimit(): WARNING: Cannot raise stack limit, continuing: Invalid argument
>> FERMI - Error
>> cp: cannot stat `.in.tmp': No such file or directory
>> 0.047u 0.086s 0:00.19 63.1%     0+0k 4488+200io 1pf+0w
>>
>>
>> --
>> Dr. Jan Oliver Oelerich
>> Faculty of Physics and Material Sciences Center
>> Philipps-Universität Marburg
>>
>> Addr.: Room 02D35, Hans-Meerwein-Straße 6, 35032 Marburg, Germany
>> Phone: +49 6421 2822260
>> Mail : jan.oliver.oelerich at physik.uni-marburg.de
>> Web  : http://academics.oelerich.org
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
>
>

-- 
Dr. Jan Oliver Oelerich
Faculty of Physics and Material Sciences Center
Philipps-Universität Marburg

Addr.: Room 02D35, Hans-Meerwein-Straße 6, 35032 Marburg, Germany
Phone: +49 6421 2822260
Mail : jan.oliver.oelerich at physik.uni-marburg.de
Web  : http://academics.oelerich.org

