[Wien] MPI execution without any SSH access?

Jan Oliver Oelerich jan.oliver.oelerich at physik.uni-marburg.de
Tue Aug 30 15:22:11 CEST 2016


Dear Wien2k users,

I am trying to set up Wien2k on a mid-size compute cluster running
an SGE queueing system, and I am a bit confused as to how Wien2k spawns
processes for MPI execution. I am used to the scheme where mpirun takes
care of spawning its processes across the nodes assigned to the job and
automatically handles communication. In the Wien2k documentation,
however, it sounds as if the master process connects via SSH (or
similar) to the other nodes and starts processes there itself.

I think I managed to compile and link everything correctly, but I am
unable to run fine-grained parallel jobs. In the stderr (see below) I
find, among other output I cannot make sense of, several occurrences of
the line "Host key verification failed.", which suggests that an SSH
connection is failing.

Could you help me understand how MPI parallelization is handled in 
Wien2k and how I could debug my calls? Is SSH really necessary?
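
One way to check the SSH suspicion outside of Wien2k might be a small test in the job script, sketched below. It is only a rough sketch under a few assumptions: that the job runs in an SGE parallel environment so that $PE_HOSTFILE is set, and that each line of that file starts with the host name. The function names and the REMOTE_CMD variable are made up for illustration and are not part of Wien2k or SGE.

```shell
#!/bin/sh
# Sketch: test non-interactive SSH to every node SGE assigned to this job.
# Assumption: $PE_HOSTFILE holds one "host slots queue processors" line per node.

# Print the host name (first column) of each non-empty line of the file.
list_hosts() {
    awk 'NF > 0 { print $1 }' "$1"
}

# Run a no-op command on each host. BatchMode=yes makes ssh fail instead
# of prompting, which mimics what happens inside a batch job.
# REMOTE_CMD is a hypothetical override, so another launcher can be tried.
check_hosts() {
    status=0
    for host in $(list_hosts "$1"); do
        if ${REMOTE_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5} \
               "$host" true </dev/null >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            status=1
        fi
    done
    return $status
}

# Only run the check inside an SGE job, where PE_HOSTFILE is set.
if [ -n "${PE_HOSTFILE:-}" ]; then
    check_hosts "$PE_HOSTFILE"
fi
```

If every host is reported as FAIL here, the problem would presumably lie in the SSH setup between the nodes (host keys, key-based login) rather than in the Wien2k build itself.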

Best regards and thank you,
Jan Oliver Oelerich


=================== stderr =========================

PairHess - Error. Check file pairhess.error.
0.003u 0.026s 0:00.51 3.9%	0+0k 123208+48io 8pf+0w
cp: cannot stat `.minpair': No such file or directory
cp: cannot stat `.minpair': No such file or directory
PSI: Found batch system of GridEngine flavour. Ignoring any choices of 
nodes or hosts.
  LAPW0 END
0.023u 0.042s 0:04.54 1.3%	0+0k 200+88io 1pf+0w
[1]  + 15021 Running                       ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[1]  + 15021 Running                       ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[2]  - 15039 Running                       ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[1]  + 15021 Running                       ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[2]  - 15039 Running                       ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[3]    15067 Running                       ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
Host key verification failed.
Host key verification failed.
[2]  - Done                          ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[1]  - Done                          ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
Host key verification failed.
[3]  + 15067 Running                       ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[4]  + 15090 Running                       ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[3]    Done                          ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
Host key verification failed.
[4]  + Done                          ( ( $remote $machine[$p] "cd 
$PWD;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm 
-f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) 
bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop 
 >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
GaAs-Jan.scf1_1: No such file or directory.
0.132u 0.501s 0:02.52 25.0%	0+0k 1304+1048io 7pf+0w
grep: *scf1*: No such file or directory
setrlimit(): WARNING: Cannot raise stack limit, continuing: Invalid argument
FERMI - Error
cp: cannot stat `.in.tmp': No such file or directory
0.047u 0.086s 0:00.19 63.1%	0+0k 4488+200io 1pf+0w


-- 
Dr. Jan Oliver Oelerich
Faculty of Physics and Material Sciences Center
Philipps-Universität Marburg

Addr.: Room 02D35, Hans-Meerwein-Straße 6, 35032 Marburg, Germany
Phone: +49 6421 2822260
Mail : jan.oliver.oelerich at physik.uni-marburg.de
Web  : http://academics.oelerich.org

