[Wien] configuring parallel options using ssh

Gavin Abo gsabo at crimson.ua.edu
Tue Sep 11 04:12:32 CEST 2018


How are the machines connected?  If they are connected using the 
typical 10/100 Mb/s Ethernet [1], running parallel jobs across them is 
useless [2].  As was mentioned before, you need either 1 Gb/s Ethernet 
[3] or InfiniBand [4].
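
If you are unsure what speed a link negotiated, the following is a 
minimal Python sketch (not part of WIEN2k) that reads it on Linux from 
/sys/class/net/<interface>/speed, which reports Mb/s for Ethernet 
devices.  The function names and the interface name "eth0" are only 
assumptions for illustration, and InfiniBand may need its own tools.

```python
from pathlib import Path
from typing import Optional


def link_speed_mbps(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[int]:
    """Return the negotiated link speed in Mb/s, or None if it cannot be read."""
    speed_file = Path(sysfs_root) / iface / "speed"
    try:
        speed = int(speed_file.read_text().strip())
    except (OSError, ValueError):
        # Missing interface, virtual or loopback device, or unreadable value.
        return None
    # The kernel reports -1 when the speed is unknown (for example, link down).
    return speed if speed > 0 else None


def fast_enough(speed_mbps: Optional[int], minimum_mbps: int = 1000) -> bool:
    """True if the link meets the 1 Gb/s minimum mentioned above."""
    return speed_mbps is not None and speed_mbps >= minimum_mbps
```

Calling fast_enough(link_speed_mbps("eth0")) on each machine should 
show whether any of them is still on a 10/100 Mb/s link.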

Are the machines set up to have a common (NFS) filesystem [5,6]?
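
One way to check this is to look up which mount holds the case 
directory on each machine.  Below is a minimal Python sketch (again not 
part of WIEN2k) that does this from the text of /proc/mounts on Linux.  
The function names are hypothetical, and mount points containing spaces 
(which /proc/mounts escapes) are not handled.

```python
import os
from typing import Optional


def fs_type_for(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the mount that contains `path`."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "device mountpoint fstype ..." line
        mount_point, fs_type = fields[1], fields[2]
        # A mount contains the path if it is the path itself or a parent of it.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_nfs(fs_type: Optional[str]) -> bool:
    """True for nfs, nfs4 and similar type names."""
    return fs_type is not None and fs_type.startswith("nfs")
```

For example, is_nfs(fs_type_for(os.getcwd(), open("/proc/mounts").read())) 
run inside the case directory on every machine should give the same 
answer everywhere.  If it is False, the directory is on a local disk 
of that machine.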

The information given (the error message alone) is insufficient, so I 
doubt anyone can help.

For parallel calculations, it usually helps to provide:

a) What command was used to run the parallel calculation? For example, 
runsp -p command or qsub job.pbs?

b) If you are using SRC_mpiutil [7] or a job script [8], which one?  If 
it is not exactly one seen on a website that you can provide a link to, 
then what are the contents of your job script?

c) Did you set up the .machines file for a k-point parallel or mpi 
parallel calculation?  What are the contents of the .machines file?  For 
a job script, it can be helpful to see both the job script and the 
.machines file it created.  The high performance computing (hpc) 
clusters [9] that effectively use mpi can be quite unique.  So in many 
cases it is not possible (for us on the mailing list) to run your job 
script to reproduce the .machines file unless it is done on the 
particular computer system that you are using.

d) If you search the mailing list archive, you should find there are 
other output files that could contain information for identifying and 
resolving such an error. For example, the standard input/output file [10-13].
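
For items c) and d), here is a minimal Python sketch (not part of 
WIEN2k) of the kind of information worth attaching to a post.  It 
assumes the k-point parallel .machines layout of one "1:host" line per 
job followed by granularity and extrafine lines, as I understand it 
from the WIEN2k user guide, and that error messages end up in *.error 
files in the case directory.  The host names node1 and node2 and both 
function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def kpoint_machines(cores_per_host: Dict[str, int]) -> str:
    """Build .machines text with one k-point job ("1:host") per core."""
    lines: List[str] = []
    for host, cores in cores_per_host.items():
        lines.extend(f"1:{host}" for _ in range(cores))
    lines += ["granularity:1", "extrafine:1"]
    return "\n".join(lines) + "\n"


def nonempty_error_files(case_dir: str) -> Dict[str, str]:
    """Map each non-empty *.error file in case_dir to its contents."""
    found: Dict[str, str] = {}
    for path in sorted(Path(case_dir).glob("*.error")):
        text = path.read_text(errors="replace").strip()
        if text:
            found[path.name] = text
    return found


if __name__ == "__main__":
    # Print a two-host example and any error messages in the current directory.
    print(kpoint_machines({"node1": 2, "node2": 2}), end="")
    for name, text in nonempty_error_files(".").items():
        print(f"--- {name} ---\n{text}")
```

Comparing the printed layout with the .machines file actually used, 
and posting the contents of any non-empty *.error file, usually 
narrows the problem down much faster than the dayfile alone.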

[1] https://en.wikipedia.org/wiki/Fast_Ethernet
[2] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg13632.html
[3] https://en.wikipedia.org/wiki/Gigabit_Ethernet ; https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg14035.html
[4] https://en.wikipedia.org/wiki/InfiniBand ; https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg05595.html
[5] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg09554.html
[6] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg09229.html
[7] http://susi.theochem.tuwien.ac.at/reg_user/unsupported/
[8] http://susi.theochem.tuwien.ac.at/reg_user/faq/pbs.html
[9] https://en.wikipedia.org/wiki/Supercomputer
[10] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg13598.html
[11] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17317.html
[12] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15549.html
[13] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg16551.html

On 9/10/2018 12:56 PM, Luc Fruchter wrote:
> Dear users,
>
> I failed configuring the parallel options to run cases on several 
> machines, each of them with several CPUs, driven by ssh protocol.
>
> * Configuring the parallel options with: shared memory, MPI = 0, ssh 
> protocol, allows to run parallel jobs using several CPUs on the same 
> machine. However, a .machines file with several machines will run 
> using all required CPUs on the machine where launched (ignoring hosts).
>
> - Configuring with: no shared memory, MPI = 0, ssh protocol, will run 
> no parallel jobs, either on the same or different machines (Below is 
> the output for the error in this case).
>
> All machines communicate without problem with ssh and no password, and 
> have identical file paths.
>
> Thanks for helping
>
> ------------------------------------------------------------------
>
> >   lapw0  -p    (20:33:36) starting parallel lapw0 at Mon Sep 10 
> 20:33:36 CEST 2018
> -------- .machine0 : processors
> running lapw0 in single mode
> 6.793u 0.073s 0:06.86 100.0%    0+0k 0+5152io 0pf+0w
> >   lapw1  -p        (20:33:43) starting parallel lapw1 at Mon Sep 10 
> 20:33:43 CEST 2018
> ->  starting parallel LAPW1 jobs at Mon Sep 10 20:33:43 CEST 2018
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
>      localhost(48)    Summary of lapw1para:
>    localhost     k=48     user=0     wallclock=0
> 0.112u 0.158s 0:02.28 11.4%    0+0k 0+224io 0pf+0w
> >   lapw2 -p         (20:33:45) running LAPW2 in parallel mode
> **  LAPW2 crashed!
> 0.085u 0.062s 0:00.13 107.6%    0+0k 0+872io 0pf+0w
> error: command   /root/Documents/WIEN2KROOT/lapw2para lapw2.def failed
>
> >   stop error

