[Wien] Issues with parallel runs

Matthew Redell mredell1 at binghamton.edu
Fri Jan 11 12:27:59 CET 2019


Thank you! The ssh was, in fact, the issue. We have three workstations linked as a “cluster” via NFS mounts, and our system is set up to use modules that load certain elements into the PATH, so the ssh sessions were losing the loaded modules. Everything seems to be working OK now.
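
In case it helps anyone who hits the same thing: since ssh starts a fresh shell that does not inherit the interactive environment, one fix (the module name below is only an example for a setup like ours) is to load the modules from ~/.bashrc on every node, so that non-interactive ssh shells also get mpirun in their PATH:

# in ~/.bashrc on each workstation, above any interactive-only section
# (module name is illustrative; use whatever provides mpirun on your system)
module load intel-mpi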

Matt

> On Jan 10, 2019, at 11:10 PM, Gavin Abo <gsabo at crimson.ua.edu> wrote:
> 
> As Prof. Marks hinted, it looks to me like mpirun is properly set in the PATH of your current terminal, such that mpirun works there.
> 
> However, when you run "run_lapw -p", that script probably executes the lapw1para_lapw script, which then does an ssh into the nodes you have set in your .machines file.  It is on those nodes that the mpirun command probably cannot be found.
> 
> You can likely test if that is the case in the terminal by trying commands like:
> 
> ssh localhost
> which mpirun
> exit
> where localhost above should be replaced by the hostname (or IP address) of one of the local (e.g., https://en.wikipedia.org/wiki/Localhost ) or remote nodes used in your hand-edited .machines file.  Or, if your .machines file is created automatically on the fly by your job script [ http://susi.theochem.tuwien.ac.at/reg_user/faq/pbs.html ], which is usually the case on the clusters needed for and used in mpi parallel calculations, you should be able to find the hostnames it tried to use in the .machines file left behind by the failed calculation.
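> 
> Note that an interactive "ssh localhost" gives you a login shell, while the WIEN2k parallel scripts run their commands through ssh non-interactively; with bash those two read different startup files.  So a test closer to what the scripts actually do would be (hostname again a placeholder):
> 
> ssh localhost 'which mpirun'
> 
> If that prints nothing while the interactive test works, the PATH is probably only being set in your login files.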
> 
> If you add the directory containing mpirun to the PATH in your .bashrc (or .cshrc) [e.g., https://www.open-mpi.org/faq/?category=running#adding-ompi-to-path ] and your system pushes that out to all nodes, that might resolve the problem.  If you are using a job script, then depending on your queuing system, you might have to add an option that pushes the environment to the nodes [ https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15985.html ].
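> 
> As a minimal sketch in bash, using the Intel MPI path from your "which mpirun" output (adjust it to your installation):
> 
> # in ~/.bashrc on every node
> export PATH=/opt/intel/compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin:$PATH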
> 
> If your system doesn't access remote nodes with ssh, see section "11.4 Environment Variables" in the WIEN2k 18.2 usersguide [ http://susi.theochem.tuwien.ac.at/reg_user/textbooks/usersguide.pdf ] about setting "USE_REMOTE 0" in parallel_options so that the parallel scripts do not use ssh.
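> 
> For example, the relevant lines in $WIENROOT/parallel_options (csh syntax; only a sketch, check your own file) would look something like:
> 
> setenv USE_REMOTE 0
> setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"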
> 
> Depending on the hardware specifications of your workstation, keep in mind, as mentioned on this list before, that if it is a general-purpose computer and not a high-performance computing (HPC) cluster [ https://en.wikipedia.org/wiki/Supercomputer ], k-point parallelization might work better than mpi parallelization for certain computer systems (or calculation cases):
> 
> http://susi.theochem.tuwien.ac.at/reg_user/benchmark/
> https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg03793.html
> https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg08301.html
> https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg13632.html
> https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg09334.html
> 
>> On 1/10/2019 12:44 PM, Laurence Marks wrote:
>> Most probably you forgot to export your PATH (e.g., from bash do "export PATH"), so the information is not making it beyond your shell. You might also have a bad csh/tcsh. Try adding "which lapw1_mpi" to $WIENROOT/parallel_options, and check that that file has the correct setenv for WIEN_MPIRUN.
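>> 
>> That is, something like this as a temporary debugging line (remove it again afterwards):
>> 
>> # added near the top of $WIENROOT/parallel_options
>> which lapw1_mpi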
>> 
>>> On Thu, Jan 10, 2019 at 12:00 PM Matthew D Redell <mredell1 at binghamton.edu> wrote:
>>> Hello,
>>> 
>>> I am running WIEN2k_2018.2 on CentOS 7 and have come across the following problem that I cannot seem to resolve.
>>> 
>>> After successfully initializing the calculation and setting up the .machines file for a single host (the local workstation), I run: run_lapw -p
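>>> 
>>> For reference, the .machines follows the single-host mpi pattern from the usersguide, something like this (4 mpi processes as an example):
>>> 
>>> granularity:1
>>> 1:localhost:4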
>>> 
>>> lapw0 ends fine, but lapw1 returns
>>> bash: mpirun: command not found
>>> 
>>> The same error occurs if I just try 
>>> x lapw1 -p
>>> 
>>> However,
>>> which mpirun
>>> returns
>>> /opt/intel/compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin/mpirun
>>> 
>>> I also did a little troubleshooting to see if I could run lapw1 in parallel via
>>> mpirun -n 4 lapw1_mpi lapw1_1.def
>>> 
>>> which ran without any issues. Also, checking more…
>>> grep MPIRUN $WIENROOT/WIEN2k_OPTIONS
>>> returns
>>> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_
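>>> 
>>> If I read that template right, the parallel scripts substitute the placeholders at run time, so what should ultimately get launched is something like
>>> 
>>> mpirun -np 4 -machinefile .machine1 lapw1_mpi lapw1_1.def
>>> 
>>> which is essentially the command that works for me by hand.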
>>> 
>>> So, the only explanation I can deduce is that either the run_lapw script or the lapw1para script is not locating the mpirun command, but I do not know how to begin sorting out this issue. Any help would be greatly appreciated.
>>> 
>>> Best,
>>> Matt
>>> 
>>> ------------------------
>>> Matthew D Redell
>>> Graduate Student/Teaching Assistant
>>> Department of Physics, Applied Physics, and Astronomy
>>> Binghamton University-State University of New York
>>> E-mail: mredell1 at binghamton.edu
>>> Office: SN-2011D
>>> 
>> 
>> -- 
>> Professor Laurence Marks
>> "Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Gyorgi
>> www.numis.northwestern.edu ; Corrosion in 4D: MURI4D.numis.northwestern.edu
>> Partner of the CFW 100% program for gender equity, www.cfw.org/100-percent
>> Co-Editor, Acta Cryst A
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:  http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html