[Wien] Parallelization and PBS on a single computer

Gavin Abo gsabo at crimson.ua.edu
Thu Jun 29 14:20:43 CEST 2017


/var/spool/torque/mom_priv/jobs/44.milkbar-computer.kage.SC: line 12: 
run_lapw: command not found

Perhaps the environment variables need to be pushed out to all nodes. You 
might try adding the line #PBS -V [1,2] to your job submission script.

[1] http://www.nics.tennessee.edu/node/387
[2] http://www.open-mpi.org/community/lists/users/2008/10/6982.php
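
As a rough sketch only (not tested on your machine), the serial job script
with that line added might look something like the following. It is written
in plain POSIX sh for illustration rather than tcsh, and the check_cmd helper
is a hypothetical name I made up to print a clearer message when run_lapw
cannot be found on the PATH inside the job. The queue, resource and run_lapw
lines are copied from your script.

```shell
#!/bin/sh
#PBS -V
#PBS -q batch
#PBS -l nodes=1:ppn=20
#PBS -l walltime=1:00:00
#PBS -o wien2k_output
#PBS -j oe
#PBS -N wien2k_test
# "#PBS -V" above asks qsub to export the submitting shell's environment
# (including PATH) to the batch job.

# Hypothetical helper (not part of WIEN2k or Torque): report whether a
# command can be resolved in the environment the job actually sees.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (PATH=$PATH)"
        return 1
    fi
}

# PBS sets PBS_O_WORKDIR to the directory qsub was called from; fall back
# to the current directory when the script is run outside PBS.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Only start the SCF cycle if run_lapw is visible to the job.
if check_cmd run_lapw; then
    run_lapw -i 40 -ec .0001 -I
fi
```

If the output file then shows "missing: run_lapw" together with a PATH that
lacks your WIEN2k directory, the environment is still not reaching the job.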

On 6/28/2017 11:49 PM, Yoji Kobayashi wrote:
> Dear Users,
>
> I have some questions/problems regarding parallelization and PBS.
> I’m not sure if I’m really running parallel vs. serial, and my PBS 
> script isn’t working.
>
> ===
> My system info:
> Intel Xeon CPU E5-2630 v2 @2.6 GHz, 24 CPUS
> Memory: 32GB
> Running Wien2k_13, on Ubuntu 14.04.03
> File system: ext4
> (This is considered a single node with 24 processors?)
> ===
> My first question is, am I really running a parallel calculation in a 
> meaningful way?
>
> What I try:
> In w2web, a serial calculation (SCF only) for the TiC example (500 
> k-points) takes about 25 sec. to converge.
> I do the same calculation (starting with a new case) but setting 
> parallelization in w2web, with slightly different .machine files for 
> each case:
>
> Case 1:
> 1:localhost
>
> Case 2 (i.e. 20 lines of the following):
> 1:localhost
> 1:localhost
> 1:localhost
> 1:localhost
>
> Case 3
> 1:localhost:20
>
> (no lines referring to granularity, etc for now)
>
> What I get:
> Case 1 computes in about 54 sec;
> Case 2 computes in 1min23 sec.;
> Case 3 gives an error when running lapw2; see the dayfile below:
> -----
> Calculating YK-016-TiC in /home/milkbar/Yoji/YK-016-TiC
> on milkbar-computer with PID 18077
> using WIEN2k_13.1 (Release 17/6/2013) in /home/milkbar/WIEN2k_13
>
>
>      start 	(2017年  6月 29日 木曜日 14:23:39 JST) with lapw0 (40/99 to go)
>
>      cycle 1 	(2017年  6月 29日 木曜日 14:23:39 JST) 	(40/99 to go)
>
> >   lapw0 -p	(14:23:39) starting parallel lapw0 at 2017年  6月 29日 木曜日 14:23:39 JST
> -------- .machine0 : processors
> running lapw0 in single mode
> 1.7u 0.0s 0:01.84 98.3% 0+0k 16+440io 0pf+0w
> >   lapw1  -p    	(14:23:41) starting parallel lapw1 at 2017年  6月 29日 木曜日 14:23:41 JST
> ->  starting parallel LAPW1 jobs at 2017年  6月 29日 木曜日 14:23:41 JST
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
>       localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost(20) 20 total processes failed to start
> 0.0u 0.0s 0:00.20 10.0% 0+0k 8080+8io 23pf+0w
>     Summary of lapw1para:
>     localhost	 k=0	 user=0	 wallclock=0
> 0.0u 0.0s 0:02.10 0.9% 0+0k 8208+216io 24pf+0w
> >   lapw2 -p     	(14:23:43) running LAPW2 in parallel mode
> **  LAPW2 crashed!
> 0.0u 0.0s 0:00.07 28.5% 0+0k 32+104io 0pf+0w
> error: command   /home/milkbar/WIEN2k_13/lapw2para lapw2.def   failed
>
> >   stop error
> ------
> Is my “serial” calculation actually processed over 24 CPUs already, so this is why it is faster than Case 2? Or am I doing something wrong? Why does Case 3 crash?
>
> ====
> My second question is about PBS.
> I installed torque PBS, and created a queue:
>
> # create default queue
>  qmgr -c 'create queue batch'
>  qmgr -c 'set queue batch queue_type = execution'
>  qmgr -c 'set queue batch started = true'
>  qmgr -c 'set queue batch enabled = true'
>  qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
>  qmgr -c 'set queue batch resources_default.nodes = 1'
>  qmgr -c 'set server default_queue = batch'
>
> and followed other instructions on
> https://jabriffa.wordpress.com/2015/02/11/installing-torquepbs-job-scheduler-on-ubuntu-14-04-lts/
>
> The PBS system seems to work since I can submit very simple scripts 
> and see them on qstat. My problem is that when I try to submit a 
> serial wien2k job via PBS, it gives me an error (ultimately of course 
> I’d like to submit them as parallel, but because of the ambiguity 
> above I’ve kept it to serial). Here's the PBS script and error message:
>
>  #!/bin/tcsh
>  ##PBS -A your_allocation
>  # specify the allocation. Change it to your allocation
>  #PBS -q batch
>  #PBS -l nodes=1:ppn=20
>  #PBS -l walltime=1:00:00
>  #PBS -o wien2k_output
>  #PBS -j oe
>  #PBS -N wien2k_test
>  cd $PBS_O_WORKDIR
>  echo hello
>  run_lapw -i 40 -ec .0001 -I
>
> Error message (contents of wien2k_output):
> hello
> /var/spool/torque/mom_priv/jobs/44.milkbar-computer.kage.SC: line 12: 
> run_lapw: command not found
>
> The job is listed as complete in qstat, and the “hello” is written 
> into the wien2k_output file. Changing the cd $PBS_O_WORKDIR to the path 
> for the current case hasn’t changed anything. I can run run_lapw from 
> the command line fine, though. Also, what do I write for allocation? 
> (I commented it out, as I see other PBS scripts don’t always have this.)
>
> I’ve also tried the parallel case, with the following PBS script. I 
> set up the .structure file and do the initialization with w2web. I 
> leave the “parallel calculation” option unchecked when setting up the 
> case file in w2web.
>
>  #!/bin/tcsh
>  ##PBS -A your_allocation
>  #PBS -q batch
>  #PBS -l nodes=1:ppn=20
>  #PBS -l walltime=1:00:00
>  #
>  #PBS -o wien2k_output
>  #PBS -j oe
>  #PBS -N wien2k_test
>  cd $PBS_O_WORKDIR
>  #
>  #cat $PBS_NODEFILE |cut -c1-6 >.machines_currentdd
>  #set aa=`wc .machines_current`
>  #echo '#' > .machines
>  #
>  ##example for k-point parallel lapw1/2
>  set i=1
> while ($i <= $aa[1] )
> echo -n '1:' >>.machines
> head -$i .machines_current |tail -1 >> .machines
> @ i ++
>  end
> echo 'granularity:1' >>.machines
> echo 'extrafine:1' >>.machines
>  #
>  #define here your Wien2k command
>  run_lapw -p -i 40 -ec .0001 -I
>
> When I submit this job via qsub, again the job is immediately listed 
> as complete in qstat, and I get the following error message in 
> wien2k_output:
>
> milkbar@milkbar-computer:~/Yoji/YK-017-TiC$ cat wien2k_output
> /var/spool/torque/mom_priv/jobs/45.milkbar-computer.kage.SC: line 28: 
> syntax error: unexpected end of file
>
> No .machines file has been created in the case folder.
> How can I successfully submit serial/parallel PBS jobs? Thanks in 
> advance for your help.
>
> Yoji Kobayashi
>
> ==========================================================
> Yoji Kobayashi, Junior Assoc. Prof. yojik at scl.kyoto-u.ac.jp 
> http://www.scl.kyoto-u.ac.jp/~yojik/index.htm 
>
> Kageyama Group, Dept. of Energy and Hydrocarbon Chemistry
> Graduate School of Engineering, Kyoto University
> Nishikyo-ku, Kyoto 615-8510, Japan
>
> Tel.: +81-75-383-2509     Fax: +81-75-383-2510
> http://www.ehcc.kyoto-u.ac.jp/eh10/kageyama.html
> ==========================================================