[Wien] Parallelization and PBS on a single computer
Gavin Abo
gsabo at crimson.ua.edu
Thu Jun 29 14:20:43 CEST 2017
/var/spool/torque/mom_priv/jobs/44.milkbar-computer.kage.SC: line 12:
run_lapw: command not found
Perhaps the environment variables need to be pushed out to all nodes; you
might try adding the line #PBS -V [1,2] to your job submission script.
[1] http://www.nics.tennessee.edu/node/387
[2] http://www.open-mpi.org/community/lists/users/2008/10/6982.php
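As a rough, untested sketch of that suggestion: the script below reuses the PBS directives from your serial job and adds #PBS -V, which asks qsub to export the submitting shell's environment to the job. It is written for /bin/sh rather than tcsh only to keep the check portable, and report_cmd is a hypothetical helper of mine (not part of WIEN2k or Torque) that prints whether the job can see run_lapw on its PATH.

```shell
#!/bin/sh
#PBS -q batch
#PBS -l nodes=1:ppn=20
#PBS -l walltime=1:00:00
#PBS -o wien2k_output
#PBS -j oe
#PBS -N wien2k_test
#PBS -V
# -V asks qsub to export the submitting shell's environment to the job.

# report_cmd is a hypothetical helper (an assumption, not a WIEN2k or
# Torque command): it prints whether a command is found on the PATH.
report_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# Fall back to the current directory when not running under PBS.
cd "${PBS_O_WORKDIR:-.}" || exit 1
report_cmd run_lapw

# If run_lapw is reported as found, the actual run would follow, e.g.:
# run_lapw -i 40 -ec .0001 -I
```

If wien2k_output then shows "missing: run_lapw", the WIEN2k directory is still not on the job's PATH, and it would probably have to be set explicitly inside the script.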
On 6/28/2017 11:49 PM, Yoji Kobayashi wrote:
> Dear Users,
>
> I have some questions/problems regarding parallelization and PBS.
> I’m not sure if I’m really running parallel vs. serial, and my PBS
> script isn’t working.
>
> ===
> My system info:
> Intel Xeon CPU E5-2630 v2 @2.6 GHz, 24 CPUS
> Memory: 32GB
> Running Wien2k_13, on Ubuntu 14.04.03
> File system: ext4
> (This is considered a single node with 24 processors?)
> ===
> My first question is, am I really running a parallel calculation in a
> meaningful way?
>
> What I try:
> In w2web, a serial calculation (SCF only) for the TiC example (500 k
> points) takes about 25 sec. to converge.
> I do the same calculation (starting with a new case) but setting
> parallelization in w2web, with slightly different .machine files for
> each case:
>
> Case 1:
> 1:localhost
>
> Case 2 (i.e. 20 lines of below):
> 1:localhost
> 1:localhost
> …
> 1:localhost
> 1:localhost
>
> Case 3
> 1:localhost:20
>
> (no lines referring to granularity, etc for now)
>
> What I get:
> Case 1 computes in about 54 sec;
> Case 2 computes in 1min23 sec.;
> Case 3 gives an error when running lapw2; see the dayfile below:
> -----
> Calculating YK-016-TiC in /home/milkbar/Yoji/YK-016-TiC
> on milkbar-computer with PID 18077
> using WIEN2k_13.1 (Release 17/6/2013) in /home/milkbar/WIEN2k_13
>
>
> start (2017年 6月 29日 木曜日 14:23:39 JST) with lapw0 (40/99 to go)
>
> cycle 1 (2017年 6月 29日 木曜日 14:23:39 JST) (40/99 to go)
>
> > lapw0 -p (14:23:39) starting parallel lapw0 at 2017年 6月 29日 木曜日 14:23:39 JST
> -------- .machine0 : processors
> running lapw0 in single mode
> 1.7u 0.0s 0:01.84 98.3% 0+0k 16+440io 0pf+0w
> > lapw1 -p (14:23:41) starting parallel lapw1 at 2017年 6月 29日 木曜日 14:23:41 JST
> -> starting parallel LAPW1 jobs at 2017年 6月 29日 木曜日 14:23:41 JST
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost(20) 20 total processes failed to start
> 0.0u 0.0s 0:00.20 10.0% 0+0k 8080+8io 23pf+0w
> Summary of lapw1para:
> localhost k=0 user=0 wallclock=0
> 0.0u 0.0s 0:02.10 0.9% 0+0k 8208+216io 24pf+0w
> > lapw2 -p (14:23:43) running LAPW2 in parallel mode
> ** LAPW2 crashed!
> 0.0u 0.0s 0:00.07 28.5% 0+0k 32+104io 0pf+0w
> error: command /home/milkbar/WIEN2k_13/lapw2para lapw2.def failed
>
> > stop error
> ------
> Is my “serial” calculation actually already being processed over 24 CPUs,
> and is that why it is faster than Case 2? Or am I doing something wrong?
> Why does Case 3 crash?
>
> ====
> My second question is about PBS.
> I installed torque PBS, and created a queue:
>
> # create default queue
> qmgr -c 'create queue batch'
> qmgr -c 'set queue batch queue_type = execution'
> qmgr -c 'set queue batch started = true'
> qmgr -c 'set queue batch enabled = true'
> qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
> qmgr -c 'set queue batch resources_default.nodes = 1'
> qmgr -c 'set server default_queue = batch'
>
> and followed other instructions on
> https://jabriffa.wordpress.com/2015/02/11/installing-torquepbs-job-scheduler-on-ubuntu-14-04-lts/
>
> The PBS system seems to work since I can submit very simple scripts
> and see them on qstat. My problem is that when I try to submit a
> serial wien2k job via PBS, it gives me an error (ultimately of course
> I’d like to submit them as parallel, but because of the ambiguity
> above I’ve kept it to serial). Here's the PBS script and error message:
>
> #!/bin/tcsh
> ##PBS -A your_allocation
> # specify the allocation. Change it to your allocation
> #PBS -q batch
> #PBS -l nodes=1:ppn=20
> #PBS -l walltime=1:00:00
> #PBS -o wien2k_output
> #PBS -j oe
> #PBS -N wien2k_test
> cd $PBS_O_WORKDIR
> echo hello
> run_lapw -i 40 -ec .0001 -I
>
> Error message (contents of wien2k_output):
> hello
> /var/spool/torque/mom_priv/jobs/44.milkbar-computer.kage.SC: line 12:
> run_lapw: command not found
>
> The job is listed as complete in qstat, and the “hello” is written
> into the wien2k_output file. Changing the cd $PBS_O_WORKDIR to the path
> for the current case hasn’t changed anything. I can run run_lapw from
> the command line fine, though. Also, what do I write for allocation?
> (I commented it out, as I see other PBS scripts don’t always have this.)
>
> I’ve also tried the parallel case, with the following PBS script. I
> set up the .structure file and do the initialization with w2web. I
> leave the “parallel calculation” option unchecked when setting up the
> case file in w2web.
>
> #!/bin/tcsh
> ##PBS -A your_allocation
> #PBS -q batch
> #PBS -l nodes=1:ppn=20
> #PBS -l walltime=1:00:00
> #
> #PBS -o wien2k_output
> #PBS -j oe
> #PBS -N wien2k_test
> cd $PBS_O_WORKDIR
> #
> #cat $PBS_NODEFILE |cut -c1-6 >.machines_currentdd
> #set aa=`wc .machines_current`
> #echo '#' > .machines
> #
> ##example for k-point parallel lapw1/2
> set i=1
> while ($i <= $aa[1] )
> echo -n '1:' >>.machines
> head -$i .machines_current |tail -1 >> .machines
> @ i ++
> end
> echo 'granularity:1' >>.machines
> echo 'extrafine:1' >>.machines
> #
> #define here your Wien2k command
> run_lapw -p -i 40 -ec .0001 -I
>
> When I submit this job via qsub, again the job is immediately listed
> as complete in qstat, and I get the following error message in
> wien2k_output:
>
> milkbar@milkbar-computer:~/Yoji/YK-017-TiC$ cat wien2k_output
> /var/spool/torque/mom_priv/jobs/45.milkbar-computer.kage.SC: line 28:
> syntax error: unexpected end of file
>
> No .machines file has been created in the case folder.
> How can I successfully submit serial/parallel PBS jobs? Thanks in
> advance for your help.
>
> Yoji Kobayashi
>
> ==========================================================
> Yoji Kobayashi, Junior Assoc. Prof. yojik at scl.kyoto-u.ac.jp
> http://www.scl.kyoto-u.ac.jp/~yojik/index.htm
>
> Kageyama Group, Dept. of Energy and Hydrocarbon Chemistry
> Graduate School of Engineering, Kyoto University
> Nishikyo-ku, Kyoto 615-8510, Japan
>
> Tel.: +81-75-383-2509 Fax: +81-75-383-2510
> http://www.ehcc.kyoto-u.ac.jp/eh10/kageyama.html
> ==========================================================