[Wien] Parallelization and PBS on a single computer

Yoji Kobayashi yojik at scl.kyoto-u.ac.jp
Fri Jun 30 09:48:30 CEST 2017


Dear Peter and Gavin,

Thank you for your help. Of course I went over the UG, but your explanations cleared things up. I will eventually be doing supercell calculations on TiH and Ti surfaces, so I will look into the MPI errors in detail then. PBS works fine now, with the #PBS -V line too. Many thanks again.

Yoji


> On Jun 29, 2017, at 14:49, Yoji Kobayashi <yojik at scl.kyoto-u.ac.jp> wrote:
> 
> Dear Users,
> 
> I have some questions/problems regarding parallelization and PBS.
> I’m not sure if I’m really running parallel vs. serial, and my PBS script isn’t working.
> 
> ===
> My system info:
> Intel Xeon CPU E5-2630 v2 @2.6 GHz, 24 CPUS
> Memory: 32GB
> Running Wien2k_13, on Ubuntu 14.04.03
> File system: ext4
> (This is considered a single node with 24 processors?)
> ===
> My first question is, am I really running a parallel calculation in a meaningful way?
> 
> What I try:
> In w2web, a serial calculation (SCF only) for the TiC example (500 k-points) takes about 25 sec. to converge.
> I do the same calculation (starting with a new case) but with parallelization enabled in w2web, using a slightly different .machines file for each case:
> 
> Case 1:
> 1:localhost
> 
> Case 2 (i.e. 20 lines of below):
> 1:localhost
> 1:localhost
> 1:localhost
> 1:localhost
> 
> Case 3
> 1:localhost:20
> 
> (no lines referring to granularity, etc for now)
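
For reference, the Case 2 file (20 identical "1:localhost" lines) does not have to be typed by hand. Below is a minimal POSIX sh sketch, assuming a single machine addressed as localhost and plain k-point parallelization. The name make_machines is a hypothetical helper, not a WIEN2k command.

```shell
# make_machines N [FILE]
# Hypothetical helper (not part of WIEN2k): writes N lines of "1:localhost"
# to FILE, which defaults to .machines in the current directory.
make_machines() {
    n="$1"                  # number of k-point parallel jobs
    out="${2:-.machines}"   # output file
    : > "$out"              # create the file or truncate an existing one
    i=1
    while [ "$i" -le "$n" ]; do
        echo "1:localhost" >> "$out"
        i=$((i + 1))
    done
}
```

Calling "make_machines 20" in the case directory would write the 20-line file used in Case 2. Case 3 ("1:localhost:20") is a different request: it asks for one job spread over 20 processes, which is the MPI route mentioned in the reply at the top.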
> 
> What I get:
> Case 1 computes in about 54 sec;
> Case 2 computes in 1min23 sec.;
> Case 3 gives an error in running lapw2, see the dayfile below:
> -----
> Calculating YK-016-TiC in /home/milkbar/Yoji/YK-016-TiC
> on milkbar-computer with PID 18077
> using WIEN2k_13.1 (Release 17/6/2013) in /home/milkbar/WIEN2k_13
> 
> 
>     start 	(Thu Jun 29 14:23:39 JST 2017) with lapw0 (40/99 to go)
> 
>     cycle 1 	(Thu Jun 29 14:23:39 JST 2017) 	(40/99 to go)
> 
> >   lapw0 -p	(14:23:39) starting parallel lapw0 at Thu Jun 29 14:23:39 JST 2017
> -------- .machine0 : processors
> running lapw0 in single mode
> 1.7u 0.0s 0:01.84 98.3% 0+0k 16+440io 0pf+0w
> >   lapw1  -p    	(14:23:41) starting parallel lapw1 at Thu Jun 29 14:23:41 JST 2017
> ->  starting parallel LAPW1 jobs at Thu Jun 29 14:23:41 JST 2017
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
>      localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost(20) 20 total processes failed to start
> 0.0u 0.0s 0:00.20 10.0% 0+0k 8080+8io 23pf+0w
>    Summary of lapw1para:
>    localhost	 k=0	 user=0	 wallclock=0
> 0.0u 0.0s 0:02.10 0.9% 0+0k 8208+216io 24pf+0w
> >   lapw2 -p     	(14:23:43) running LAPW2 in parallel mode
> **  LAPW2 crashed!
> 0.0u 0.0s 0:00.07 28.5% 0+0k 32+104io 0pf+0w
> error: command   /home/milkbar/WIEN2k_13/lapw2para lapw2.def   failed
> 
> >   stop error
> ------
> Is my “serial” calculation actually processed over 24 CPUs already, so this is why it is faster than Case 2? Or am I doing something wrong? Why does Case 3 crash? 
> 
> ====
> My second question is about PBS.
> I installed torque PBS, and created a queue:
> 
> # create default queue
>  qmgr -c 'create queue batch'
>  qmgr -c 'set queue batch queue_type = execution'
>  qmgr -c 'set queue batch started = true'
>  qmgr -c 'set queue batch enabled = true'
>  qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
>  qmgr -c 'set queue batch resources_default.nodes = 1'
>  qmgr -c 'set server default_queue = batch'
> 
> and followed other instructions on
> https://jabriffa.wordpress.com/2015/02/11/installing-torquepbs-job-scheduler-on-ubuntu-14-04-lts/
> 
> The PBS system seems to work, since I can submit very simple scripts and see them in qstat. My problem is that when I try to submit a serial WIEN2k job via PBS, it gives me an error (ultimately, of course, I'd like to submit parallel jobs, but because of the ambiguity above I've kept it serial for now). Here are the PBS script and the error message:
> 
>  #!/bin/tcsh
>  ##PBS -A your_allocation
>  # specify the allocation. Change it to your allocation
>  #PBS -q batch
>  #PBS -l nodes=1:ppn=20
>  #PBS -l walltime=1:00:00
>  #PBS -o wien2k_output
>  #PBS -j oe
>  #PBS -N wien2k_test
>  cd $PBS_O_WORKDIR
>  echo hello
>  run_lapw -i 40 -ec .0001 -I
> 
> Error message (contents of wien2k_output):
> hello
> /var/spool/torque/mom_priv/jobs/44.milkbar-computer.kage.SC: line 12: run_lapw: command not found
> 
> The job is listed as complete in qstat, and the “hello” is written into the wien2k_output file. Changing the cd $PBS_O_WORKDIR to the path for the current case hasn’t changed anything. I can run run_lapw from the command line fine, though. Also, what do I write for allocation? (I commented it out, as I see other PBS scripts don’t always have this.)
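
A "command not found" inside the job while run_lapw works at the prompt suggests that the job's PATH differs from the interactive one. The reply at the top reports that adding "#PBS -V", which exports the submitting shell's environment to the job, resolved this. A small POSIX sh sketch for checking this from inside a job follows. The name require_cmd is a hypothetical helper, not part of WIEN2k or Torque.

```shell
# require_cmd NAME
# Hypothetical helper: reports whether NAME can be found on the current PATH.
# Prints "found: NAME" and returns 0, or prints a diagnostic with the PATH
# on stderr and returns 1.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (PATH=$PATH)" >&2
        return 1
    fi
}
```

Placed before the run_lapw line in an sh job script, "require_cmd run_lapw || exit 1" would write the PATH the job actually sees to the output file and stop early, instead of failing at the WIEN2k command itself.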
> 
> I’ve also tried the parallel case, with the following PBS script. I set up the .struct file and do the initialization with w2web. I leave the “parallel calculation” option unchecked when setting up the case in w2web.
> 
>  #!/bin/tcsh
>  ##PBS -A your_allocation
>  #PBS -q batch
>  #PBS -l nodes=1:ppn=20
>  #PBS -l walltime=1:00:00
>  #
>  #PBS -o wien2k_output
>  #PBS -j oe
>  #PBS -N wien2k_test
>  cd $PBS_O_WORKDIR
>  #
>  #cat $PBS_NODEFILE |cut -c1-6 >.machines_currentdd
>  #set aa=`wc .machines_current`
>  #echo '#' > .machines
>  #
>  ##example for k-point parallel lapw1/2
>  set i=1
>  	while ($i <= $aa[1] )
>  	echo -n '1:' >>.machines
>  	head -$i .machines_current |tail -1 >> .machines
>  	@ i ++
>  end
> echo 'granularity:1' >>.machines
> echo 'extrafine:1' >>.machines
>  #
>  #define here your Wien2k command
>  run_lapw -p -i 40 -ec .0001 -I
> 
> When I submit this job via qsub, again the job is immediately listed as complete in qstat, and I get the following error message in wien2k_output:
> 
> milkbar@milkbar-computer:~/Yoji/YK-017-TiC$ cat wien2k_output
> /var/spool/torque/mom_priv/jobs/45.milkbar-computer.kage.SC: line 28: syntax error: unexpected end of file
> 
> No .machines file has been created in the case folder. 
>  How can I successfully submit serial/parallel PBS jobs? Thanks in advance for your help.
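
In the script as quoted, the three lines that create .machines_current, set aa and start the .machines file are commented out (and the first one writes to ".machines_currentdd"), so $aa is undefined when the while loop is reached. That may be one reason no .machines file appears. The same steps can be sketched in POSIX sh as below, assuming the node file contains one hostname per line. The name build_machines is a hypothetical helper, and the sketch skips the "cut -c1-6" hostname truncation used in the original.

```shell
# build_machines NODEFILE [FILE]
# Hypothetical helper: writes a k-point parallel .machines file with one
# "1:<host>" line per entry in NODEFILE, followed by the granularity and
# extrafine lines used in the script above.
build_machines() {
    nodefile="$1"            # e.g. "$PBS_NODEFILE" inside a Torque job
    out="${2:-.machines}"    # output file
    echo '#' > "$out"        # start a fresh file with a comment line
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r host || [ -n "$host" ]; do
        if [ -n "$host" ]; then
            echo "1:$host" >> "$out"
        fi
    done < "$nodefile"
    echo 'granularity:1' >> "$out"
    echo 'extrafine:1' >> "$out"
}
```

Inside a job this would be called as build_machines "$PBS_NODEFILE" after the cd to $PBS_O_WORKDIR and before "run_lapw -p". With nodes=1:ppn=20 the node file would be expected to list the host 20 times, giving 20 k-point parallel jobs.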
> 
> Yoji Kobayashi
> 
> ==========================================================
> Yoji Kobayashi, Junior Assoc. Prof.       yojik at scl.kyoto-u.ac.jp
> http://www.scl.kyoto-u.ac.jp/~yojik/index.htm
> 
> Kageyama Group, Dept. of Energy and Hydrocarbon Chemistry
> Graduate School of Engineering, Kyoto University
> Nishikyo-ku, Kyoto 615-8510, Japan
> 
> Tel.: +81-75-383-2509     Fax: +81-75-383-2510
> http://www.ehcc.kyoto-u.ac.jp/eh10/kageyama.html
> ==========================================================
> 

==========================================================
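Yoji Kobayashi  yojik at scl.kyoto-u.ac.jp
http://www.scl.kyoto-u.ac.jp/~yojik/index.htm

Kyoto University Katsura, Nishikyo-ku, Kyoto 615-8510, Japan
Dept. of Energy and Hydrocarbon Chemistry, Graduate School of Engineering, Kyoto University
Kageyama Group, Junior Assoc. Prof.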

Tel.: 075-383-2509    Fax: 075-383-2510
http://www.ehcc.kyoto-u.ac.jp/eh10/kageyama.html
==========================================================
