[Wien] Problem Running Parallel Job
Peter Blaha
pblaha at theochem.tuwien.ac.at
Sat Nov 12 22:46:44 CET 2011
Let's start with your script:
It does NOT contain a line that tells PBS how many cores you want to use.
Perhaps there is some default; at least from the dayfile I can see that you got
16 cores on a node called oliver1 (I hope oliver1 actually has at least 16 cores!!)
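For example, a resource request line such as the following would make this explicit
(the exact syntax and the nodes/ppn values depend on your PBS/Torque setup; this is
only an illustration):

  #PBS -l nodes=1:ppn=16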
The script you used creates a .machines file (check it out and
compare it to the UG to understand its meaning), which runs
lapw0 in mpi-parallel mode, but lapw1 in k-point parallel mode.
According to your dayfile, it does what you wanted.
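For 16 slots on oliver1, the .machines file generated by your script should look
roughly like this (shortened here; the lapw0: line actually lists the host 16 times,
and there are 16 "1:" lines, one k-point job per line):

  #
  lapw0:oliver1 oliver1 ... oliver1
  1:oliver1
  1:oliver1
  ...
  granularity:1
  extrafine:1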
However, it runs very badly!! and uses only 25-30% of the CPU time.
Either oliver1 is totally overloaded, or it is a poor AMD machine,
or the NFS filesystem is terribly slow.
How many atoms/cell do you have? I guess only 1-2?
You have more than 6000 k-points (16*380), which is really a lot!!??
Do you think this makes for a meaningful calculation? Maybe, but ....
In any case, in principle you would need only a few CPU seconds, but your system
blocks you for much longer, so such a parallelization does not make sense.
You need to come close to 100% in the timing lines.
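For example, the first lapw1 timing line above, "26.000u 3.270s 1:29.71 32.63%",
means 26.0 s of user CPU time plus 3.3 s of system time within 89.7 s of wall-clock
time, i.e. (26.0 + 3.3) / 89.7 = roughly 33% of one core. On a healthy node with a
fast (local) filesystem this ratio should be close to 100%.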
PS: For such a small case, MPI does not make sense anyway.
On 12.11.2011 21:04, Chinedu Ekuma wrote:
> Dear Dr. Laurence,
> MPI is compiled correctly, since other software uses it. Let me know what specific information you need to help and I will send it to you. Below is the output from
> the case.dayfile.
>
> using WIEN2k_11.1 (Release 14/6/2011) in /home/packages/wien2k/11.1/intel-11.1-mvapich-1.1/src
>
>
> start (Thu Nov 10 15:57:56 CST 2011) with lapw0 (90/99 to go)
>
> cycle 1 (Thu Nov 10 15:57:56 CST 2011) (90/99 to go)
>
>> lapw0 -p (15:57:56) starting parallel lapw0 at Thu Nov 10 15:57:56 CST 2011
> -------- .machine0 : 16 processors
> 0.080u 0.140s 0:03.81 5.7% 0+0k 0+0io 0pf+0w
>> lapw1 -p (15:58:00) starting parallel lapw1 at Thu Nov 10 15:58:00 CST 2011
> -> starting parallel LAPW1 jobs at Thu Nov 10 15:58:00 CST 2011
> running LAPW1 in parallel mode (using .machines)
> 16 number_of_parallel_jobs
> oliver1(380) 26.000u 3.270s 1:29.71 32.63% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.980u 4.220s 1:38.81 30.56% 0+0k 0+0io 0pf+0w
> oliver1(380) 26.380u 3.650s 1:48.06 27.79% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.940u 3.190s 1:28.54 32.90% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.610u 3.340s 1:59.38 24.25% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.890u 3.260s 1:51.26 26.20% 0+0k 0+0io 0pf+0w
> oliver1(380) 26.030u 3.460s 1:44.84 28.13% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.890u 3.100s 1:45.37 27.51% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.370u 3.340s 1:50.97 25.87% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.850u 4.720s 1:53.82 26.86% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.710u 3.120s 1:35.85 30.08% 0+0k 0+0io 0pf+0w
> oliver1(380) 26.060u 3.390s 1:44.48 28.19% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.480u 3.310s 1:46.31 27.08% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.430u 3.360s 1:46.49 27.03% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.580u 3.160s 1:48.18 26.56% 0+0k 0+0io 0pf+0w
> oliver1(380) 25.250u 3.270s 1:46.57 26.76% 0+0k 0+0io 0pf+0w
> oliver1(1) 0.270u 0.010s 0.47 58.95% 0+0k 0+0io 0pf+0w
> oliver1(1) 0.240u 0.040s 0.47 59.20% 0+0k 0+0io 0pf+0w
> oliver1(1) 0.240u 0.020s 0.47 54.74% 0+0k 0+0io 0pf+0w
> oliver1(1) 0.250u 0.000s 0.76 32.85% 0+0k 0+0io 0pf+0w
> Summary of lapw1para:
> oliver1 k=6084 user=413.45 wallclock=2014.58
> 0.270u 1.870s 2:06.40 1.6% 0+0k 0+0io 0pf+0w
>> lapw2 -p (16:00:07) running LAPW2 in parallel mode
> oliver1 12.990u 0.970s 16.61 84.05% 0+0k 0+0io 0pf+0w
> oliver1 13.890u 1.780s 24.15 64.86% 0+0k 0+0io 0pf+0w
> oliver1 13.140u 1.360s 28.73 50.46% 0+0k 0+0io 0pf+0w
> oliver1 14.740u 4.600s 54.22 35.67% 0+0k 0+0io 0pf+0w
> oliver1 14.060u 1.030s 52.49 28.74% 0+0k 0+0io 0pf+0w
> oliver1 13.330u 1.000s 57.72 24.83% 0+0k 0+0io 0pf+0w
> oliver1 14.340u 0.870s 1:05.97 23.05% 0+0k 0+0io 0pf+0w
> oliver1 13.420u 1.040s 1:06.51 21.74% 0+0k 0+0io 0pf+0w
> oliver1 13.850u 1.050s 1:13.63 20.23% 0+0k 0+0io 0pf+0w
> oliver1 13.320u 1.110s 1:07.55 21.36% 0+0k 0+0io 0pf+0w
> oliver1 13.100u 1.100s 1:11.05 19.98% 0+0k 0+0io 0pf+0w
> oliver1 13.980u 1.000s 1:09.53 21.54% 0+0k 0+0io 0pf+0w
> oliver1 12.980u 1.170s 1:06.62 21.24% 0+0k 0+0io 0pf+0w
> oliver1 13.150u 1.300s 1:07.4 21.44% 0+0k 0+0io 0pf+0w
> oliver1 13.940u 0.980s 1:07.78 22.01% 0+0k 0+0io 0pf+0w
> oliver1 13.280u 1.080s 1:03.40 22.65% 0+0k 0+0io 0pf+0w
> oliver1 0.080u 0.100s 3.32 5.41% 0+0k 0+0io 0pf+0w
> oliver1 0.110u 0.040s 3.17 4.72% 0+0k 0+0io 0pf+0w
> oliver1 0.090u 0.040s 2.43 5.34% 0+0k 0+0io 0pf+0w
> oliver1 0.110u 0.030s 2.52 5.54% 0+0k 0+0io 0pf+0w
> Summary of lapw2para:
> oliver1 user=217.9 wallclock=15710.7
> 3.670u 5.790s 1:34.56 10.0% 0+0k 0+0io 5pf+0w
>> lcore (16:01:41) 0.030u 0.000s 0:00.16 18.7% 0+0k 0+0io 0pf+0w
>> mixer (16:01:42) 0.030u 0.040s 0:00.28 25.0% 0+0k 0+0io 0pf+0w
> :ENERGY convergence: 0 0 .0169862050000000
> :CHARGE convergence: 0 0.0001 .2566193
> ec cc and fc_conv 1 0 1
>
>
> Regards,
> Chinedu Ekuma Ekuma
> Department of Physics and Astronomy
> Louisiana State University
> 202 Nicholson Hall, Tower Dr
> Baton Rouge, Louisiana, 70803-4001
> Phone (Mobile): +12254390766
>
> ...The Ways of God are Mysterious
> As Always
> I wish you God's PANACEA
>
>
> ------------------------------------------------------------------------
> From: Laurence Marks <L-marks at northwestern.edu>
> To: A Mailing list for WIEN2k users <wien at zeus.theochem.tuwien.ac.at>
> Sent: Saturday, November 12, 2011 2:14 PM
> Subject: Re: [Wien] Problem Running Parallel Job
>
> Wien2k does work in parallel, so if it does not for you, it means that
> a) The script is wrong
> b) You do not have mpi compiled
> c) Your OS/pbs is different
> d) Something else
>
> Without more information nobody can help you -- saying "it does
> not work" is not enough.
>
> 2011/11/12 Chinedu Ekuma <panaceamee at yahoo.com>:
> > Dear Wien2k Users,
> > We recently installed wien2k version 11 on our cluster to run in parallel.
> > Below is the PBS script we use for it, but our computer administrator has
> > said that it does not run in parallel. Could you kindly help us
> > rectify the problem? Thanks in anticipation for your help.
> >
> >
> > #PBS -l walltime=14:20:0
> > #PBS -j oe
> > #PBS -N
> > #This is necessary on my pbs cluster:
> > #setenv SCRATCH /scratch/
> >
> > # change into your working directory
> > cd "$WORKDIR"
> >
> > #start creating .machines
> > cat $PBS_NODEFILE |cut -c1-7 >.machines_current
> > set aa=`wc .machines_current`
> > echo '#' > .machines
> >
> > # example for an MPI parallel lapw0
> > echo -n 'lapw0:' >> .machines
> > set i=1
> > while ($i < $aa[1] )
> > echo -n `cat $PBS_NODEFILE |head -$i | tail -1` ' ' >>.machines
> > @ i ++
> > end
> > echo `cat $PBS_NODEFILE |head -$i|tail -1` ' ' >>.machines
> >
> > #example for k-point parallel lapw1/2
> > set i=1
> > while ($i <= $aa[1] )
> > echo -n '1:' >>.machines
> > head -$i .machines_current |tail -1 >> .machines
> > @ i ++
> > end
> > echo 'granularity:1' >>.machines
> > echo 'extrafine:1' >>.machines
> >
> > #define here your WIEN2k command
> >
> > runsp_lapw -p -i 40 -cc 0.0001 -I
> >
> >
> > Regards,
> > Chinedu Ekuma Ekuma
> >
> >
> >
>
>
>
> --
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu 1-847-491-3996
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Gyorgi
>
>
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
--
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------