[Wien] k-point parallel problem
Peter Blaha
pblaha at zeus.theochem.tuwien.ac.at
Thu Jan 6 08:26:09 CET 2005
You have a problem with your queuing system. Apparently there is a time
limit of 60 seconds, so small jobs (Si) run, but longer ones don't.
You need to check with your system manager.
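
A minimal sketch of how such a limit might be inspected and raised, assuming a PBS/Torque installation and the queue name "high" from the submission quoted below. The 02:00:00 values are placeholders rather than recommendations, and the site may cap them; the system manager has the final word.

```shell
#!/bin/sh
# Hypothetical PBS job-script header; the 02:00:00 values are placeholders.
#PBS -q high
#PBS -l nodes=1:ppn=1
#PBS -l cput=02:00:00
#PBS -l walltime=02:00:00

# CPU-time limit (in seconds, or "unlimited") inherited by this shell.
cpu_limit=$(ulimit -t)
echo "CPU time limit seen by the job: $cpu_limit"

# If qstat is on the PATH, list the resource limits set on the queue.
if command -v qstat >/dev/null 2>&1; then
    qstat -Qf high | grep -i 'resources' || echo "no resource limits listed for queue high"
fi
```

If the value printed inside the job is much smaller than the one seen in an interactive login, the limit is most likely imposed by the queue configuration and not by WIEN2k.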
> Dear Peter Blaha,
>
> This time we calculated bulk Si, which is much smaller. There is no problem with the k-point parallel run; only at the end the state flag is "E" when we use "qstat -a". We used 12 nodes and 24 CPUs, with 74 k-points.
> Then we tested a larger system (16 atoms).
> 1. We ran the program directly on the master node with "run_lapw -i 1 -ec 0.0001"; there was no problem.
> 2. Then we submitted the job through PBS and ran it in serial mode:
> qsub ./wienifc.pbs -l nodes=1:ppn=1 -q high -N sisiy4
> The command line in wienifc.pbs is: "run_lapw -i 1 -ec 0.0001 >run.output"
> I used the "top" command to watch the process. It gave an error message after lapw1c had run for 1 min:
> "You should submit the program use pbs."
>
> the dayfile is:
>
> Calculating sisiy4 in /people/gong/lyhua/ADP/sisiy4
> on comp10 with PID 19621
>
> start (Thu Jan 6 11:13:48 CST 2005) with lapw0 (1/20 to go)
> > lapw0 (11:13:48) 31.840u 0.310s 0:32.58 98.6% 0+0k 0+0io 172pf+0w
> > lapw1 -c (11:14:20) Killed
> 38.870u 0.680s 0:39.85 99.2% 0+0k 0+0io 238pf+0w
>
> > stop error
> ~
> the lapw1.error is
>
> Error in LAPW1
>
> 3. Then we tested the parallel mode:
> qsub ./wienifc.pbs -l nodes=2:ppn=1 -q high -N sisiy4
> The command line in wienifc.pbs is:
> "run_lapw -i 1 -ec 0.0001 -p > run.output"
> As above, it gave an error message:
> "You should submit the program use pbs"
>
> ______________________________
> the case.dayfile is:
>
>
> Calculating sisiy4 in /people/gong/lyhua/ADP/sisiy4
> on comp10 with PID 16465
>
> start (Thu Jan 6 10:38:00 CST 2005) with lapw0 (1/20 to go)
> > lapw0 -p (10:38:00) starting parallel lapw0 at Thu Jan 6 10:38:00 CST 2005
> --------
> running lapw0 in single mode
> 32.020u 0.350s 0:33.17 97.5% 0+0k 0+0io 1823pf+0w
> > lapw1 -c -p (10:38:33) starting parallel lapw1 at Thu Jan 6 10:38:33 CST 2005
> -> starting parallel LAPW1 jobs at Thu Jan 6 10:38:33 CST 2005
> running LAPW1 in parallel mode (using .machines)
> 2 number_of_parallel_jobs
> ** LAPW1 crashed!
> 0.150u 0.220s 1:28.34 0.4% 0+0k 0+0io 21900pf+0w
>
> > stop error
> _______________________________________
>
> the lapw1.error is:
>
>
> ** Error in Parallel LAPW1
> ** LAPW1 STOPPED at Thu Jan 6 10:40:01 CST 2005
> ** check ERROR FILES!
> Error in LAPW1
> Error in LAPW1
> _______________________________________
>
> the end of case.output1 is:
>
> 1.3497807 1.3891706 1.3891808 1.3933397 1.3935096
> 1.4114150 1.4118608 1.4128885 1.4133827
> 0 EIGENVALUES BELOW THE ENERGY -7.00000
> ********************************************************
>
> NUMBER OF K-POINTS: 1
> ===> TOTAL CPU TIME: 204.3 (INIT = 0.8 + K-POINTS = 203.6)
> > SUM OF WALL CLOCK TIMES: 206.6 (INIT = 1.0 + K-POINTS = 205.5)
> Maximum WALL clock time: 207.062600851059
> Maximum CPU time: 204.480000000000
> ________________________________________
>
> the case.scf1 is empty.
>
>
>
>
> >First: please use a number of processors that is compatible with the
> >number of k-points you have (check case.klist). What I mean is: I suppose
> >you have 18 k-points, so reasonable machine numbers are 18 (each processor
> >does 1 k-point), 9 (each does 2), 6, 3 or 2.
> >Of course you can use 16, but the program will not be faster (in fact even
> >slower, since summation takes longer) than with 9 processors.
> >
> >Second: Your message is not quite clear: it failed when run the lapw1para...
> > There is no problem when run lapw1 in parral. ???
> >
> >Third: In your script please change (sorry, it was incorrect on the faq page
> > but it should not be responsible for the problems)
> >echo 'extrafine' >>.machines to
> >echo 'extrafine:1' >>.machines
> >
> >Fourth: Could a time limit be the cause of these problems? Increase the
> >cpu-time limit of the PBS job.
> >In addition, your PBS job should produce output and error files, and they may
> >contain further information.
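
For the .machines correction in the third point above, a hedged sketch of how a job script might build the file for a k-point-parallel run. It assumes the batch system exports PBS_NODEFILE with one host name per line; the helper name write_machines is hypothetical, and the "1:host" and "granularity:1" lines follow the usual k-point-parallel layout rather than anything stated in this thread.

```shell
#!/bin/sh
# write_machines NODEFILE: print a k-point-parallel .machines file,
# one "1:host" line per non-empty line of NODEFILE (hypothetical helper).
write_machines() {
    while read -r host; do
        if [ -n "$host" ]; then
            echo "1:$host"
        fi
    done < "$1"
    echo 'granularity:1'
    echo 'extrafine:1'    # the corrected form, with the ":1" suffix
}

# Inside a PBS job, PBS_NODEFILE lists the hosts assigned to the job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    write_machines "$PBS_NODEFILE" > .machines
fi
```

Outside a batch job the last block does nothing, so the function can be tried by hand on any file containing host names.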
> >
> >
> >
> > P.Blaha
> >--------------------------------------------------------------------------
> >Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> >Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
> >Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
> >--------------------------------------------------------------------------
> >
> >_______________________________________________
> >Wien mailing list
> >Wien at zeus.theochem.tuwien.ac.at
> >http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
> liyh
> lyhua at fudan.edu.cn
> 2005-01-06
>
>
>
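
On the quoted advice about matching the number of parallel jobs to the number of k-points: a small sketch that lists the job counts dividing a given k-point count evenly. The function name compatible_counts is made up for illustration; the k-point count itself should be taken from case.klist.

```shell
#!/bin/sh
# compatible_counts NKPT: print every job count that divides NKPT evenly,
# so that each parallel job receives the same number of k-points.
compatible_counts() {
    nkpt=$1
    n=1
    result=""
    while [ "$n" -le "$nkpt" ]; do
        if [ $((nkpt % n)) -eq 0 ]; then
            result="$result $n"
        fi
        n=$((n + 1))
    done
    # Unquoted on purpose: drops the leading blank.
    echo $result
}

compatible_counts 18   # prints: 1 2 3 6 9 18
compatible_counts 74   # prints: 1 2 37 74
```

For the Si run with 74 k-points, 24 is not in the list, so 24 CPUs cannot share the k-points evenly.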
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------