[Wien] k-point parallel problem

Fri Jan 7 02:16:44 CET 2005

Dear Blaha,

        Could you tell me which parameter we should change in our queuing system? We are using a PBS system.
Then I will discuss it with our system manager.

Best wishes to you!
Yonghua

>You have a problem with your queuing system. Apparently there is a time
>limit of 60 seconds, so small jobs (Si) run, but longer ones don't.
>You need to check with your system manager.
>
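
A minimal sketch of how one might look for such a limit, assuming a PBS/Torque installation where "qstat -Q -f <queue>" prints attributes such as resources_max.cput or resources_max.walltime as "name = HH:MM:SS". The queue name "high" is taken from the qsub lines quoted below; the helper names to_seconds and short_limits and the 120-second threshold are made up for illustration. Which attribute (if any) is set on a given cluster is a site decision, so the system manager has the final word.

```shell
#!/bin/bash
# Convert a PBS time value (HH:MM:SS, MM:SS or plain seconds) to seconds.
to_seconds() {
    local IFS=: total=0 part
    for part in $1; do
        # 10# forces base 10 so that fields like "08" are not read as octal
        total=$(( total * 60 + 10#$part ))
    done
    echo "$total"
}

# Read "qstat -Q -f" style output on stdin and print every cput, pcput or
# walltime resource entry shorter than the threshold (seconds) given as $1.
short_limits() {
    local threshold=$1 name eq value secs
    while read -r name eq value; do
        case $name in
            resources_*.cput|resources_*.pcput|resources_*.walltime)
                secs=$(to_seconds "$value")
                if [ "$secs" -lt "$threshold" ]; then
                    echo "$name = $value (${secs}s)"
                fi
                ;;
        esac
    done
}

# Usage on the cluster (queue name "high" as in the qsub lines):
#   qstat -Q -f high | short_limits 120
```

If the short limit turns out to be a queue default rather than a maximum, a longer one can usually be requested at submission time, for example with "-l cput=..." or "-l walltime=..." on the qsub line; whether that is permitted depends on the site configuration.
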
>> Dear Peter Blaha,
>> 
>>        This time we calculated bulk Si, which is much smaller. There is no problem with the k-point parallelization; only at the end the job state flag is "E" when we check with "qstat -a". We used 12 nodes and 24 CPUs with 74 k-points.
>>        Then we tested a larger system (16 atoms).
>> 1. We ran the program directly on the master node with "run_lapw -i 1 -ec 0.0001"; there was no problem.
>> 2. Then we submitted the program through PBS and ran it in serial mode.
>>    qsub ./wienifc.pbs -l nodes=1:ppn=1 -q high -N sisiy4
>>    the command line in wienifc.pbs is: "run_lapw -i 1 -ec 0.0001 >run.output"
>>    I used the "top" command to watch the process. It gave an error message after lapw1c had run for 1 min:
>>   "You should submit the program use pbs."
>> 
>> the dayfile is:
>> 
>> Calculating sisiy4 in /people/gong/lyhua/ADP/sisiy4
>> on comp10 with PID 19621
>> 
>>     start       (Thu Jan  6 11:13:48 CST 2005) with lapw0 (1/20 to go)
>> >   lapw0       (11:13:48) 31.840u 0.310s 0:32.58 98.6% 0+0k 0+0io 172pf+0w
>> >   lapw1  -c   (11:14:20) Killed
>> 38.870u 0.680s 0:39.85 99.2%    0+0k 0+0io 238pf+0w
>> 
>> >   stop error
>> 
>> the lapw1.error is:
>> 
>> Error in LAPW1
>> 
>> 3. Then we tested parallel mode.
>>    qsub ./wienifc.pbs -l nodes=2:ppn=1 -q high -N sisiy4
>>   the command line in wienifc.pbs is:
>>   "run_lapw -i 1 -ec 0.0001 -p > run.output"
>>   As above, it gave an error message:
>>   "You should submit the program use pbs"
>> 
>> ______________________________
>> the case.dayfile is:
>> 
>> 
>> Calculating sisiy4 in /people/gong/lyhua/ADP/sisiy4
>> on comp10 with PID 16465
>> 
>>     start       (Thu Jan  6 10:38:00 CST 2005) with lapw0 (1/20 to go)
>> >   lapw0 -p    (10:38:00) starting parallel lapw0 at Thu Jan  6 10:38:00 CST 2005
>> --------
>> running lapw0 in single mode
>> 32.020u 0.350s 0:33.17 97.5%    0+0k 0+0io 1823pf+0w
>> >   lapw1  -c -p        (10:38:33) starting parallel lapw1 at Thu Jan  6 10:38:33 CST 2005
>> ->  starting parallel LAPW1 jobs at Thu Jan  6 10:38:33 CST 2005
>> running LAPW1 in parallel mode (using .machines)
>> 2 number_of_parallel_jobs
>> **  LAPW1 crashed!
>> 0.150u 0.220s 1:28.34 0.4%      0+0k 0+0io 21900pf+0w
>> 
>> >   stop error
>> _______________________________________
>> 
>> the lapw1.error is:
>> 
>> 
>>  **  Error in Parallel LAPW1
>> **  LAPW1 STOPPED at Thu Jan 6 10:40:01 CST 2005
>> **  check ERROR FILES!
>> Error in LAPW1
>> Error in LAPW1              
>> _______________________________________
>>  
>> the end of case.output1 is:
>> 
>>   1.3497807    1.3891706    1.3891808    1.3933397    1.3935096
>>       1.4114150    1.4118608    1.4128885    1.4133827
>>             0 EIGENVALUES BELOW THE ENERGY   -7.00000
>>        ********************************************************
>> 
>>        NUMBER OF K-POINTS:           1
>>    ===> TOTAL CPU       TIME:    204.3 (INIT =      0.8 + K-POINTS =    203.6)
>>    > SUM OF WALL CLOCK TIMES:    206.6 (INIT =      1.0 + K-POINTS =    205.5)
>>       Maximum WALL clock time:    207.062600851059
>>       Maximum CPU time:           204.480000000000 
>> ________________________________________
>>  
>> the case.scf1 is empty.    
>>         	
>> 
>> 
>> 
>> >First: please use a number of processors which is compatible with the
>> >number of k-points you have (check case.klist). What I mean is: I suppose
>> >you have 18 k-points, so reasonable machine-numbers are 18 (each processor
>> >does 1 k-point), 9 (each does 2), 6, 3 or 2.
>> >Of course you can use 16, but the program will not be faster (in fact even
>> >slower since summation takes longer) than with 9 processors.
>> >
>> >Second: Your message is not quite clear: it failed when run the lapw1para...
>> >        There is no problem when run lapw1 in parallel.   ???
>> >
>> >Third: In your script please change (sorry, it was incorrect on the faq page
>> >       but it should not be responsible for the problems) 
>> >echo 'extrafine' >>.machines         to
>> >echo 'extrafine:1' >>.machines
>> >                       
>> >Fourth: Could a time limit be the cause of these problems? Increase the
>> >cpu-time-limit of the pbs job. 
>> >In addition your pbs job should produce output and error files and they may 
>> >contain further information.
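
The first and third points can be sketched together for the PBS script. This is only an illustration, not the FAQ script itself: it assumes bash, assumes that $PBS_NODEFILE lists one host name per allocated processor (the usual PBS/Torque behaviour), and assumes the common .machines layout of one "1:hostname" line per k-point job followed by "granularity:1" and "extrafine:1". The function names even_job_counts and write_machines are hypothetical.

```shell
#!/bin/bash
# Print every job count <= the available processors ($2) that divides the
# number of k-points ($1) evenly, largest first.
even_job_counts() {
    local nk=$1 nproc=$2 n
    for (( n = nproc; n >= 1; n-- )); do
        if (( n <= nk && nk % n == 0 )); then
            echo "$n"
        fi
    done
    return 0
}

# Write a .machines-style file to $3: one "1:host" line for each of the
# first $2 entries of the node file $1, then granularity and extrafine.
write_machines() {
    local nodefile=$1 njobs=$2 out=$3
    head -n "$njobs" "$nodefile" | sed 's/^/1:/' > "$out"
    echo 'granularity:1' >> "$out"
    echo 'extrafine:1'   >> "$out"
}

# Usage inside a PBS job; the k-point count (18, as supposed in the reply)
# is passed in by hand here rather than read from case.klist:
#   nproc=$(wc -l < "$PBS_NODEFILE")
#   njobs=$(even_job_counts 18 "$nproc" | head -n 1)
#   write_machines "$PBS_NODEFILE" "$njobs" .machines
```

With 18 k-points and 16 processors, even_job_counts lists 9, 6, 3, 2 and 1, so the sketch would start 9 jobs. Equal shares are the rule of thumb from the reply, not a hard requirement.
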
>> >
>> >
>> >
>> >                                      P.Blaha
>> >--------------------------------------------------------------------------
>> >Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
>> >Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
>> >Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
>> >--------------------------------------------------------------------------
>> >
>> >_______________________________________________
>> >Wien mailing list
>> >Wien at zeus.theochem.tuwien.ac.at
>> >http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> 
>> = = = = = = = = = = = = = = = = = = = =
>>         Best regards!
>> 
>>         liyh
>>         lyhua at fudan.edu.cn
>>           2005-01-06
>> 
>> 
>> 
>
>
>                                      P.Blaha

= = = = = = = = = = = = = = = = = = = =

        Best regards!

        liyh
        lyhua at fudan.edu.cn
          2005-01-07