[Wien] Parallel job failed

Torsten Andersen thor at physik.uni-kl.de
Mon Feb 16 12:15:53 CET 2004


Usually, a nonzero exit status from a program (the "Exit -1" lines in
your dayfile) indicates that it has terminated with a problem. Here it
can come from either $remote or lapw1. Check your lapw1_1.error (and
the other error files): do they contain messages such as "out of
memory"? In any case, check your environment, and check that $remote
(whatever it is set to) is a valid program that can log into th18
without a password.
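
The two checks can be scripted. What follows is only a minimal sketch in Python, not part of WIEN2k: the function names are made up for illustration, and the login test assumes that $remote is ssh (if it is rsh or something else, that part does not apply). It lists the *.error files in the case directory that are not empty, and tries a login with BatchMode=yes, which makes ssh fail instead of asking for a password.

```python
import subprocess
from pathlib import Path


def nonempty_error_files(case_dir="."):
    """Return the sorted *.error files in case_dir that are not empty."""
    return sorted(
        p for p in Path(case_dir).glob("*.error") if p.stat().st_size > 0
    )


def ssh_batch_login_ok(host, timeout=10):
    """Return True if 'ssh host true' works without a password prompt.

    Assumption: $remote is ssh. BatchMode=yes disables password prompts,
    so a host that would ask for a password is reported as False.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection hung
        return False
    return result.returncode == 0
```

Run from the case directory, nonempty_error_files(".") gives the error files worth reading first, and ssh_batch_login_ok("th18") should return True on a node where the parallel scripts are able to start their remote jobs.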

Best regards,
Torsten Andersen.

Hu, Qing Miao wrote:
> Dear WIEN2k users:
> 
> I have run parallel (k-point) jobs on Compaq machines (OS: Tru64 UNIX V5.1A,
> with the Compaq Fortran compiler f90, 4 CPUs per node), and encountered a
> 'strange' problem. The job was submitted through PBS. After several SCF
> cycles, the 'qstat' command showed that the job was still running, but the
> 'top' command showed only 'lapw1para' or 'lapw2para' sleeping on the machine
> and none of its children 'lapw1' or 'lapw2'. I have tried the job again with
> the same input files; in some cases it finished successfully, but in other
> cases the same problem occurred again. Has anybody had a similar problem, or
> does anybody have an idea about it? I would be very grateful for your help.
> My '.machines' file and the 'case.dayfile' are attached for your reference.
> 
> Sincerely yours,
> Qing Miao Hu
> 
> *********** .machines ***********
> # This is a valid .machines file
> #
> granularity:1
> 1:th18
> 1:th18
> 1:th18
> 1:th18
> 
> ************ case.dayfile ***********
> Calculating SiteBrd in /scratch/qmhu/theta03/PBE/SiteBrd
> on th18.rz-berlin.mpg.de
> 
>     start       (Mon Feb 16 10:08:13 CET 2004) with lapw0 (50/20 to go)
> 
>>  lapw0 -p    (10:08:13) starting parallel lapw0 at Mon Feb 16 10:08:13 CET 2004
> --------
> running lapw0 in single mode
> 162.79u 2.67s 2:45 99% 0+839k 110+2270io 106pf+0w
> 
>>  lapw1  -p   (10:10:59) starting parallel lapw1 at Mon Feb 16 10:10:59 CET 2004
> ->  starting parallel LAPW1 jobs at Mon Feb 16 10:10:59 CET 2004
> running LAPW1 in parallel mode (using .machines)
> 4 number_of_parallel_jobs
> [1] 243331
> [2] 243492
> [1]  - Exit -1              ( $remote $machine[$p]  ...
> [3] 243377
> [2]  - Exit -1              ( $remote $machine[$p]  ...
> [4] 243523
> [3]  - Exit -1              ( $remote $machine[$p]  ...
> [4]  + Exit -1              ( $remote $machine[$p]  ...
> 
> 
> 
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> 

-- 
Dr. Torsten Andersen                     TA-web: http://deep.at/myspace/
AG Hübner, Department of Physics, Kaiserslautern University, and
Condensed Matter Theory Group, Department of Physics, Uppsala University
Web: http://www.fysik4.fysik.uu.se/         http://www.physik.uni-kl.de/ 





