[Wien] Parallel job failed
Torsten Andersen
thor at physik.uni-kl.de
Mon Feb 16 12:15:53 CET 2004
Usually, a nonzero exit status from a program (the "Exit -1" lines in your
dayfile) indicates that it has terminated with a problem. Here it can come
from either $remote or lapw1. Check your lapw1_1.error (and the other error
files): do they contain "out of memory" or other messages? In any case,
check your environment, and verify that $remote (whatever it is set to) is
a valid program that can log into th18 without a password.
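The two checks above can be scripted. What follows is only a rough sketch,
not part of WIEN2k: the function names are made up for illustration, it
assumes the error files match *.error in the case directory, and it assumes
$remote is ssh (for rsh or anything else the command is simply run as-is,
without the BatchMode option).

```python
import glob
import os
import subprocess


def nonempty_error_files(case_dir="."):
    """Return {filename: contents} for every non-empty *.error file in case_dir."""
    found = {}
    for path in sorted(glob.glob(os.path.join(case_dir, "*.error"))):
        with open(path, errors="replace") as fh:
            text = fh.read().strip()
        if text:  # an empty error file is treated as "no problem reported"
            found[os.path.basename(path)] = text
    return found


def passwordless_login_ok(host, remote="ssh", timeout=15):
    """Return True if `remote host true` succeeds without any interaction.

    `remote` stands in for whatever $remote is set to (an assumption here).
    """
    cmd = [remote]
    if os.path.basename(remote) == "ssh":
        # BatchMode makes ssh fail instead of prompting for a password.
        cmd += ["-o", "BatchMode=yes"]
    cmd += [host, "true"]
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never wait for typed input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Program missing or not executable, or the login hung.
        return False
    return result.returncode == 0
```

For example, run nonempty_error_files("/scratch/qmhu/theta03/PBE/SiteBrd")
and passwordless_login_ok("th18") from within the PBS job itself, since the
environment there may differ from your interactive shell.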
Best regards,
Torsten Andersen.
Hu, Qing Miao wrote:
> Dear WIEN2k users:
>
> I have run parallel (k-point) jobs on Compaq machines (OS: Tru64 UNIX V5.1a,
> with the Compaq Fortran compiler f90, 4 CPUs per node), and encountered a
> 'strange' problem. The job was submitted through PBS. After a number of SCF
> cycles, 'qstat' showed that the job was still running, but 'top' showed only
> 'lapw1para' or 'lapw2para' sleeping on the machine, with none of its
> children 'lapw1' or 'lapw2'. I have tried the job again with the same input
> files; in some cases it finished successfully, but in other cases the same
> problem occurred again. Has anybody had a similar problem, or any idea
> about it? I would be very grateful for your help.
> My '.machines' file and the 'case.dayfile' are attached for your reference.
>
> Sincerely yours,
> Qing Miao Hu
>
> *********** .machines ***********
> # This is a valid .machines file
> #
> granularity:1
> 1:th18
> 1:th18
> 1:th18
> 1:th18
>
> ************ case.dayfile ***********
> Calculating SiteBrd in /scratch/qmhu/theta03/PBE/SiteBrd
> on th18.rz-berlin.mpg.de
>
> start (Mon Feb 16 10:08:13 CET 2004) with lapw0 (50/20 to go)
>
> > lapw0 -p (10:08:13) starting parallel lapw0 at Mon Feb 16 10:08:13 CET 2004
> --------
> running lapw0 in single mode
> 162.79u 2.67s 2:45 99% 0+839k 110+2270io 106pf+0w
>
> > lapw1 -p (10:10:59) starting parallel lapw1 at Mon Feb 16 10:10:59 CET 2004
> -> starting parallel LAPW1 jobs at Mon Feb 16 10:10:59 CET 2004
> running LAPW1 in parallel mode (using .machines)
> 4 number_of_parallel_jobs
> [1] 243331
> [2] 243492
> [1] - Exit -1 ( $remote $machine[$p] ...
> [3] 243377
> [2] - Exit -1 ( $remote $machine[$p] ...
> [4] 243523
> [3] - Exit -1 ( $remote $machine[$p] ...
> [4] + Exit -1 ( $remote $machine[$p] ...
>
>
>
--
Dr. Torsten Andersen TA-web: http://deep.at/myspace/
AG Hübner, Department of Physics, Kaiserslautern University, and
Condensed Matter Theory Group, Department of Physics, Uppsala University
Web: http://www.fysik4.fysik.uu.se/ http://www.physik.uni-kl.de/