[Wien] Parallel job failed

Hu, Qing Miao qmhu at fhi-berlin.mpg.de
Mon Feb 16 10:48:47 CET 2004


Dear WIEN2k users:

I have been running k-point parallel jobs on Compaq machines (OS: Tru64 UNIX
V5.1A, Compaq Fortran compiler f90, 4 CPUs per node) and have run into a
'strange' problem. The jobs are submitted through PBS. After a number of SCF
cycles, 'qstat' shows that the job is still running, but 'top' shows only
'lapw1para' or 'lapw2para' sleeping on the machine, with none of its child
processes 'lapw1' or 'lapw2'. I have rerun the job with the same input files:
in some cases it finished successfully, but in others the same problem
occurred again. Has anybody seen a similar problem, or does anyone have an
idea what causes it? I would be very grateful for your help.
My '.machines' file and 'case.dayfile' are included below for reference.

Sincerely yours,
Qing Miao Hu

*********** .machines ***********
# This is a valid .machines file
#
granularity:1
1:th18
1:th18
1:th18
1:th18

************ case.dayfile ***********
Calculating SiteBrd in /scratch/qmhu/theta03/PBE/SiteBrd
on th18.rz-berlin.mpg.de

    start       (Mon Feb 16 10:08:13 CET 2004) with lapw0 (50/20 to go)
>   lapw0 -p    (10:08:13) starting parallel lapw0 at Mon Feb 16 10:08:13 CET 2004
--------
running lapw0 in single mode
162.79u 2.67s 2:45 99% 0+839k 110+2270io 106pf+0w
>   lapw1  -p   (10:10:59) starting parallel lapw1 at Mon Feb 16 10:10:59 CET 2004
->  starting parallel LAPW1 jobs at Mon Feb 16 10:10:59 CET 2004
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 243331
[2] 243492
[1]  - Exit -1              ( $remote $machine[$p]  ...
[3] 243377
[2]  - Exit -1              ( $remote $machine[$p]  ...
[4] 243523
[3]  - Exit -1              ( $remote $machine[$p]  ...
[4]  + Exit -1              ( $remote $machine[$p]  ...
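
All four background jobs in the dayfile above are reported by the shell as
'Exit -1'. As a minimal sketch of how such lines could be picked out of a
dayfile automatically, the Python snippet below scans the text for job-control
lines of the form '[n] ... Exit <status>'. The function name failed_jobs and
the regular expression are illustrative assumptions based only on the lines
shown here; they are not part of WIEN2k.

```python
import re

# Shell job-control line as it appears in the dayfile above, e.g.
# "[1]  - Exit -1              ( $remote $machine[$p]  ..."
# (pattern is an assumption derived from this one log, not a WIEN2k format spec)
EXIT_LINE = re.compile(r"^\[(\d+)\]\s+[-+]?\s*Exit\s+(-?\d+)")


def failed_jobs(dayfile_text):
    """Return {job_number: exit_status} for each job reported as 'Exit <status>'."""
    failures = {}
    for line in dayfile_text.splitlines():
        match = EXIT_LINE.match(line)
        if match:
            failures[int(match.group(1))] = int(match.group(2))
    return failures


# Job lines copied from the dayfile above.
sample = """\
[1] 243331
[2] 243492
[1]  - Exit -1              ( $remote $machine[$p]  ...
[3] 243377
[2]  - Exit -1              ( $remote $machine[$p]  ...
[4] 243523
[3]  - Exit -1              ( $remote $machine[$p]  ...
[4]  + Exit -1              ( $remote $machine[$p]  ...
"""

print(failed_jobs(sample))
```

Lines such as '[1] 243331' only announce the process ID of a started job and
are skipped; only the 'Exit' lines are collected.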





