[Wien] Parallel calculation error? (single computer OK)

lin zhu zhuwien at yahoo.com.cn
Wed May 18 04:35:11 CEST 2005


Dear all,
       I am running a calculation on a supercomputer, using four machines in parallel, but it always fails with the error shown below (excerpts from the case.dayfile and nohup.out files):
case.dayfile:
*************************************************************************************
Calculating CoCNP in /home/zul3/CoCNP
on n12
    start  (Fri May 13 18:03:48 PDT 2005) with lapw0 (50/20 to go)
>   lapw0 -p (18:03:48) starting parallel lapw0 at Fri May 13 18:03:48 PDT 2005
--------
running lapw0 in single mode
34.667u 0.156s 0:35.91 96.9% 0+0k 0+0io 1pf+0w
>   lapw1  -up -p   (18:04:24) starting parallel lapw1 at Fri May 13 18:04:24 PDT 2005
->  starting parallel LAPW1 jobs at Fri May 13 18:04:24 PDT 2005
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
     192.20.110.113(6)      192.20.110.102(6)      192.20.110.103(6)      192.20.110.112(6)      192.20.110.113(3)          'unknown','formatted',0
 5,'CoCNP.in1',   'old',    'formatted',0
 6,'CoCNP.output1up','unknown','formatted',0
10,'./CoCNP.vectorup', 'unknown','unformatted',9000
11,'CoCNP.energyup', 'unknown','formatted',0
18,'CoCNP.vspup',       'old',    'formatted',0
19,'CoCNP.vnsup',       'unknown','formatted',0
20,'CoCNP.struct',         'old',    'formatted',0
21,'CoCNP.scf1up',   'unknown','formatted',0
55,'CoCNP.vec',            'unknown','formatted',0
71,'CoCNP.nshup',    'unknown','formatted',0
   Summary of lapw1para:
   192.20.110.113  k=6  user=192.2  wallclock=11538
   192.20.110.102  k=6  user=192.2  wallclock=11538
   192.20.110.103  k=6  user=192.2  wallclock=11538
   192.20.110.112  k=6  user=192.2  wallclock=11538
0.485u 0.636s 3:56.77 0.4% 0+0k 0+0io 0pf+0w
>   lapw1  -dn -p   (18:08:21) starting parallel lapw1 at Fri May 13 18:08:21 PDT 2005
->  starting parallel LAPW1 jobs at Fri May 13 18:08:21 PDT 2005
running LAPW1 in parallel mode (using .machines.help)
4 number_of_parallel_jobs
     192.20.110.113(6)      192.20.110.102(6)      192.20.110.103(6)      192.20.110.112(6)      192.20.110.113(3)    Summary of lapw1para:
   192.20.110.113  k=6  user=192.2  wallclock=11538
   192.20.110.102  k=6  user=192.2  wallclock=11538
   192.20.110.103  k=6  user=192.2  wallclock=11538
   192.20.110.112  k=6  user=192.2  wallclock=11538
0.497u 0.616s 3:56.99 0.4% 0+0k 0+0io 0pf+0w
>   lapw2 -up -p (18:12:18) running LAPW2 in parallel mode
      192.20.110.113
      192.20.110.102
      192.20.110.103
      192.20.110.112
      192.20.110.113
   Summary of lapw2para:
   192.20.110.113  user=192.2  wallclock=11724.2
   192.20.110.102  user=192.2  wallclock=11724.2
   192.20.110.103  user=192.2  wallclock=11724.2
   192.20.110.112  user=192.2  wallclock=11724.2
14.088u 0.602s 1:07.72 21.6% 0+0k 0+0io 26pf+0w
>   lapw2 -dn -p (18:13:26) running LAPW2 in parallel mode
      192.20.110.113
      192.20.110.102
      192.20.110.103
      192.20.110.112
      192.20.110.113
   Summary of lapw2para:
   192.20.110.113  user=192.2  wallclock=11724.2
   192.20.110.102  user=192.2  wallclock=11724.2
   192.20.110.103  user=192.2  wallclock=11724.2
   192.20.110.112  user=192.2  wallclock=11724.2
4.179u 0.666s 0:51.71 9.3% 0+0k 0+0io 0pf+0w
>   lcore -up (18:14:18) 0.306u 0.005s 0:00.51 58.8% 0+0k 0+0io 5pf+0w
>   lcore -dn (18:14:19) 0.305u 0.006s 0:00.45 66.6% 0+0k 0+0io 0pf+0w
>   mixer (18:14:21) 4.539u 0.156s 0:06.25 74.8% 0+0k 0+0io 10pf+0w
:ENERGY convergence:  0 0 0
:CHARGE convergence:  0 0.0001 0
49/19 to go
.................
 
  46/16 to go
>   lapw0 -p (18:45:30) starting parallel lapw0 at Fri May 13 18:45:30 PDT 2005
--------
running lapw0 in single mode
34.723u 0.128s 0:35.82 97.2% 0+0k 0+0io 0pf+0w
>   lapw1  -up -p   (18:46:06) starting parallel lapw1 at Fri May 13 18:46:06 PDT 2005
->  starting parallel LAPW1 jobs at Fri May 13 18:46:06 PDT 2005
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
     192.20.110.113(6)      192.20.110.102(6)          'unknown','formatted',0
 5,'CoCNP.in1',   'old',    'formatted',0
 6,'CoCNP.output1up','unknown','formatted',0
10,'./CoCNP.vectorup', 'unknown','unformatted',9000
11,'CoCNP.energyup', 'unknown','formatted',0
18,'CoCNP.vspup',       'old',    'formatted',0
19,'CoCNP.vnsup',       'unknown','formatted',0
20,'CoCNP.struct',         'old',    'formatted',0
21,'CoCNP.scf1up',   'unknown','formatted',0
55,'CoCNP.vec',            'unknown','formatted',0
71,'CoCNP.nshup',    'unknown','formatted',0
     192.20.110.103(6)      192.20.110.112(6)      192.20.110.113(3)    Summary of lapw1para:
    'unknown','formatted',0
   192.20.110.113  k=12  user=384.4  wallclock=11535
   192.20.110.102  k=6  user=192.2  wallclock=0
   192.20.110.103  k=6  user=192.2  wallclock=11535
   192.20.110.112  k=6  user=192.2  wallclock=11535
0.503u 0.583s 3:57.90 0.4% 0+0k 0+0io 0pf+0w
>   lapw1  -dn -p   (18:50:04) starting parallel lapw1 at Fri May 13 18:50:04 PDT 2005
->  starting parallel LAPW1 jobs at Fri May 13 18:50:04 PDT 2005
running LAPW1 in parallel mode (using .machines.help)
4 number_of_parallel_jobs
     192.20.110.113(6)      192.20.110.102(6)      192.20.110.103(6)      192.20.110.112(6)      192.20.110.113(3)    Summary of lapw1para:
   192.20.110.113  k=6  user=192.2  wallclock=11538
   192.20.110.102  k=6  user=192.2  wallclock=11538
   192.20.110.103  k=6  user=192.2  wallclock=11538
   192.20.110.112  k=6  user=192.2  wallclock=11538
0.528u 0.587s 3:56.11 0.4% 0+0k 0+0io 0pf+0w
>   lapw2 -up -p (18:54:00) running LAPW2 in parallel mode
      192.20.110.113
      192.20.110.102
      192.20.110.103
      192.20.110.112
      192.20.110.113
   Summary of lapw2para:
   192.20.110.113  user=192.2  wallclock=11724.2
   192.20.110.102  user=192.2  wallclock=11724.2
   192.20.110.103  user=192.2  wallclock=11724.2
   192.20.110.112  user=192.2  wallclock=11724.2
4.214u 0.645s 0:57.40 8.4% 0+0k 0+0io 0pf+0w
>   lapw2 -dn -p (18:54:58) running LAPW2 in parallel mode
      192.20.110.113
      192.20.110.102
      192.20.110.103
      192.20.110.112
      192.20.110.113
   Summary of lapw2para:
   192.20.110.113  user=192.2  wallclock=11724.2
   192.20.110.102  user=192.2  wallclock=11724.2
   192.20.110.103  user=192.2  wallclock=11724.2
   192.20.110.112  user=192.2  wallclock=11724.2
4.276u 0.566s 0:51.71 9.3% 0+0k 0+0io 0pf+0w
>   lcore -up (18:55:50) 0.302u 0.007s 0:00.45 66.6% 0+0k 0+0io 0pf+0w
>   lcore -dn (18:55:50) 0.302u 0.008s 0:00.45 66.6% 0+0k 0+0io 0pf+0w
>   mixer (18:55:53) 4.592u 0.227s 0:06.69 71.8% 0+0k 0+0io 0pf+0w
:ENERGY convergence:  0 0 25.9122400000000000
:CHARGE convergence:  0 0.0001 .9523060
45/15 to go
>   lapw0 -p (18:56:00) starting parallel lapw0 at Fri May 13 18:56:00 PDT 2005
--------
running lapw0 in single mode
34.770u 0.142s 0:35.85 97.3% 0+0k 0+0io 0pf+0w
>   lapw1  -up -p   (18:56:36) starting parallel lapw1 at Fri May 13 18:56:36 PDT 2005
->  starting parallel LAPW1 jobs at Fri May 13 18:56:36 PDT 2005
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
     192.20.110.113(6)      192.20.110.102(6)      192.20.110.103(6)      192.20.110.112(6)      192.20.110.113(3)    Summary of lapw1para:
    'unknown','formatted',0
   192.20.110.113  k=6  user=192.2  wallclock=11538
   192.20.110.102  k=6  user=192.2  wallclock=11538
   192.20.110.103  k=6  user=192.2  wallclock=11538
   192.20.110.112  k=6  user=192.2  wallclock=11538
0.516u 0.556s 3:57.05 0.4% 0+0k 0+0io 0pf+0w
>   lapw1  -dn -p   (19:00:34) starting parallel lapw1 at Fri May 13 19:00:34 PDT 2005
->  starting parallel LAPW1 jobs at Fri May 13 19:00:34 PDT 2005
running LAPW1 in parallel mode (using .machines.help)
4 number_of_parallel_jobs
     192.20.110.113(6)      192.20.110.102(6)      192.20.110.103(6)      192.20.110.112(6)      192.20.110.113(3)    Summary of lapw1para:
   192.20.110.113  k=6  user=192.2  wallclock=11538
   192.20.110.102  k=6  user=192.2  wallclock=11538
   192.20.110.103  k=6  user=192.2  wallclock=11538
   192.20.110.112  k=6  user=192.2  wallclock=11538
0.526u 0.545s 3:56.46 0.4% 0+0k 0+0io 0pf+0w
>   lapw2 -up -p (19:04:30) running LAPW2 in parallel mode
      192.20.110.113
      192.20.110.102
      192.20.110.103
      192.20.110.112
      192.20.110.113
   Summary of lapw2para:
   192.20.110.113  user=192.2  wallclock=11724.2
   192.20.110.102  user=192.2  wallclock=11724.2
   192.20.110.103  user=192.2  wallclock=11724.2
   192.20.110.112  user=192.2  wallclock=11724.2
4.343u 0.520s 0:57.59 8.4% 0+0k 0+0io 0pf+0w
>   lapw2 -dn -p (19:05:28) running LAPW2 in parallel mode
      192.20.110.113
      192.20.110.102
      192.20.110.103
      192.20.110.112
      192.20.110.113
   Summary of lapw2para:
   192.20.110.113  user=192.2  wallclock=11724.2
   192.20.110.102  user=192.2  wallclock=11724.2
   192.20.110.103  user=192.2  wallclock=11724.2
   192.20.110.112  user=192.2  wallclock=11724.2
4.282u 0.576s 0:52.12 9.3% 0+0k 0+0io 0pf+0w
>   lcore -up (19:06:20) 0.299u 0.007s 0:00.45 64.4% 0+0k 0+0io 0pf+0w
>   lcore -dn (19:06:21) 0.301u 0.002s 0:00.45 66.6% 0+0k 0+0io 0pf+0w
>   mixer (19:06:24) 4.579u 0.228s 0:06.68 71.7% 0+0k 0+0io 0pf+0w
:ENERGY convergence:  0 0 25.9101540000000000
:CHARGE convergence:  0 0.0001 .9395848
44/14 to go
...........
 
>41/11 to go
>   lapw0 -p (19:38:05) starting parallel lapw0 at Fri May 13 19:38:06 PDT 2005
--------
running lapw0 in single mode
34.719u 0.130s 0:35.85 97.1% 0+0k 0+0io 0pf+0w
>   lapw1  -up -p   (19:38:41) starting parallel lapw1 at Fri May 13 19:38:42 PDT 2005
->  starting parallel LAPW1 jobs at Fri May 13 19:38:42 PDT 2005
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
     192.20.110.113(6)      192.20.110.102(6)      192.20.110.103(6)      192.20.110.112(6)      192.20.110.113(3)    Summary of lapw1para:
   192.20.110.113  k=6  user=192.2  wallclock=11538
   192.20.110.102  k=6  user=192.2  wallclock=11538
   192.20.110.103  k=6  user=192.2  wallclock=11538
   192.20.110.112  k=6  user=192.2  wallclock=11538
0.499u 0.597s 3:57.62 0.4% 0+0k 0+0io 0pf+0w
>   lapw1  -dn -p   (19:42:39) starting parallel lapw1 at Fri May 13 19:42:39 PDT 2005
->  starting parallel LAPW1 jobs at Fri May 13 19:42:39 PDT 2005
running LAPW1 in parallel mode (using .machines.help)
4 number_of_parallel_jobs
     192.20.110.113(6)      192.20.110.102(6)      192.20.110.103(6)      192.20.110.112(6)      192.20.110.113(3)    Summary of lapw1para:
    'unknown','formatted',0
   192.20.110.113  k=6  user=192.2  wallclock=11538
   192.20.110.102  k=6  user=192.2  wallclock=11538
   192.20.110.103  k=6  user=192.2  wallclock=11538
   192.20.110.112  k=6  user=192.2  wallclock=11538
0.524u 0.572s 3:56.08 0.4% 0+0k 0+0io 0pf+0w
>   lapw2 -up -p (19:46:35) running LAPW2 in parallel mode
**  LAPW2 crashed!
0.031u 0.054s 0:00.46 17.3% 0+0k 0+0io 0pf+0w
>   stop error
*******************************************************************************************
nohup.out file:
**************************************************************************************
real 0m16.442s
user 0m15.235s
sys 0m0.206s
 SUMPARA END
 SUMPARA END
LAPW2 - FERMI; weighs written
What manual page do you want?
What manual page do you want?
What manual page do you want?
What manual page do you want?
 LAPW2 END
real 0m28.442s
user 0m27.135s
sys 0m0.335s
What manual page do you want?
 LAPW2 END
real 0m28.615s
user 0m27.177s
sys 0m0.367s
 LAPW2 END
real 0m28.479s
user 0m27.053s
sys 0m0.363s
 LAPW2 END
real 0m28.533s
user 0m27.121s
sys 0m0.360s
 LAPW2 END
real 0m15.251s
user 0m14.101s
sys 0m0.213s
 SUMPARA END
 SUMPARA END
 CORE  END
 CORE  END
 MIXER END
in cycle 10    ETEST: .0047225000000000   CTEST: .9605146
 LAPW0 END
What manual page do you want?
What manual page do you want?
What manual page do you want?
What manual page do you want?
 LAPW1 END
real 2m34.056s
user 2m32.098s
sys 0m1.343s
 LAPW1 END
real 2m34.199s
user 2m32.522s
sys 0m1.257s
 LAPW1 END
real 2m33.341s
user 2m31.461s
sys 0m1.460s
 LAPW1 END
real 2m33.341s
user 2m31.567s
sys 0m1.366s
What manual page do you want?
 LAPW1 END
real 1m17.194s
user 1m16.236s
sys 0m0.669s
What manual page do you want?
What manual page do you want?
What manual page do you want?
What manual page do you want?
 LAPW1 END
real 2m32.828s
user 2m31.137s
sys 0m1.306s
 LAPW1 END
real 2m32.411s
user 2m30.582s
sys 0m1.414s
 LAPW1 END
 LAPW1 END
real 2m34.797s
user 2m32.246s
sys 0m1.520s
real 2m32.347s
user 2m30.540s
sys 0m1.329s
What manual page do you want?
 LAPW1 END
real 1m16.317s
user 1m15.325s
sys 0m0.715s
PGFIO-F-231/formatted read/unit=5/error on data conversion.
 File name = CoCNP.in2    formatted, sequential access   record = 3
 In source file lapw2_tmp_.F, at line number 164
cp: cannot stat `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory
**************************************************************************************
I find that when 'unknown','formatted' appears in the dayfile, the calculation stops.
Can you tell me why 'unknown','formatted' appears in the dayfile, and how to solve it?
If I use a single computer, the same calculation works very well.
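
To list every place where the fragment shows up, the dayfile could be scanned with a small script along these lines. This is only a sketch: the function names find_stray_fragments and scan_dayfile are hypothetical, and it assumes that a complete file-definition entry starts with a unit number (as in the lines beginning with 5, 6, 10, ... above), so any other line carrying the fragment is reported.

```python
import re

# A complete file-definition entry as it appears in the excerpts,
# e.g.  6,'CoCNP.output1up','unknown','formatted',0
# (assumption: such entries always begin with a unit number and a comma).
DEF_LINE = re.compile(r"^\s*\d+\s*,\s*'[^']*'\s*,")

# The fragment that shows up on its own or glued to the host list.
FRAGMENT = "'unknown','formatted'"


def find_stray_fragments(lines):
    """Return (line_number, text) pairs for lines that contain the
    fragment but are not complete file-definition entries."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if FRAGMENT in line and not DEF_LINE.match(line):
            hits.append((number, line.strip()))
    return hits


def scan_dayfile(path):
    """Read a dayfile from disk and return its stray fragments."""
    with open(path, "r", errors="replace") as handle:
        return find_stray_fragments(handle.read().splitlines())
```

Calling scan_dayfile("CoCNP.dayfile") in the case directory would return the line numbers to compare against the cycle in which LAPW2 crashed.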


 




