[Wien] parallel calculation error? (single computer ok)
lin zhu
zhuwien at yahoo.com.cn
Wed May 18 04:35:11 CEST 2005
Dear all:
I am running a calculation on a supercomputer, in parallel across four machines, but it always fails with an error like the following (excerpts from the case.dayfile and nohup.out files):
case.dayfile:
*************************************************************************************
Calculating CoCNP in /home/zul3/CoCNP
on n12
start (Fri May 13 18:03:48 PDT 2005) with lapw0 (50/20 to go)
> lapw0 -p (18:03:48) starting parallel lapw0 at Fri May 13 18:03:48 PDT 2005
--------
running lapw0 in single mode
34.667u 0.156s 0:35.91 96.9% 0+0k 0+0io 1pf+0w
> lapw1 -up -p (18:04:24) starting parallel lapw1 at Fri May 13 18:04:24 PDT 2005
-> starting parallel LAPW1 jobs at Fri May 13 18:04:24 PDT 2005
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
192.20.110.113(6) 192.20.110.102(6) 192.20.110.103(6) 192.20.110.112(6) 192.20.110.113(3) 'unknown','formatted',0
5,'CoCNP.in1', 'old', 'formatted',0
6,'CoCNP.output1up','unknown','formatted',0
10,'./CoCNP.vectorup', 'unknown','unformatted',9000
11,'CoCNP.energyup', 'unknown','formatted',0
18,'CoCNP.vspup', 'old', 'formatted',0
19,'CoCNP.vnsup', 'unknown','formatted',0
20,'CoCNP.struct', 'old', 'formatted',0
21,'CoCNP.scf1up', 'unknown','formatted',0
55,'CoCNP.vec', 'unknown','formatted',0
71,'CoCNP.nshup', 'unknown','formatted',0
Summary of lapw1para:
192.20.110.113 k=6 user=192.2 wallclock=11538
192.20.110.102 k=6 user=192.2 wallclock=11538
192.20.110.103 k=6 user=192.2 wallclock=11538
192.20.110.112 k=6 user=192.2 wallclock=11538
0.485u 0.636s 3:56.77 0.4% 0+0k 0+0io 0pf+0w
> lapw1 -dn -p (18:08:21) starting parallel lapw1 at Fri May 13 18:08:21 PDT 2005
-> starting parallel LAPW1 jobs at Fri May 13 18:08:21 PDT 2005
running LAPW1 in parallel mode (using .machines.help)
4 number_of_parallel_jobs
192.20.110.113(6) 192.20.110.102(6) 192.20.110.103(6) 192.20.110.112(6) 192.20.110.113(3) Summary of lapw1para:
192.20.110.113 k=6 user=192.2 wallclock=11538
192.20.110.102 k=6 user=192.2 wallclock=11538
192.20.110.103 k=6 user=192.2 wallclock=11538
192.20.110.112 k=6 user=192.2 wallclock=11538
0.497u 0.616s 3:56.99 0.4% 0+0k 0+0io 0pf+0w
> lapw2 -up -p (18:12:18) running LAPW2 in parallel mode
192.20.110.113
192.20.110.102
192.20.110.103
192.20.110.112
192.20.110.113
Summary of lapw2para:
192.20.110.113 user=192.2 wallclock=11724.2
192.20.110.102 user=192.2 wallclock=11724.2
192.20.110.103 user=192.2 wallclock=11724.2
192.20.110.112 user=192.2 wallclock=11724.2
14.088u 0.602s 1:07.72 21.6% 0+0k 0+0io 26pf+0w
> lapw2 -dn -p (18:13:26) running LAPW2 in parallel mode
192.20.110.113
192.20.110.102
192.20.110.103
192.20.110.112
192.20.110.113
Summary of lapw2para:
192.20.110.113 user=192.2 wallclock=11724.2
192.20.110.102 user=192.2 wallclock=11724.2
192.20.110.103 user=192.2 wallclock=11724.2
192.20.110.112 user=192.2 wallclock=11724.2
4.179u 0.666s 0:51.71 9.3% 0+0k 0+0io 0pf+0w
> lcore -up (18:14:18) 0.306u 0.005s 0:00.51 58.8% 0+0k 0+0io 5pf+0w
> lcore -dn (18:14:19) 0.305u 0.006s 0:00.45 66.6% 0+0k 0+0io 0pf+0w
> mixer (18:14:21) 4.539u 0.156s 0:06.25 74.8% 0+0k 0+0io 10pf+0w
:ENERGY convergence: 0 0 0
:CHARGE convergence: 0 0.0001 0
49/19 to go
.................
46/16 to go
> lapw0 -p (18:45:30) starting parallel lapw0 at Fri May 13 18:45:30 PDT 2005
--------
running lapw0 in single mode
34.723u 0.128s 0:35.82 97.2% 0+0k 0+0io 0pf+0w
> lapw1 -up -p (18:46:06) starting parallel lapw1 at Fri May 13 18:46:06 PDT 2005
-> starting parallel LAPW1 jobs at Fri May 13 18:46:06 PDT 2005
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
192.20.110.113(6) 192.20.110.102(6) 'unknown','formatted',0
5,'CoCNP.in1', 'old', 'formatted',0
6,'CoCNP.output1up','unknown','formatted',0
10,'./CoCNP.vectorup', 'unknown','unformatted',9000
11,'CoCNP.energyup', 'unknown','formatted',0
18,'CoCNP.vspup', 'old', 'formatted',0
19,'CoCNP.vnsup', 'unknown','formatted',0
20,'CoCNP.struct', 'old', 'formatted',0
21,'CoCNP.scf1up', 'unknown','formatted',0
55,'CoCNP.vec', 'unknown','formatted',0
71,'CoCNP.nshup', 'unknown','formatted',0
192.20.110.103(6) 192.20.110.112(6) 192.20.110.113(3) Summary of lapw1para:
'unknown','formatted',0
192.20.110.113 k=12 user=384.4 wallclock=11535
192.20.110.102 k=6 user=192.2 wallclock=0
192.20.110.103 k=6 user=192.2 wallclock=11535
192.20.110.112 k=6 user=192.2 wallclock=11535
0.503u 0.583s 3:57.90 0.4% 0+0k 0+0io 0pf+0w
> lapw1 -dn -p (18:50:04) starting parallel lapw1 at Fri May 13 18:50:04 PDT 2005
-> starting parallel LAPW1 jobs at Fri May 13 18:50:04 PDT 2005
running LAPW1 in parallel mode (using .machines.help)
4 number_of_parallel_jobs
192.20.110.113(6) 192.20.110.102(6) 192.20.110.103(6) 192.20.110.112(6) 192.20.110.113(3) Summary of lapw1para:
192.20.110.113 k=6 user=192.2 wallclock=11538
192.20.110.102 k=6 user=192.2 wallclock=11538
192.20.110.103 k=6 user=192.2 wallclock=11538
192.20.110.112 k=6 user=192.2 wallclock=11538
0.528u 0.587s 3:56.11 0.4% 0+0k 0+0io 0pf+0w
> lapw2 -up -p (18:54:00) running LAPW2 in parallel mode
192.20.110.113
192.20.110.102
192.20.110.103
192.20.110.112
192.20.110.113
Summary of lapw2para:
192.20.110.113 user=192.2 wallclock=11724.2
192.20.110.102 user=192.2 wallclock=11724.2
192.20.110.103 user=192.2 wallclock=11724.2
192.20.110.112 user=192.2 wallclock=11724.2
4.214u 0.645s 0:57.40 8.4% 0+0k 0+0io 0pf+0w
> lapw2 -dn -p (18:54:58) running LAPW2 in parallel mode
192.20.110.113
192.20.110.102
192.20.110.103
192.20.110.112
192.20.110.113
Summary of lapw2para:
192.20.110.113 user=192.2 wallclock=11724.2
192.20.110.102 user=192.2 wallclock=11724.2
192.20.110.103 user=192.2 wallclock=11724.2
192.20.110.112 user=192.2 wallclock=11724.2
4.276u 0.566s 0:51.71 9.3% 0+0k 0+0io 0pf+0w
> lcore -up (18:55:50) 0.302u 0.007s 0:00.45 66.6% 0+0k 0+0io 0pf+0w
> lcore -dn (18:55:50) 0.302u 0.008s 0:00.45 66.6% 0+0k 0+0io 0pf+0w
> mixer (18:55:53) 4.592u 0.227s 0:06.69 71.8% 0+0k 0+0io 0pf+0w
:ENERGY convergence: 0 0 25.9122400000000000
:CHARGE convergence: 0 0.0001 .9523060
45/15 to go
> lapw0 -p (18:56:00) starting parallel lapw0 at Fri May 13 18:56:00 PDT 2005
--------
running lapw0 in single mode
34.770u 0.142s 0:35.85 97.3% 0+0k 0+0io 0pf+0w
> lapw1 -up -p (18:56:36) starting parallel lapw1 at Fri May 13 18:56:36 PDT 2005
-> starting parallel LAPW1 jobs at Fri May 13 18:56:36 PDT 2005
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
192.20.110.113(6) 192.20.110.102(6) 192.20.110.103(6) 192.20.110.112(6) 192.20.110.113(3) Summary of lapw1para:
'unknown','formatted',0
192.20.110.113 k=6 user=192.2 wallclock=11538
192.20.110.102 k=6 user=192.2 wallclock=11538
192.20.110.103 k=6 user=192.2 wallclock=11538
192.20.110.112 k=6 user=192.2 wallclock=11538
0.516u 0.556s 3:57.05 0.4% 0+0k 0+0io 0pf+0w
> lapw1 -dn -p (19:00:34) starting parallel lapw1 at Fri May 13 19:00:34 PDT 2005
-> starting parallel LAPW1 jobs at Fri May 13 19:00:34 PDT 2005
running LAPW1 in parallel mode (using .machines.help)
4 number_of_parallel_jobs
192.20.110.113(6) 192.20.110.102(6) 192.20.110.103(6) 192.20.110.112(6) 192.20.110.113(3) Summary of lapw1para:
192.20.110.113 k=6 user=192.2 wallclock=11538
192.20.110.102 k=6 user=192.2 wallclock=11538
192.20.110.103 k=6 user=192.2 wallclock=11538
192.20.110.112 k=6 user=192.2 wallclock=11538
0.526u 0.545s 3:56.46 0.4% 0+0k 0+0io 0pf+0w
> lapw2 -up -p (19:04:30) running LAPW2 in parallel mode
192.20.110.113
192.20.110.102
192.20.110.103
192.20.110.112
192.20.110.113
Summary of lapw2para:
192.20.110.113 user=192.2 wallclock=11724.2
192.20.110.102 user=192.2 wallclock=11724.2
192.20.110.103 user=192.2 wallclock=11724.2
192.20.110.112 user=192.2 wallclock=11724.2
4.343u 0.520s 0:57.59 8.4% 0+0k 0+0io 0pf+0w
> lapw2 -dn -p (19:05:28) running LAPW2 in parallel mode
192.20.110.113
192.20.110.102
192.20.110.103
192.20.110.112
192.20.110.113
Summary of lapw2para:
192.20.110.113 user=192.2 wallclock=11724.2
192.20.110.102 user=192.2 wallclock=11724.2
192.20.110.103 user=192.2 wallclock=11724.2
192.20.110.112 user=192.2 wallclock=11724.2
4.282u 0.576s 0:52.12 9.3% 0+0k 0+0io 0pf+0w
> lcore -up (19:06:20) 0.299u 0.007s 0:00.45 64.4% 0+0k 0+0io 0pf+0w
> lcore -dn (19:06:21) 0.301u 0.002s 0:00.45 66.6% 0+0k 0+0io 0pf+0w
> mixer (19:06:24) 4.579u 0.228s 0:06.68 71.7% 0+0k 0+0io 0pf+0w
:ENERGY convergence: 0 0 25.9101540000000000
:CHARGE convergence: 0 0.0001 .9395848
44/14 to go
...........
>41/11 to go
> lapw0 -p (19:38:05) starting parallel lapw0 at Fri May 13 19:38:06 PDT 2005
--------
running lapw0 in single mode
34.719u 0.130s 0:35.85 97.1% 0+0k 0+0io 0pf+0w
> lapw1 -up -p (19:38:41) starting parallel lapw1 at Fri May 13 19:38:42 PDT 2005
-> starting parallel LAPW1 jobs at Fri May 13 19:38:42 PDT 2005
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
192.20.110.113(6) 192.20.110.102(6) 192.20.110.103(6) 192.20.110.112(6) 192.20.110.113(3) Summary of lapw1para:
192.20.110.113 k=6 user=192.2 wallclock=11538
192.20.110.102 k=6 user=192.2 wallclock=11538
192.20.110.103 k=6 user=192.2 wallclock=11538
192.20.110.112 k=6 user=192.2 wallclock=11538
0.499u 0.597s 3:57.62 0.4% 0+0k 0+0io 0pf+0w
> lapw1 -dn -p (19:42:39) starting parallel lapw1 at Fri May 13 19:42:39 PDT 2005
-> starting parallel LAPW1 jobs at Fri May 13 19:42:39 PDT 2005
running LAPW1 in parallel mode (using .machines.help)
4 number_of_parallel_jobs
192.20.110.113(6) 192.20.110.102(6) 192.20.110.103(6) 192.20.110.112(6) 192.20.110.113(3) Summary of lapw1para:
'unknown','formatted',0
192.20.110.113 k=6 user=192.2 wallclock=11538
192.20.110.102 k=6 user=192.2 wallclock=11538
192.20.110.103 k=6 user=192.2 wallclock=11538
192.20.110.112 k=6 user=192.2 wallclock=11538
0.524u 0.572s 3:56.08 0.4% 0+0k 0+0io 0pf+0w
> lapw2 -up -p (19:46:35) running LAPW2 in parallel mode
** LAPW2 crashed!
0.031u 0.054s 0:00.46 17.3% 0+0k 0+0io 0pf+0w
> stop error
*******************************************************************************************
nohup.out file:
**************************************************************************************
real 0m16.442s
user 0m15.235s
sys 0m0.206s
SUMPARA END
SUMPARA END
LAPW2 - FERMI; weighs written
What manual page do you want?
What manual page do you want?
What manual page do you want?
What manual page do you want?
LAPW2 END
real 0m28.442s
user 0m27.135s
sys 0m0.335s
What manual page do you want?
LAPW2 END
real 0m28.615s
user 0m27.177s
sys 0m0.367s
LAPW2 END
real 0m28.479s
user 0m27.053s
sys 0m0.363s
LAPW2 END
real 0m28.533s
user 0m27.121s
sys 0m0.360s
LAPW2 END
real 0m15.251s
user 0m14.101s
sys 0m0.213s
SUMPARA END
SUMPARA END
CORE END
CORE END
MIXER END
in cycle 10 ETEST: .0047225000000000 CTEST: .9605146
LAPW0 END
What manual page do you want?
What manual page do you want?
What manual page do you want?
What manual page do you want?
LAPW1 END
real 2m34.056s
user 2m32.098s
sys 0m1.343s
LAPW1 END
real 2m34.199s
user 2m32.522s
sys 0m1.257s
LAPW1 END
real 2m33.341s
user 2m31.461s
sys 0m1.460s
LAPW1 END
real 2m33.341s
user 2m31.567s
sys 0m1.366s
What manual page do you want?
LAPW1 END
real 1m17.194s
user 1m16.236s
sys 0m0.669s
What manual page do you want?
What manual page do you want?
What manual page do you want?
What manual page do you want?
LAPW1 END
real 2m32.828s
user 2m31.137s
sys 0m1.306s
LAPW1 END
real 2m32.411s
user 2m30.582s
sys 0m1.414s
LAPW1 END
LAPW1 END
real 2m34.797s
user 2m32.246s
sys 0m1.520s
real 2m32.347s
user 2m30.540s
sys 0m1.329s
What manual page do you want?
LAPW1 END
real 1m16.317s
user 1m15.325s
sys 0m0.715s
PGFIO-F-231/formatted read/unit=5/error on data conversion.
File name = CoCNP.in2 formatted, sequential access record = 3
In source file lapw2_tmp_.F, at line number 164
cp: cannot stat `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory
**************************************************************************************
I find that whenever 'unknown','formatted' appears in the dayfile, the calculation will stop.
Can you tell me why 'unknown','formatted' appears in the dayfile, and how to solve it?
If I use a single computer, it works very well.
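To see which steps are affected, a small script can scan the dayfile for these stray fragments. This is only a rough sketch: the function name find_stray_fragments and both regular expressions are my own assumptions based on the dayfile layout shown above, and are not part of WIEN2k.

```python
import re

# Assumed pattern for a leftover piece of a .def file line,
# e.g. 'unknown','formatted',0 or 'old', 'formatted',0
STRAY = re.compile(r"'(unknown|old)',\s*'(un)?formatted',\s*\d+")

# Assumed pattern for the start of a program step in the dayfile,
# e.g. "> lapw1 -up -p (18:04:24) starting parallel lapw1 ..."
STEP = re.compile(r"^>\s*(\w+)\s*(.*?)\s*\((\d\d:\d\d:\d\d)\)")


def find_stray_fragments(dayfile_text):
    """Return one (step, start_time, line_number) tuple per affected step.

    line_number is the 1-based line of the first stray fragment seen
    after that step started.
    """
    hits = []
    seen = set()
    current = None
    for number, line in enumerate(dayfile_text.splitlines(), start=1):
        match = STEP.match(line)
        if match:
            # Remember which program step the following lines belong to.
            name = (match.group(1) + " " + match.group(2)).strip()
            current = (name, match.group(3))
        if current is not None and current not in seen and STRAY.search(line):
            seen.add(current)
            hits.append((current[0], current[1], number))
    return hits
```

Called on the text of the dayfile, it returns the steps (for example lapw1 -up -p and its start time) whose output contains such a fragment, which can then be compared with the point where lapw2 crashes.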