[Wien] machines file
Khuong P. Ong
ongpk at ihpc.a-star.edu.sg
Wed Apr 6 12:27:52 CEST 2005
Dear WIEN users,
Could somebody help me with the .machines file?
We have just compiled WIEN2k with MPI and tested it with TiC.
Our system information is:
Dual-processor AMD 64 Opteron
OS: Fedora Core 1 x86_64
Host name: Darwin
Nodes: Opto0xx; each node has 2 CPUs of the same speed
The WIEN2k version is 25/02/2005
We have the files: lapw1para, lapwsopara, lapwdmpara, lapw2para, lapw0_mpi,
lapw1_mpi, and lapw2_mpi
For TiC, we tested with 72 k-points.
Our .machines file is:
granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
lapw0:opto024 opto025
By using paratest we obtained:
Test: LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset
klist: 72
machines: opto024 opto025 opto030
procs: 3
weigh(old): 18 18 36
sumw: 72
granularity: 1
weigh(new): 18 18 36
Distribution of k-point (under ideal conditions)
will be:
1 : opto024(18) 18k
2 : opto025(18) 18k
3 : opto030(36) 36k
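The distribution reported by paratest can be checked by hand. The following is a minimal Python sketch, not the logic of the WIEN2k scripts themselves: it assumes each k-point line has the form weight:host[:nproc], as in the file above, and simply splits the k-points in proportion to the weights. The function names parse_machines and distribute_kpoints are hypothetical.

```python
# Text of the .machines file quoted in this post.
MACHINES = """\
granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
lapw0:opto024 opto025
"""

def parse_machines(text):
    """Return (jobs, lapw0_hosts, granularity) from .machines text.

    Assumes the layout seen in this post: 'weight:host[:nproc]' lines,
    one 'lapw0:host host ...' line and one 'granularity:N' line.
    """
    jobs, lapw0_hosts, granularity = [], [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition(":")
        if key == "granularity":
            granularity = int(rest)
        elif key == "lapw0":
            lapw0_hosts = rest.split()
        elif key.isdigit():
            host = rest.split(":")[0]
            jobs.append((host, int(key)))
        # Any other keyword (e.g. extrafine) is ignored in this sketch.
    return jobs, lapw0_hosts, granularity

def distribute_kpoints(jobs, nk):
    """Split nk k-points in proportion to the weights (largest remainder)."""
    total = sum(w for _, w in jobs)
    shares = [nk * w // total for _, w in jobs]
    remainders = [nk * w % total for _, w in jobs]
    leftover = nk - sum(shares)
    # Hand out k-points lost to integer division, largest remainder first.
    for i in sorted(range(len(jobs)), key=lambda i: -remainders[i])[:leftover]:
        shares[i] += 1
    return [(host, share) for (host, _), share in zip(jobs, shares)]

jobs, lapw0_hosts, granularity = parse_machines(MACHINES)
distribution = distribute_kpoints(jobs, 72)
for host, nk in distribution:
    print(f"{host}: {nk}k")
```

For the 72 k-points and weights 18, 18, 36 used here, the proportional split agrees with the paratest output (18k, 18k, 36k).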
With this .machines file, the SCF calculation failed with the following
error:
cycle 1 (Wed Apr 6 17:34:32 SGT 2005) (20/20 to go)
> lapw0 -p (17:34:32) starting parallel lapw0 at Wed Apr 6 17:34:32 SGT 2005
-------- .machine1 : 2 processors
opto024
opto025
--------
0.020u 0.010s 0:00.06 50.0% 0+0k 0+0io 2588pf+0w
> stop error
When I checked STDOUT, I found:
1[1]: No match.
I checked the files and saw that case.vsp and case.vns were not
generated, and that the files case.clmup/dn were empty.
But when I use the following .machines file (without the lapw0 line)
granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
LAPW0 ran well, but LAPW1 crashed.
The dayfile shows:
cycle 1 (Wed Apr 6 18:05:39 SGT 2005) (20/20 to go)
> lapw0 -p (18:05:39) starting parallel lapw0 at Wed Apr 6 18:05:39 SGT 2005
--------
running lapw0 in single mode
1.960u 0.030s 0:02.13 93.4% 0+0k 0+0io 2502pf+0w
> lapw1 -p (18:05:41) starting parallel lapw1 at Wed Apr 6 18:05:41 SGT 2005
-> starting parallel LAPW1 jobs at Wed Apr 6 18:05:41 SGT 2005
running LAPW1 in parallel mode (using .machines)
3 number_of_parallel_jobs
** LAPW1 crashed!
0.030u 0.040s 0:05.34 1.3% 0+0k 0+0io 14149pf+0w
> stop error
STDOUT
STOP LAPW0 END
STOP
bash: line 1: lapw1: command not found
real 0m0.003s
user 0m0.000s
sys 0m0.000s
bash: line 1: lapw1: command not found
real 0m0.002s
user 0m0.000s
sys 0m0.000s
bash: lapw1: command not found
real 0m0.002s
user 0m0.000s
sys 0m0.001s
cat: No match.
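The bash lines in this STDOUT mean that the shell which tried to start lapw1 did not find an executable of that name on its PATH. As a small aid for reading such output, here is a hedged Python sketch that counts these messages per command. The function name missing_commands is hypothetical, and the sample text is copied from the output above with the timing lines left out.

```python
import re
from collections import Counter

# Excerpt of the STDOUT quoted in this post (timing lines omitted).
STDOUT = """\
STOP LAPW0 END
STOP
bash: line 1: lapw1: command not found
bash: line 1: lapw1: command not found
bash: lapw1: command not found
cat: No match.
"""

# Matches 'bash: lapw1: command not found', with an optional 'line N:' part.
PATTERN = re.compile(r"^bash: (?:line \d+: )?(\S+): command not found$")

def missing_commands(text):
    """Count 'command not found' messages per command name."""
    counts = Counter()
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

print(dict(missing_commands(STDOUT)))
```

The three matches for lapw1 are consistent with the three parallel jobs reported in the dayfile.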
We really do not understand the reason for these errors. However, we noticed
that if we run in remote mode, i.e. with a .machines file for the host Darwin:
granularity:1
18:darwin
18:darwin
36:darwin
then everything works.
Could someone help us understand these errors and how to overcome them?
Many thanks.
Best regards,
Khuong