[Wien] machines file
Peter Blaha
pblaha at zeus.theochem.tuwien.ac.at
Wed Apr 6 20:09:24 CEST 2005
We have had quite a few discussions about MPI-parallel jobs:
a) It does not make much sense to MPI-parallelize on a dual-processor node.
You will not gain much, neither in speed nor in memory.
It only makes "sense" from 4 hosts on, and you will most likely want a
fast network (InfiniBand, Myrinet).
b) Do you have WIEN2k experience? If not, forget the MPI version for the
moment.
c) Do you have MPI and ScaLAPACK installed?
If yes, check your compile.msg again, e.g. in SRC_lapw0.
> Could somebody help me with the .machines file?
>
> We have just compiled WIEN2k for MPI and tested it with TiC.
>
> Our system information is :
> Dual processors AMD 64 Opteron
> OS: Fedora Core 1 x86_64
> Host name: Darwin
> Nodes: Opto0xx; each node has 2 CPUs of the same speed
>
> The Wien2k Version is 25/02/2005
>
> We have the files: lapw1para, lapwsopara, lapwdmpara, lapw2para, lapw0_mpi,
> lapw1_mpi, and lapw2_mpi
>
> For TiC, we tested with 72 k-points.
>
> Our .machines file is:
>
> granularity:1
> 18:opto024:2
> 18:opto025:2
> 36:opto030:2
> lapw0:opto024 opto025
>
> By using paratest we obtained:
>
> Test: LAPW1 in parallel mode (using .machines)
> Granularity set to 1
> Extrafine unset
>
> klist: 72
> machines: opto024 opto025 opto030
> procs: 3
> weigh(old): 18 18 36
> sumw: 72
> granularity: 1
> weigh(new): 18 18 36
>
> Distribution of k-point (under ideal conditions)
> will be:
>
> 1 : opto024(18) 18k
> 2 : opto025(18) 18k
> 3 : opto030(36) 36k
>
> When we ran SCF calculations with this .machines file, we got the following
> error:
>
> cycle 1 (Wed Apr 6 17:34:32 SGT 2005) (20/20 to go)
>
> > lapw0 -p (17:34:32) starting parallel lapw0 at Wed Apr 6 17:34:32 SGT 2005
> -------- .machine1 : 2 processors
> opto024
> opto025
> --------
> 0.020u 0.010s 0:00.06 50.0% 0+0k 0+0io 2588pf+0w
>
> > stop error
>
> When I checked STDOUT, I found:
>
> 1[1]: No match.
>
> I checked the files and saw that case.vsp and case.vns were not generated,
> and that the files case.clmup/dn were empty.
>
> But when I use the following .machines file
>
> granularity:1
> 18:opto024:2
> 18:opto025:2
> 36:opto030:2
>
> LAPW0 ran well, but LAPW1 crashed.
>
> The dayfile shows:
>
> cycle 1 (Wed Apr 6 18:05:39 SGT 2005) (20/20 to go)
>
> > lapw0 -p (18:05:39) starting parallel lapw0 at Wed Apr 6 18:05:39 SGT 2005
> --------
> running lapw0 in single mode
> 1.960u 0.030s 0:02.13 93.4% 0+0k 0+0io 2502pf+0w
> > lapw1 -p (18:05:41) starting parallel lapw1 at Wed Apr 6 18:05:41 SGT 2005
> -> starting parallel LAPW1 jobs at Wed Apr 6 18:05:41 SGT 2005
> running LAPW1 in parallel mode (using .machines)
> 3 number_of_parallel_jobs
> ** LAPW1 crashed!
> 0.030u 0.040s 0:05.34 1.3% 0+0k 0+0io 14149pf+0w
>
> > stop error
>
> STDOUT
>
> STOP LAPW0 END
> STOP
> bash: line 1: lapw1: command not found
>
> real 0m0.003s
> user 0m0.000s
> sys 0m0.000s
> bash: line 1: lapw1: command not found
>
> real 0m0.002s
> user 0m0.000s
> sys 0m0.000s
> bash: lapw1: command not found
>
> real 0m0.002s
> user 0m0.000s
> sys 0m0.001s
> cat: No match.
>
> We really do not understand the reason for these errors. But we noticed that if
> we run in remote mode, i.e., if we use the following .machines file for host Darwin
>
> granularity:1
> 18:darwin
> 18:darwin
> 36:darwin
>
> then everything was ok.
>
> Could someone help us understand these errors and how to overcome them? Many
> thanks.
>
> Best regards,
> Khuong
>
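As a side note on the paratest output quoted above: the 18/18/36 split of the
72 k-points is simply proportional to the weights in the .machines file. A
minimal Python sketch of that arithmetic follows. It is not the code used by
lapw1para; the names parse_machines and distribute are made up for
illustration, the trailing ":2" field and the granularity setting are ignored,
and any remainder from the integer division is dropped rather than
redistributed.

```python
# The first .machines file from the quoted message, copied verbatim.
MACHINES = """\
granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
lapw0:opto024 opto025
"""


def parse_machines(text):
    """Return (weight, host) pairs from the k-point lines (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        fields = line.strip().split(":")
        # k-point lines start with an integer weight; this skips the
        # "granularity:" and "lapw0:" lines.
        if len(fields) >= 2 and fields[0].isdigit():
            entries.append((int(fields[0]), fields[1]))
    return entries


def distribute(nkpt, entries):
    """Idealized share of nkpt k-points per host, proportional to its weight."""
    total = sum(weight for weight, _ in entries)
    # Assumption: plain integer division, remainder ignored.
    return [(host, nkpt * weight // total) for weight, host in entries]


shares = distribute(72, parse_machines(MACHINES))
for host, nk in shares:
    print(f"{host}: {nk}k")
```

For the file above this gives 18, 18 and 36 k-points for opto024, opto025 and
opto030, matching the "under ideal conditions" listing from paratest.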
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------