[Wien] machines file ... again
Griselda Garcia
ggarcia at fis.puc.cl
Tue Oct 5 18:28:04 CEST 2004
Hello all!
I am trying to set up a parallel calculation on a PC cluster (10 dual-processor
machines, each one called fisnodeX). When I use k-point
parallelization and run the program as run_lapw -p, everything is OK.
[griselda at clustersvr si-bulk]$ more .machines
granularity:1
1:fisnode2
1:fisnode3
1:fisnode4
1:fisnode5
[griselda at clustersvr si-bulk]$ testpara_lapw
#####################################################
# TESTPARA #
#####################################################
Test: LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset
klist: 8
machines: fisnode2 fisnode3 fisnode4 fisnode5
procs: 4
weigh(old): 1 1 1 1
sumw: 4
granularity: 1
weigh(new): 2 2 2 2
Distribution of k-point (under ideal conditions)
will be:
1 : fisnode2(2) 2k
2 : fisnode3(2) 2k
3 : fisnode4(2) 2k
4 : fisnode5(2) 2k
[griselda at clustersvr si-bulk]$
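The numbers in this report can be reproduced with a little arithmetic. The sketch below is only an illustration inferred from the output above (weigh(old), sumw, klist, weigh(new)); it is not the actual testpara_lapw logic. The function name distribute_kpoints is hypothetical, and the sketch ignores granularity and assumes the k-point count divides evenly by the weight sum.

```python
def distribute_kpoints(klist, weights):
    """Scale per-job weights so that they add up to the number of k-points.

    Assumption: klist is an exact multiple of sum(weights), as in the
    testpara_lapw report above (8 k-points, weights 1 1 1 1).
    """
    total = sum(weights)  # "sumw" in the report
    return [w * klist // total for w in weights]  # "weigh(new)"


if __name__ == "__main__":
    hosts = ["fisnode2", "fisnode3", "fisnode4", "fisnode5"]
    shares = distribute_kpoints(8, [1, 1, 1, 1])
    for index, (host, share) in enumerate(zip(hosts, shares), start=1):
        print(f"{index} : {host}({share}) {share}k")
```

With four jobs of weight 1 and klist 8, each job gets 2 k-points, which matches the distribution printed above.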
Then I tried to use the fine-grained version of parallelization. Now my
machines file (using one processor of each node) is:
[griselda at clustersvr si-bulk]$ more .machines
granularity:1
1:fisnode2 fisnode3 fisnode4 fisnode5 fisnode6 fisnode7 fisnode8 fisnode9
lapw0:fisnode1:2
[griselda at clustersvr si-bulk]$ testpara_lapw
#####################################################
# TESTPARA #
#####################################################
Test: LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset
klist: 8
machines: fisnode2 fisnode3 fisnode4 fisnode5 fisnode6 fisnode7 fisnode8 fisnode9
procs: 1
weigh(old): 1
sumw: 1
granularity: 1
weigh(new): 8
Distribution of k-point (under ideal conditions)
will be:
1 : fisnode2(8) 8k
[griselda at clustersvr si-bulk]$
I do not see what is wrong in the machines file. Why will just one processor
be used to calculate the 8 k-points?
I have read the user's guide several times, but I cannot get the MPI
version running.
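One reading that is consistent with both testpara_lapw reports is that each "weight:host ..." line in .machines counts as a single k-point job, however many hosts it names. The sketch below only illustrates that reading; it is an assumption drawn from the two outputs in this message, not WIEN2k's own parser, and parse_machines is a hypothetical helper.

```python
def parse_machines(text):
    """Split a .machines file into k-point job lines and keyword lines.

    Assumption: a line whose first field is a number ("1:host ...") is one
    k-point job; anything else ("granularity:1", "lapw0:...") is a keyword.
    """
    jobs = []      # list of (weight, [hosts]) tuples, one per job line
    keywords = {}  # e.g. {"granularity": "1", "lapw0": "fisnode1:2"}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition(":")
        if key.isdigit():
            jobs.append((int(key), rest.split()))
        else:
            keywords[key] = rest
    return jobs, keywords


FIRST = """granularity:1
1:fisnode2
1:fisnode3
1:fisnode4
1:fisnode5
"""

SECOND = """granularity:1
1:fisnode2 fisnode3 fisnode4 fisnode5 fisnode6 fisnode7 fisnode8 fisnode9
lapw0:fisnode1:2
"""

if __name__ == "__main__":
    for name, text in (("first", FIRST), ("second", SECOND)):
        jobs, _ = parse_machines(text)
        print(name, "file:", len(jobs), "job line(s)")
```

Under this reading the first file has four job lines and the second has one, which agrees with the "procs: 4" and "procs: 1" lines that testpara_lapw printed.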
lapw0 works fine with two processors, but lapw1 does not run. The dayfile
is:
[griselda at clustersvr si-bulk]$ more case.dayfile
Calculating case in /home/griselda/WIEN/case/case
on clustersvr
start (Tue Oct 5 12:27:53 EDT 2004) with lapw0 (20/20 to go)
> lapw0 -p (12:27:53) starting parallel lapw0 at Tue Oct 5 12:27:53 EDT 2004
-------- .machine1 : 2 processors
fisnode1:2
--------
12.160u 12.530s 0:18.92 130.4% 0+0k 0+0io 11341pf+0w
> lapw1 -p (12:28:12) starting parallel lapw1 at Tue Oct 5 12:28:12 EDT 2004
-> starting parallel LAPW1 jobs at Tue Oct 5 12:28:12 EDT 2004
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
** LAPW1 crashed!
0.150u 0.180s 0:03.27 10.0% 0+0k 0+0io 12563pf+0w
> stop error
I cannot find any clue about the errors.
I would really appreciate your help.
Griselda.