[Wien] machines file ... again

Tue Oct 5 18:28:04 CEST 2004

Hello all!

I am trying to set up a parallel calculation on a PC cluster (10 
dual-processor machines, each named fisnodeX). When I use k-point 
parallelization and run the program as run_lapw -p, everything is OK.

[griselda at clustersvr si-bulk]$ more .machines
granularity:1
1:fisnode2
1:fisnode3
1:fisnode4
1:fisnode5

[griselda at clustersvr si-bulk]$ testpara_lapw

#####################################################
#                     TESTPARA                      #
#####################################################

Test: LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset

    klist:       8
    machines:    fisnode2 fisnode3 fisnode4 fisnode5
    procs:       4
    weigh(old):  1 1 1 1
    sumw:        4
    granularity: 1
    weigh(new):  2 2 2 2

Distribution of k-point (under ideal conditions)
will be:

1 : fisnode2(2) 2k
2 : fisnode3(2) 2k
3 : fisnode4(2) 2k
4 : fisnode5(2) 2k
[griselda at clustersvr si-bulk]$

Then I tried the fine-grained version of parallelization. My .machines 
file (using one processor on each node) is now:

[griselda at clustersvr si-bulk]$ more .machines
granularity:1
1:fisnode2 fisnode3 fisnode4 fisnode5 fisnode6 fisnode7 fisnode8 fisnode9
lapw0:fisnode1:2

[griselda at clustersvr si-bulk]$ testpara_lapw

#####################################################
#                     TESTPARA                      #
#####################################################

Test: LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset

    klist:       8
    machines:    fisnode2 fisnode3 fisnode4 fisnode5 fisnode6 fisnode7 fisnode8 fisnode9
    procs:       1
    weigh(old):  1
    sumw:        1
    granularity: 1
    weigh(new):  8

Distribution of k-point (under ideal conditions)
will be:

1 : fisnode2(8) 8k
[griselda at clustersvr si-bulk]$

I do not understand what is wrong in the .machines file. Why will just one 
processor be used to calculate the 8 k-points?
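
For reference, the two testpara_lapw runs above are consistent with each 
"weight:host ..." line in .machines counting as a single parallel job, no 
matter how many hosts that line lists. Below is a minimal Python sketch of 
that reading. It is only an assumption inferred from the output shown in 
this post, not WIEN2k's actual code; parse_machines and distribute are 
hypothetical helper names, and remainders in the k-point split are ignored.

```python
def parse_machines(text):
    """Collect (weight, hosts) pairs from .machines-style text.

    Assumption: only lines whose key is an integer weight define
    k-point jobs; granularity: and lapw0: lines are skipped.
    """
    jobs = []
    for line in text.splitlines():
        key, _, rest = line.strip().partition(":")
        hosts = rest.split()
        if key.isdigit() and hosts:
            jobs.append((int(key), hosts))
    return jobs


def distribute(nkpoints, jobs):
    """Split k-points among jobs in proportion to their weights.

    Simplified: uses integer division and reports only the first
    host of each job, as the testpara output above does.
    """
    total = sum(weight for weight, _ in jobs)
    return [(hosts[0], nkpoints * weight // total) for weight, hosts in jobs]


kpoint_file = "granularity:1\n1:fisnode2\n1:fisnode3\n1:fisnode4\n1:fisnode5\n"
fine_file = (
    "granularity:1\n"
    "1:fisnode2 fisnode3 fisnode4 fisnode5 fisnode6 fisnode7 fisnode8 fisnode9\n"
    "lapw0:fisnode1:2\n"
)

print(distribute(8, parse_machines(kpoint_file)))
# [('fisnode2', 2), ('fisnode3', 2), ('fisnode4', 2), ('fisnode5', 2)]
print(distribute(8, parse_machines(fine_file)))
# [('fisnode2', 8)]
```

Under this reading, "procs: 1" in the second run counts jobs rather than 
processors: the single line yields one job that is given all 8 k-points.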

I have read the user's guide several times, but I cannot get the MPI 
version running.

lapw0 works fine with two processors, but lapw1 does not run. The dayfile 
is:

[griselda at clustersvr si-bulk]$ more case.dayfile

Calculating case in /home/griselda/WIEN/case/case
on clustersvr

    start       (Tue Oct  5 12:27:53 EDT 2004) with lapw0 (20/20 to go)
>   lapw0 -p    (12:27:53) starting parallel lapw0 at Tue Oct  5 12:27:53 EDT 2004
-------- .machine1 : 2 processors
fisnode1:2
--------
12.160u 12.530s 0:18.92 130.4%  0+0k 0+0io 11341pf+0w
>   lapw1  -p   (12:28:12) starting parallel lapw1 at Tue Oct  5 12:28:12 EDT 2004
->  starting parallel LAPW1 jobs at Tue Oct  5 12:28:12 EDT 2004
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
**  LAPW1 crashed!
0.150u 0.180s 0:03.27 10.0%     0+0k 0+0io 12563pf+0w

>   stop error

I cannot find any clue about the cause of the error.

I would really appreciate your help.

Griselda.