[Wien] machines file

Khuong P. Ong ongpk at ihpc.a-star.edu.sg
Wed Apr 6 12:27:52 CEST 2005


Dear wien users,

  Could somebody help me with the .machines file?

  We have just compiled Wien2k with MPI and tested it with TiC.

  Our system information is:
Dual processors AMD 64 Opteron
OS: Fedora Core 1 x86_64
Host name: Darwin
Nodes: Opto0xx; each node has 2 CPUs of the same speed

The Wien2k Version is 25/02/2005

We have the files lapw1para, lapwsopara, lapwdmpara, lapw2para, lapw0_mpi,
lapw1_mpi, and lapw2_mpi.

For TiC, we tested with 72 k-points.

Our .machines file is:

granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
lapw0:opto024 opto025
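
To make the layout of the file above explicit, here is a minimal Python sketch that splits it into its three kinds of lines. It is not part of Wien2k; the names MachinesFile and parse_machines are made up for illustration, and reading the last field of a "18:opto024:2" line as a per-host process count is an assumption based on each node having 2 CPUs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MachinesFile:
    # Hypothetical container, not a Wien2k data structure.
    granularity: Optional[int] = None
    # One (weight, host, count) tuple per k-point line, in file order.
    kpoint_hosts: List[Tuple[int, str, int]] = field(default_factory=list)
    # Hosts listed on the "lapw0:" line.
    lapw0_hosts: List[str] = field(default_factory=list)


def parse_machines(text: str) -> MachinesFile:
    """Split .machines-style text into granularity, k-point and lapw0 lines."""
    result = MachinesFile()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, rest = line.partition(":")
        if key == "granularity":
            result.granularity = int(rest)
        elif key == "lapw0":
            result.lapw0_hosts = rest.split()
        elif key.isdigit():
            # "18:opto024:2" -> weight 18, host opto024, count 2 (assumed meaning).
            host, _, count = rest.partition(":")
            result.kpoint_hosts.append((int(key), host, int(count) if count else 1))
        # Any other line is ignored in this sketch.
    return result


SAMPLE = """granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
lapw0:opto024 opto025
"""

parsed = parse_machines(SAMPLE)
print(parsed.granularity, parsed.kpoint_hosts, parsed.lapw0_hosts)
```

For the file above this gives three k-point entries (opto024, opto025, opto030) and two lapw0 hosts (opto024, opto025).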

Using paratest, we obtained:

Test: LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset

     klist:       72
     machines:    opto024 opto025 opto030
     procs:       3
     weigh(old):  18 18 36
     sumw:        72
     granularity: 1
     weigh(new):  18 18 36

Distribution of k-point (under ideal conditions)
will be:

1 : opto024(18) 18k
2 : opto025(18) 18k
3 : opto030(36) 36k
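
The distribution reported above is what a plain proportional split gives: 72 k-points shared in the ratio 18:18:36. The following Python sketch reproduces that arithmetic with a largest-remainder split. It is only an illustration, not the algorithm paratest or lapw1para actually uses, and it ignores the granularity setting; the function name distribute_kpoints is hypothetical.

```python
from typing import List


def distribute_kpoints(n_kpoints: int, weights: List[int]) -> List[int]:
    """Split n_kpoints across hosts in proportion to their weights."""
    total = sum(weights)
    # Integer part of each host's proportional share.
    shares = [n_kpoints * w // total for w in weights]
    # k-points left over after rounding down.
    leftovers = n_kpoints - sum(shares)
    # Give the leftovers to the hosts with the largest fractional remainders.
    order = sorted(
        range(len(weights)),
        key=lambda i: (n_kpoints * weights[i]) % total,
        reverse=True,
    )
    for i in order[:leftovers]:
        shares[i] += 1
    return shares


print(distribute_kpoints(72, [18, 18, 36]))
```

With 72 k-points and weights 18, 18, 36 the weights sum to 72, so each host simply receives as many k-points as its weight.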

  When we ran the SCF calculation with this .machines file, we got the
following error:

cycle 1         (Wed Apr  6 17:34:32 SGT 2005)  (20/20 to go)

 >   lapw0 -p    (17:34:32) starting parallel lapw0 at Wed Apr  6 17:34:32 SGT 2005
-------- .machine1 : 2 processors
opto024
opto025
--------
0.020u 0.010s 0:00.06 50.0%     0+0k 0+0io 2588pf+0w

 >   stop error

When I checked STDOUT, I found:

1[1]: No match.

I checked the output files and saw that case.vsp and case.vns were not
generated, and that the case.clmup/dn files were empty.

But when I use the following .machines file:

granularity:1
18:opto024:2
18:opto025:2
36:opto030:2

LAPW0 ran well, but LAPW1 crashed.

The dayfile shows:

       cycle 1   (Wed Apr  6 18:05:39 SGT 2005)  (20/20 to go)

 >   lapw0 -p    (18:05:39) starting parallel lapw0 at Wed Apr  6 18:05:39 SGT 2005
--------
running lapw0 in single mode
1.960u 0.030s 0:02.13 93.4%     0+0k 0+0io 2502pf+0w
 >   lapw1  -p   (18:05:41) starting parallel lapw1 at Wed Apr  6 18:05:41 SGT 2005
->  starting parallel LAPW1 jobs at Wed Apr  6 18:05:41 SGT 2005
running LAPW1 in parallel mode (using .machines)
3 number_of_parallel_jobs
**  LAPW1 crashed!
0.030u 0.040s 0:05.34 1.3%      0+0k 0+0io 14149pf+0w

 >   stop error

STDOUT

STOP  LAPW0 END
  STOP
bash: line 1: lapw1: command not found

real    0m0.003s
user    0m0.000s
sys     0m0.000s
bash: line 1: lapw1: command not found

real    0m0.002s
user    0m0.000s
sys     0m0.000s
bash: lapw1: command not found

real    0m0.002s
user    0m0.000s
sys     0m0.001s
cat: No match.
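
The bash messages above say that the shell which tried to start lapw1 could not find it on its PATH. A minimal Python sketch of how one could check this on each node is given below. It assumes the nodes are reached with ssh and that the remote shell is bash; the function names build_check_commands and check_hosts are hypothetical. This is a diagnostic idea only, not a Wien2k tool.

```python
import shlex
import subprocess
from typing import Dict, List, Optional


def build_check_commands(
    hosts: List[str], program: str = "lapw1", remote_shell: str = "ssh"
) -> List[List[str]]:
    """Build one 'ssh HOST command -v PROGRAM' command per host."""
    # remote_shell="ssh" is an assumption; replace it if the nodes are reached differently.
    return [[remote_shell, host, "command -v " + shlex.quote(program)] for host in hosts]


def check_hosts(
    hosts: List[str],
    program: str = "lapw1",
    remote_shell: str = "ssh",
    timeout: float = 10.0,
) -> Dict[str, Optional[str]]:
    """Return the path of program on each host, or None if it was not found."""
    results: Dict[str, Optional[str]] = {}
    for cmd in build_check_commands(hosts, program, remote_shell):
        host = cmd[1]
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            results[host] = proc.stdout.strip() if proc.returncode == 0 else None
        except (OSError, subprocess.TimeoutExpired):
            # Remote shell missing, host unreachable, or no answer in time.
            results[host] = None
    return results


# Only prints the commands; nothing is executed here.
for command in build_check_commands(["opto024", "opto025", "opto030"]):
    print(" ".join(command))
```

Calling check_hosts(["opto024", "opto025", "opto030"]) would run the commands; a None entry for a node means that a non-interactive shell there does not see lapw1.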

We really do not understand the reason for these errors. But we noticed that
if we run on the host Darwin only, i.e., with the following .machines file,

granularity:1
18:darwin
18:darwin
36:darwin

then everything was ok.

Could someone help us understand these errors and how to overcome them?
Many thanks.

Best regards,
Khuong