[Wien] machines file

Jorissen Kevin Kevin.Jorissen at ua.ac.be
Wed Apr 6 20:11:59 CEST 2005


Hi there,
here are just a few thoughts ...
 
 
 
 
0/  Please provide us with the following information:
* your MPI software
* your Fortran compiler and compilation settings
* the libraries you are using
 
 
****** CHECKING YOUR CONFIGURATION ********
 
 
1/  Could you try the following machines file:
granularity:1
18:opto024
18:opto025
36:opto030

This is what I would call 'running remotely (without MPI)'.  It would be good to know whether this works.  (The 'lapw1: command not found' lines in your STDOUT make me quite suspicious: lapw1 is apparently not being found on the nodes!)
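
To make the weights in such a file concrete, here is a minimal Python sketch that reads the weight:host lines and splits a number of k-points in proportion to the weights. It is only an illustration: parse_machines and ideal_split are hypothetical helper names, not part of WIEN2k, and the proportional split is an assumption modelled on the 'ideal conditions' table that paratest printed for you, not the actual logic of the parallel scripts.

```python
# Example file from point 1/ (no ":2" suffix on the hosts).
MACHINES = """\
granularity:1
18:opto024
18:opto025
36:opto030
"""


def parse_machines(text):
    """Return (granularity, [(weight, host), ...]) from machines-style text.

    Only 'granularity:N' and 'weight:host[:n]' lines are interpreted here;
    anything else (for example a 'lapw0:' line) is ignored in this sketch.
    """
    granularity = None
    jobs = []
    for raw in text.splitlines():
        key, _, rest = raw.strip().partition(":")
        if key == "granularity":
            granularity = int(rest)
        elif key.isdigit() and rest:
            host = rest.split(":")[0].strip()
            jobs.append((int(key), host))
    return granularity, jobs


def ideal_split(n_kpoints, jobs):
    """Split n_kpoints over the jobs in proportion to their weights (assumed rule)."""
    total = sum(weight for weight, _ in jobs)
    shares = [n_kpoints * weight // total for weight, _ in jobs]
    # Hand out any k-points lost to integer rounding, one per job in order.
    for i in range(n_kpoints - sum(shares)):
        shares[i % len(shares)] += 1
    return [(host, share) for (_, host), share in zip(jobs, shares)]


granularity, jobs = parse_machines(MACHINES)
for host, share in ideal_split(72, jobs):
    print(f"{host}: {share} k-points")
```

For 72 k-points and weights 18, 18 and 36 this gives 18, 18 and 36 k-points, the same numbers paratest reported.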
 
2/  First you should check that the MPI commands are passed on correctly, by looking at the parallel scripts (lapw1para etc.).  The reason I say this: with your second machines file, the program should not be looking for the executable lapw1, but for lapw1_mpi!  Adding the -x switch on the first line of lapw1para gives you extra information.
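
A quick way to test the 'command not found' suspicion is to ask each node whether it can see the executables from a non-interactive shell. The sketch below does that in Python. It rests on assumptions about your cluster that I cannot verify: that the nodes are reached with passwordless ssh (yours may use rsh instead) and that the `which` command exists on them. remote_has and missing_programs are hypothetical helper names, not WIEN2k tools.

```python
import subprocess


def remote_has(host, program, runner=subprocess.run):
    """Return True if `program` is on the PATH of a non-interactive shell on `host`.

    Assumes passwordless ssh to `host`; `runner` can be replaced for testing.
    """
    result = runner(["ssh", host, "which", program],
                    capture_output=True, text=True)
    return result.returncode == 0


def missing_programs(hosts, programs, runner=subprocess.run):
    """Return the (host, program) pairs for which the lookup failed."""
    missing = []
    for host in hosts:
        for program in programs:
            if not remote_has(host, program, runner=runner):
                missing.append((host, program))
    return missing
```

Calling missing_programs(["opto024", "opto025", "opto030"], ["lapw1", "lapw1_mpi"]) would list the nodes on which the lookup fails. If lapw1 shows up as missing everywhere, the problem is more likely the PATH seen by remote shells than the machines file itself.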
 
 
******** CHECKING YOUR COMPILATION *******
 
3/  Probably your MPI executables don't work (while the normal executables do), which may be why lapw0_mpi crashes.
 
As far as I understand, the MPI version is much less well established than the normal version.
Unless you are lucky enough that somebody runs it on exactly the same configuration as you, you may well have to dig into the software yourself.
The Fortran itself might need checking ...  Some people on the mailing list may be able to help you with that (not me, though).
 
******** AND BY THE WAY ************
 
 
4/  If I remember correctly, P. Blaha pointed out in the past that you only gain from using MPI if you can use more than 2 processors, so your setup might not be worth the trouble.
 
 
5/  Have you searched the mailing list archives?  MPI-related issues come up every now and then.
 
 
 
good luck,
 
 
 
Kevin Jorissen
 
EMAT - Electron Microscopy for Materials Science   (http://webhost.ua.ac.be/emat/)
Dept. of Physics
 
UA - Universiteit Antwerpen
Groenenborgerlaan 171
B-2020 Antwerpen
Belgium
 
tel  +32 3 2653249
fax + 32 3 2653257
e-mail kevin.jorissen at ua.ac.be
 

________________________________

From: wien-admin at zeus.theochem.tuwien.ac.at on behalf of Khuong P. Ong
Sent: Wed 6-4-2005 12:27
To: wien at zeus.theochem.tuwien.ac.at
Subject: [Wien] machines file


Dear wien users,

 Could somebody help me with the .machines file?

 We have just compiled Wien2k for MPI. We tested this with TiC.
 
 Our system information is : 
Dual processors AMD 64 Opteron
OS: Fedora Core 1 x86_64
Host name: Darwin
Nodes: Opto0xx; each node has 2 CPUs with the same speed

The Wien2k Version is 25/02/2005

We have the files lapw1para, lapwsopara, lapwdmpara, lapw2para, lapw0_mpi, lapw1_mpi, and lapw2_mpi.

For TiC, we tested with 72 k-points.

our machines file is

granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
lapw0:opto024 opto025

By using  paratest we obtained:


Test: LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset

    klist:       72
    machines:    opto024 opto025 opto030
    procs:       3
    weigh(old):  18 18 36
    sumw:        72
    granularity: 1
    weigh(new):  18 18 36

Distribution of k-point (under ideal conditions)
will be:

1 : opto024(18) 18k 
2 : opto025(18) 18k 
3 : opto030(36) 36k 


When we ran an SCF calculation with this machines file, we got the following error:


    cycle 1     (Wed Apr  6 17:34:32 SGT 2005)  (20/20 to go)

>   lapw0 -p    (17:34:32) starting parallel lapw0 at Wed Apr  6 17:34:32 SGT 2005
-------- .machine1 : 2 processors
opto024
opto025
--------
0.020u 0.010s 0:00.06 50.0%     0+0k 0+0io 2588pf+0w

>   stop error

When I checked STDOUT, I found:


1[1]: No match.


I checked the files and saw that case.vsp and case.vns were not generated, and that case.clmup/dn were empty.

But when I use the following .machines file

granularity:1
18:opto024:2
18:opto025:2
36:opto030:2

LAPW0 ran well, but LAPW1 crashed.

The dayfile shows:


    cycle 1     (Wed Apr  6 18:05:39 SGT 2005)  (20/20 to go)

>   lapw0 -p    (18:05:39) starting parallel lapw0 at Wed Apr  6 18:05:39 SGT 2005
--------
running lapw0 in single mode
1.960u 0.030s 0:02.13 93.4%     0+0k 0+0io 2502pf+0w
>   lapw1  -p   (18:05:41) starting parallel lapw1 at Wed Apr  6 18:05:41 SGT 2005
->  starting parallel LAPW1 jobs at Wed Apr  6 18:05:41 SGT 2005
running LAPW1 in parallel mode (using .machines)
3 number_of_parallel_jobs
**  LAPW1 crashed!
0.030u 0.040s 0:05.34 1.3%      0+0k 0+0io 14149pf+0w

>   stop error


STDOUT


STOP  LAPW0 END
 STOP
bash: line 1: lapw1: command not found

real    0m0.003s
user    0m0.000s
sys     0m0.000s
bash: line 1: lapw1: command not found

real    0m0.002s
user    0m0.000s
sys     0m0.000s
bash: lapw1: command not found

real    0m0.002s
user    0m0.000s
sys     0m0.001s
cat: No match.


We really do not understand the reason for these errors. But we noticed that if we run in remote mode, i.e., if we use the following machines file for the host Darwin,

granularity:1
18:darwin
18:darwin
36:darwin

then everything was OK.

Could someone help us understand these errors and how to overcome them? Many thanks.

Best regards,
Khuong
