[Wien] machines file
Jorissen Kevin
Kevin.Jorissen at ua.ac.be
Wed Apr 6 20:11:59 CEST 2005
Hi there,
here are just a few thoughts ...
0/ Please provide us with the following information:
* your MPI software
* your Fortran compiler and compilation settings
* the libraries you are using
****** CHECKING YOUR CONFIGURATION ********
1/ Could you try the following machines file:
granularity:1
18:opto024
18:opto025
36:opto030
This is what I would call 'running remotely (without MPI)'. It would be nice to know whether this works. (I am quite suspicious about lapw1 not being found on a node!)
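One way to test that suspicion is to ask each node, through a non-interactive remote shell, whether it can resolve lapw1 at all. Below is a minimal Python sketch. It assumes passwordless ssh to the nodes (an assumption; the thread does not say which remote shell is configured), and the helper names build_check_command and check_hosts are hypothetical.

```python
import subprocess

# Node names taken from the machines file above.
HOSTS = ["opto024", "opto025", "opto030"]


def build_check_command(host, executable="lapw1"):
    """Return the argv for a non-interactive remote lookup of `executable`.

    Assumes ssh is the remote shell (hypothetical; adapt if rsh is used).
    """
    return ["ssh", host, "which " + executable]


def check_hosts(hosts, executable="lapw1", run=subprocess.run):
    """Map each host to True if its remote shell can resolve `executable`."""
    results = {}
    for host in hosts:
        proc = run(build_check_command(host, executable),
                   capture_output=True, text=True)
        # `which` exits with a non-zero status when nothing is found.
        results[host] = (proc.returncode == 0)
    return results
```

Calling check_hosts(HOSTS) on the front-end node returns a dictionary such as {"opto024": True, ...}. A False entry would suggest that the PATH seen by non-interactive shells on that node does not include the WIEN2k directory.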
2/ First, check that the MPI commands are passed on correctly by looking at the parallel scripts (lapw1para etc.). The reason I say this: with your second machines file, the program should not be looking for the executable lapw1, but for lapw1_mpi! Adding the -x switch to the first line of lapw1para gives you extra information.
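To make the lapw1 versus lapw1_mpi point concrete, here is a minimal Python sketch of the rule as I understand it: a k-point entry that requests more than one processor (such as 18:opto024:2) should start lapw1_mpi, while a plain entry (such as 18:opto024) should start lapw1. The function name and the parsing rules are my own simplification, not the actual logic of lapw1para.

```python
def expected_lapw1_executable(line):
    """Guess which executable a k-point line of a machines file should start.

    Simplified assumption: 'weight:host' means serial lapw1, while
    'weight:host:n' with n > 1 (or several hosts on one line) means lapw1_mpi.
    Returns None for lines that are not k-point entries.
    """
    line = line.strip()
    if not line:
        return None
    head, _, rest = line.partition(":")
    if not head.isdigit():
        # Lines such as 'granularity:1' or 'lapw0:...' are not k-point entries.
        return None
    total_procs = 0
    for entry in rest.split():
        _host, _, count = entry.partition(":")
        total_procs += int(count) if count else 1
    return "lapw1_mpi" if total_procs > 1 else "lapw1"


for entry in ["granularity:1", "18:opto024", "18:opto024:2"]:
    print(entry, "->", expected_lapw1_executable(entry))
```

If the trace produced by the -x switch shows lapw1 being started for an entry that this rule maps to lapw1_mpi, the problem would seem to be in how the machines file is interpreted rather than in the MPI executables themselves.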
******** CHECKING YOUR COMPILATION *******
3/ Probably:
Your MPI executables don't work (while the normal executables do), which may be why lapw0_mpi crashes.
As far as I understand, the MPI version is much less 'well established' than the normal version.
Unless you are lucky enough that somebody uses it on exactly the same configuration as yours, you may well have to dig into the software yourself.
The Fortran itself might need checking. Some people on the mailing list may be able to help you with that (not me, though).
******** AND BY THE WAY ************
4/ If I remember correctly, P. Blaha pointed out in the past that you only gain from using MPI if you can use more than 2 processors, so your setup might not be worth the trouble.
5/ Have you searched the mailing list archives? MPI-related issues come up every now and then.
good luck,
Kevin Jorissen
EMAT - Electron Microscopy for Materials Science (http://webhost.ua.ac.be/emat/)
Dept. of Physics
UA - Universiteit Antwerpen
Groenenborgerlaan 171
B-2020 Antwerpen
Belgium
tel +32 3 2653249
fax + 32 3 2653257
e-mail kevin.jorissen at ua.ac.be
________________________________
From: wien-admin at zeus.theochem.tuwien.ac.at on behalf of Khuong P. Ong
Sent: Wed 6-4-2005 12:27
To: wien at zeus.theochem.tuwien.ac.at
Subject: [Wien] machines file
Dear wien users,
Could somebody help me with the .machines file?
We have just compiled Wien2k for MPI and tested it with TiC.
Our system information is :
Dual processors AMD 64 Opteron
OS: Fedora Core 1 x86_64
Host name: Darwin
Nodes: Opto0xx; each node has 2 CPUs of the same speed
The Wien2k Version is 25/02/2005
We have the files lapw1para, lapwsopara, lapwdmpara, lapw2para, lapw0_mpi, lapw1_mpi, and lapw2_mpi.
For TiC, we tested with 72 k-points.
Our machines file is:
granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
lapw0:opto024 opto025
By using paratest we obtained:
Test: LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine unset
klist: 72
machines: opto024 opto025 opto030
procs: 3
weigh(old): 18 18 36
sumw: 72
granularity: 1
weigh(new): 18 18 36
Distribution of k-point (under ideal conditions)
will be:
1 : opto024(18) 18k
2 : opto025(18) 18k
3 : opto030(36) 36k
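The distribution reported by paratest matches a split of the k-points in proportion to the weights. A minimal Python sketch of that arithmetic follows. It is not the actual WIEN2k algorithm, which also takes the granularity and extrafine settings into account, and split_kpoints is a hypothetical name.

```python
def split_kpoints(n_kpoints, weights):
    """Split n_kpoints among jobs in proportion to integer weights.

    K-points left over after rounding down are handed out one by one,
    starting with the first job.
    """
    total = sum(weights)
    shares = [n_kpoints * w // total for w in weights]
    leftover = n_kpoints - sum(shares)
    for i in range(leftover):
        shares[i % len(shares)] += 1
    return shares


print(split_kpoints(72, [18, 18, 36]))  # [18, 18, 36]
```

With 72 k-points and weights 18, 18 and 36, the weights sum to 72, so each job receives exactly as many k-points as its weight.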
When we ran SCF calculations with this machines file, we got the following error:
cycle 1 (Wed Apr 6 17:34:32 SGT 2005) (20/20 to go)
> lapw0 -p (17:34:32) starting parallel lapw0 at Wed Apr 6 17:34:32 SGT 2005
-------- .machine1 : 2 processors
opto024
opto025
--------
0.020u 0.010s 0:00.06 50.0% 0+0k 0+0io 2588pf+0w
> stop error
When I checked STDOUT, I found:
1[1]: No match.
I checked the files and saw that case.vsp and case.vns had not been generated, and that case.clmup/dn were empty.
But when I use the following .machines file:
granularity:1
18:opto024:2
18:opto025:2
36:opto030:2
LAPW0 ran well, but LAPW1 crashed.
The dayfile shows:
cycle 1 (Wed Apr 6 18:05:39 SGT 2005) (20/20 to go)
> lapw0 -p (18:05:39) starting parallel lapw0 at Wed Apr 6 18:05:39 SGT 2005
--------
running lapw0 in single mode
1.960u 0.030s 0:02.13 93.4% 0+0k 0+0io 2502pf+0w
> lapw1 -p (18:05:41) starting parallel lapw1 at Wed Apr 6 18:05:41 SGT 2005
-> starting parallel LAPW1 jobs at Wed Apr 6 18:05:41 SGT 2005
running LAPW1 in parallel mode (using .machines)
3 number_of_parallel_jobs
** LAPW1 crashed!
0.030u 0.040s 0:05.34 1.3% 0+0k 0+0io 14149pf+0w
> stop error
STDOUT
STOP LAPW0 END
STOP
bash: line 1: lapw1: command not found
real 0m0.003s
user 0m0.000s
sys 0m0.000s
bash: line 1: lapw1: command not found
real 0m0.002s
user 0m0.000s
sys 0m0.000s
bash: lapw1: command not found
real 0m0.002s
user 0m0.000s
sys 0m0.001s
cat: No match.
We really do not understand the reason for these errors. However, we noticed that if we run in remote mode, i.e., with the following machines file for the host Darwin:
granularity:1
18:darwin
18:darwin
36:darwin
then everything works fine.
Could someone help us understand these errors and how to overcome them? Many thanks.
Best regards,
Khuong