[Wien] .machines file

Valerio Bellini vbellini at unimo.it
Fri Aug 29 19:28:48 CEST 2003


Dear Griselda,

...
> #@node=6
> #@tasks_per_node=8
...
> cat << EOF >.machines
> granularity:1
> 1:bluehorizon.npaci.edu:4
> lapw0:bluehorizon.npaci.edu:1
> EOF
> runsp_lapw -p

The .machines file has to be consistent with the number of nodes
and tasks you allocate through LoadLeveler (LL), so that you do not
waste resources.
This means that if you want to use 48 processors (6 nodes x 8 tasks
per node, as your allocation suggests) with the fine-grained
parallelization, the file should look like:

granularity:1
1:bluehorizon.npaci.edu:48
lapw0:bluehorizon.npaci.edu:1

With the .machines file you wrote, you will run on only
4 processors.
If you want to use the k-point parallelization instead, and run on
4 processors, then it should look like:

granularity:1
1:bluehorizon.npaci.edu:1
1:bluehorizon.npaci.edu:1
1:bluehorizon.npaci.edu:1
1:bluehorizon.npaci.edu:1
lapw0:bluehorizon.npaci.edu:1

One line for each k-point.
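
Since the k-point file is just one identical line repeated, it can be produced with a small loop. Again a sketch only: write_kpoint_machines is a hypothetical helper, and the number of jobs (4 here) is whatever you decide to run in parallel.

```shell
#!/bin/sh
# Sketch: build a k-point parallel .machines file with one job line per k-point.
# write_kpoint_machines is a hypothetical helper, not a WIEN2k script.
write_kpoint_machines() {
    host=$1    # host name to write into the file
    njobs=$2   # number of k-point jobs to run in parallel
    echo "granularity:1"
    i=0
    while [ "$i" -lt "$njobs" ]; do
        echo "1:${host}:1"   # weight 1, one processor for this job
        i=$((i + 1))
    done
    echo "lapw0:${host}:1"
}

write_kpoint_machines bluehorizon.npaci.edu 4
```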

Or, in principle you could use some hybrid parallelization.
If your .machines file looks like:

granularity:1
1:bluehorizon.npaci.edu:12
1:bluehorizon.npaci.edu:12
1:bluehorizon.npaci.edu:12
1:bluehorizon.npaci.edu:12
lapw0:bluehorizon.npaci.edu:1

then in total you use 48 processors, and for each of your
4 k-points there are 12 processors which take care of the
eigenvalue problem.
On a shared memory architecture such as the one you surely have,
unfortunately you cannot use a hybrid approach with the
scripts provided with the WIEN2k code.
This is because you cannot issue more than one 'poe' command.
The only way to do it is to modify the scripts extensively
so that they run as a step job, but I have not done that so far either
(I am running on a multinode IBM-SP4 machine).
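
Whichever layout you choose, a quick way to check that the job lines add up to the number of tasks you allocated is to sum the processor counts in the file. The sketch below assumes only the simple weight:host:count form used in this message; count_procs is a hypothetical helper, and the temporary file just stands in for a real .machines file.

```shell
#!/bin/sh
# Sketch: sum the processor counts on the job lines of a .machines file.
# Assumes every job line has the simple form weight:host:count.
count_procs() {
    # Only lines whose first field is a numeric weight are job lines;
    # granularity: and lapw0: lines are skipped.
    awk -F: '$1 ~ /^[0-9]+$/ { total += $NF } END { print total + 0 }' "$1"
}

# Try it on a copy of the hybrid example from this message
tmp=$(mktemp)
cat > "$tmp" << EOF
granularity:1
1:bluehorizon.npaci.edu:12
1:bluehorizon.npaci.edu:12
1:bluehorizon.npaci.edu:12
1:bluehorizon.npaci.edu:12
lapw0:bluehorizon.npaci.edu:1
EOF
count_procs "$tmp"   # prints 48
rm -f "$tmp"
```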

Coming back to the .machines file: when you are on a shared
memory machine you do not use the remote shell to connect
to the nodes; everything is done automatically by the 'poe'
command.
Therefore the machine names you put in the .machines
file are not used, so in principle you could put whatever
you want there.

> The error files sent after my job ran:
> 
> stty: tcgetattr: A specified file does not support the ioctl system call.
> hup: Command not found.
> STOP  LAPW0 END
> /paci/ucsd/u11341/WIEN/inst/lapw1cpara: Command not found.

This message seems to indicate that the PATH is not set up correctly, so
the lapw1para_lapw script is not found.
Try to run it from the command line.
Does it execute?
If not, you have to add the directory where the scripts are to your
PATH, i.e. put the following line in your .cshrc:

set path=( $path /paci/ucsd/u11341/WIEN/inst/)
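
To see whether the shell can actually find the script, a check along these lines may help. Note that this sketch is written for a Bourne-style shell (sh), whereas the set path line above is csh syntax; under csh or tcsh you would typically use which instead. check_on_path is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report whether a command can be found through PATH.
# check_on_path is a hypothetical helper, not a WIEN2k script.
check_on_path() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found in PATH"
        return 1
    fi
}

check_on_path lapw1para_lapw || echo "add the WIEN2k script directory to PATH"
```

In sh the equivalent of the .cshrc line would be PATH=$PATH:/paci/ucsd/u11341/WIEN/inst; export PATH.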

Moreover, if one needs to run the k-point parallelization
under LoadLeveler on more than one node, one has to change
the scripts that take care of the parallel execution.
This is because only the 'poe' command is able to connect dynamically
to the nodes allocated by LL (which you do not know in advance).

I made these changes, and if you want I can send you the
'sp4-tuned' parallel script files.

But...
It seems that your program does not get that far, and stops
immediately when it tries to run the lapw1para_lapw script.

Valerio


