[Wien] I still have problem with wienk in parallel mode

Laurence Marks L-marks at northwestern.edu
Wed Dec 28 22:49:25 CET 2011


Suggestions, assuming that all your computers are dual quadcores:
a) Use as .machines file
1:bodesking.uefs.br:8
1:compute-0-0.local:8
1:compute-0-1.local:8

This will run 3 tasks, each using mpi with 8 cores, one task per computer. If
they are not dual quad-cores but only have (for instance) 4 cores,
change the "8" to "4".
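For reference, each such .machines line becomes one mpirun call: lapw1para substitutes into the WIEN_MPIRUN template from parallel_options. A rough sketch of that substitution (the .machine1 and lapw1_1.def names are illustrative, not the exact files on your system):

```shell
# Rough sketch of how lapw1para expands the WIEN_MPIRUN template
# (the per-task machinefile and .def names here are illustrative):
template='mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_'
np=8                            # cores requested on the .machines line
hosts=.machine1                 # per-task machine file written by lapw1para
exec_cmd='lapw1_mpi lapw1_1.def'
cmd=$(echo "$template" | sed -e "s/_NP_/$np/" \
                             -e "s/_HOSTS_/$hosts/" \
                             -e "s|_EXEC_|$exec_cmd|")
echo "$cmd"
# -> mpirun -np 8 -machinefile .machine1 lapw1_mpi lapw1_1.def
```

So a line like "1:compute-0-0.local:8" only launches lapw1_mpi if the ":8" (more than one core) is there; with ":1" you get the plain serial lapw1 instead.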

b) If this still fails, do "tail *.scf1*" and "tail *.output1*" and
see whether only one task failed or all of them did. I assume you are using a
terminal, not just w2web. Have you checked the error files?
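A quick way to spot which step failed is to flag any non-empty *.error file in the case directory. The demo below works in /tmp with two made-up files just to illustrate; in real use you would run only the for-loop, in your case directory:

```shell
# Demo in /tmp so nothing in a real case directory is touched;
# in practice run only the for-loop inside the case directory.
mkdir -p /tmp/wien_err_demo && cd /tmp/wien_err_demo
: > lapw0.error                       # empty error file -> that step was fine
echo 'Error in LAPW1' > lapw1.error   # non-empty -> this step failed
for f in *.error; do
  [ -s "$f" ] && echo "non-empty error file: $f"
done
# -> non-empty error file: lapw1.error
```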

c) Do you have passwordless ssh set up? For instance, you need to be
able to do "ssh compute-0-0.local" without being asked for a password. If
it is not set up, you may have to set it up, as many mpi versions need it.
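The usual setup is an ssh key with an empty passphrase, copied to each node. A sketch (the key is written to /tmp here only so nothing of yours gets overwritten; in real use the default ~/.ssh/id_rsa location is what ssh picks up automatically):

```shell
# Generate a key non-interactively (demo path /tmp; real use: ~/.ssh/id_rsa)
rm -f /tmp/wien_demo_key /tmp/wien_demo_key.pub
ssh-keygen -q -t rsa -N "" -f /tmp/wien_demo_key
ls /tmp/wien_demo_key.pub
# On the cluster (not run here): copy the public key to each node and test:
#   ssh-copy-id compute-0-0.local
#   ssh compute-0-0.local hostname   # must return without a password prompt
```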

d) Do "cd $WIENROOT ; cp lapw1para lapw1para_hold", then edit lapw1para
and change the first line to "#!/bin/csh -xf". This will give you
masses of output, and may show an error. If nothing else it will show
a command such as "mpirun ...". You can then paste that particular
command into the terminal and run it to get more information.
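The -x flag is what exposes the exact mpirun line: it makes the shell echo every command before executing it. The same effect in one line, shown with sh -x here since it traces just like csh -x:

```shell
# -x prints each command (prefixed with "+ ") to stderr before running it
sh -xc 'echo LAPW1 trace demo' 2>&1
# prints:
#   + echo LAPW1 trace demo
#   LAPW1 trace demo
```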


2011/12/28 Nilton <nilton.dantas at gmail.com>:
> Dear L. Marks
> thanks a lot for the answer. Here are my comments:
>
> 2011/12/27 Laurence Marks <L-marks at northwestern.edu>
>>
>> It is hard to know, as you have not provided us with enough
>> information, so we can only guess. Most likely you have set up
>> the problem wrong, for instance bad RMTs, a bad case.in1c, or something
>> else. Read the file lapw1.error to see if it contains anything, and also
>> the various output files. Beyond this:
>
>
> The setup is correct, because I can run WIEN2k in sequential mode
>
>>
>>
>> a) Did you compile the mpi versions? If not, then what you are using
>> will not work. There are two ways to run Wien2k in parallel, one uses
>> mpi and is needed for big jobs, the other does not use mpi and is
>> often simpler for small jobs.
>
>
> Yes, I am using WIEN2k 10.1. I tried to compile WIEN2k 11 but I got some
> errors in the lapw2(c)_mpi compilation, so I gave up
>
>>
>> b) Edit parallel_options and put "setenv debug 1" in (remove it later)
>> then do "x lapw1 -p" from the terminal. This will give you more
>> output.
>
>
> I did; it seems ok but it is running in single mode. Please see the output below
>
>>
>> c) Check that you have ssh enabled to the compute nodes (I don't think
>> you need the .local at the end)
>
>
> My ssh is working. I can log on the nodes of my cluster.
>>
>>
>> A comment. You have set up your .machines file to run 5 tasks for
>> lapw1, each using 4 cpus. Some mpi versions are not smart, and with
>> what you have they will run several tasks on compute-0-0 using the same cores.
>
> granularity:1
> 1:bodesking.uefs.br:1
> 1:bodesking.uefs.br:1
> 1:compute-0-0.local:1
> 1:compute-0-0.local:1
> 1:compute-0-0.local:1
> 1:compute-0-0.local:1
> 1:compute-0-1.local:1
> 1:compute-0-1.local:1
> 1:compute-0-1.local:1
> 1:compute-0-1.local:1
>
> With this file, if I type run_lapw -p, I get 11 processes for lapw1, with 2 on
> every computer listed, but no lapw1_mpi or lapw2_mpi. This is the point: how
> can I set up .machines so that WIEN2k runs with the mpi libraries? Below you
> can see the configuration of the parallel_options file
>
> setenv USE_REMOTE 1
> setenv MPI_REMOTE 1
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
>
>
> ------------------------------The output of x lapw0 -p and x lapw1 -p
> [nilton at bodesking case]$ x lapw0 -p
> starting parallel lapw0 at Wed Dec 28 18:29:26 BRT 2011
> -------- .machine0 : processors
> running lapw0 in single mode
>  LAPW0 END
> 14.599u 0.400s 0:15.01 99.8%    0+0k 0+0io 0pf+0w
> [nilton at bodesking case]$ x lapw1 -p
> starting parallel lapw1 at Wed Dec 28 18:29:46 BRT 2011
> ->  starting parallel LAPW1 jobs at Wed Dec 28 18:29:46 BRT 2011
> running LAPW1 in parallel mode (using .machines)
> 10 number_of_parallel_jobs
> [1] 11587
> [2] 11724
> [3] 11856
> [4] 11887
> [5] 11917
> [6] 11944
> [7] 11976
> [8] 12002
> [9] 12033
>  LAPW1 END
> [1]    Done                          ( ( $remote $machine[$p]  ...
> [1] 12066
>  LAPW1 END
> [2]    Done                          ( ( $remote $machine[$p]  ...
> [2] 12108
>  LAPW1 END
> [3]    Done                          ( ( $remote $machine[$p]  ...
> [3] 12249
>  LAPW1 END
> [4]    Done                          ( ( $remote $machine[$p]  ...
>  LAPW1 END
>  LAPW1 END
>  LAPW1 END
>  LAPW1 END
>  LAPW1 END
>  LAPW1 END
>  LAPW1 END
>  LAPW1 END
> [3]    Done                          ( ( $remote $machine[$p]  ...
> [2]  + Done                          ( ( $remote $machine[$p]  ...
> [1]  + Done                          ( ( $remote $machine[$p]  ...
> [9]  + Done                          ( ( $remote $machine[$p]  ...
> [8]  + Done                          ( ( $remote $machine[$p]  ...
> [7]  + Done                          ( ( $remote $machine[$p]  ...
> [6]  + Done                          ( ( $remote $machine[$p]  ...
> [5]  + Done                          ( ( $remote $machine[$p]  ...
>      bodesking.uefs.br(3) 7.766u 0.476s 8.26 99.76%      0+0k 0+0io 0pf+0w
>      bodesking.uefs.br(3) 7.916u 0.225s 8.18 99.46%      0+0k 0+0io 0pf+0w
>      compute-0-0.local(3) 8.529u 0.300s 8.92 98.97%      0+0k 0+0io 0pf+0w
>      compute-0-0.local(3) 8.899u 0.185s 9.2 98.74%      0+0k 0+0io 0pf+0w
>      compute-0-0.local(3) 8.640u 0.260s 9.00 98.82%      0+0k 0+0io 0pf+0w
>      compute-0-0.local(3) 8.335u 0.249s 8.90 96.35%      0+0k 0+0io 0pf+0w
>      compute-0-1.local(3) 10.687u 0.250s 11.08 98.69%      0+0k 0+0io 0pf+0w
>      compute-0-1.local(3) 10.632u 0.294s 11.03 98.99%      0+0k 0+0io 0pf+0w
>      compute-0-1.local(3) 10.708u 0.206s 11.07 98.51%      0+0k 0+0io 0pf+0w
>      compute-0-1.local(3) 10.573u 0.310s 11.18 97.27%      0+0k 0+0io 0pf+0w
>      bodesking.uefs.br(3) 7.794u 0.343s 8.19 99.35%      0+0k 0+0io 0pf+0w
>      bodesking.uefs.br(3) 8.336u 0.209s 8.59 99.48%      0+0k 0+0io 0pf+0w
>    Summary of lapw1para:
>    bodesking.uefs.br     k=12    user=31.812     wallclock=2391.25
>    compute-0-0.local     k=12    user=34.403     wallclock=2554.08
>    compute-0-1.local     k=12    user=42.6       wallclock=3055.06
> 0.272u 0.446s 0:22.32 3.1%      0+0k 0+0io 0pf+0w
>
>
> Nilton
> --
> Nilton S. Dantas
> Universidade Estadual de Feira de Santana
> Departamento de Ciências Exatas
> Área de Informática
> Av. Transnordestina, S/N, Bairro Novo Horizonte
> CEP 44036900 - Feira de Santana, Bahia, Brasil
> Tel./Fax +55 75 31618086
> http://www2.ecomp.uefs.br/
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgi

