[Wien] MPI Problem

Paul Fons paul-fons at aist.go.jp
Sun Jan 22 23:08:51 CET 2012


Hi,
	I have Wien2K running on a cluster of Linux boxes, each with 32 cores, connected by 10 Gb Ethernet. I compiled Wien2K with the 3.174 version of the Intel compiler (I learned the hard way that bugs in newer versions of the Intel compiler lead to crashes in Wien2K), and I have also installed Intel's MPI. First, the single-process Wien2K, say for the TiC case, works fine. It also works fine when I use a .machines file like

granularity:1
localhost:1
localhost:1
…  (24 times).

This file leads to parallel execution without error. I can vary the number of processes by changing the number of localhost:1 lines in the file, and everything still works fine. When I switch to MPI with a single process, it works as well:

1:localhost:1  

> starting parallel lapw1 at Mon Jan 23 06:49:16 JST 2012
> ->  starting parallel LAPW1 jobs at Mon Jan 23 06:49:16 JST 2012
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 22417
>  LAPW1 END
> [1]  + Done                          ( cd $PWD; $t $exe ${def}_$loop.def; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>      localhost(111) 179.004u 4.635s 0:32.73 561.0%	0+0k 0+26392io 0pf+0w
>    Summary of lapw1para:
>    localhost	 k=111	 user=179.004	 wallclock=32.73
> 179.167u 4.791s 0:35.61 516.5%	0+0k 0+26624io 0pf+0w
> 
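
In all of these tests the calculation is launched the same way, through the standard parallel driver, so only the .machines file changes between runs. A minimal sketch, assuming the TiC case directory (the exact path is illustrative):

  cd TiC        # case directory
  x lapw1 -p    # reads .machines and dispatches the lapw1 jobs accordingly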

Changing the .machines file to use more than one MPI process (the same form of error occurs for more than two)

1:localhost:2

leads to a run-time error in the MPI subsystem.

> starting parallel lapw1 at Mon Jan 23 06:51:04 JST 2012
> ->  starting parallel LAPW1 jobs at Mon Jan 23 06:51:04 JST 2012
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 22673
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7ed20c) failed
> MPI_Comm_size(76).: Invalid communicator
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7ed20c) failed
> MPI_Comm_size(76).: Invalid communicator
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>      localhost localhost(111) APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
> 0.037u 0.036s 0:00.06 100.0%	0+0k 0+0io 0pf+0w
> TiC.scf1_1: No such file or directory.
>    Summary of lapw1para:
>    localhost	 k=0	 user=111	 wallclock=0
> 0.105u 0.168s 0:03.21 8.0%	0+0k 0+216io 0pf+0w

I have properly sourced the appropriate Intel runtime environment. For example, compiling (with mpiifort) and running the Fortran 90 MPI test program from Intel produces:



> mpirun -np 32 /home/paulfons/mpitest/testf90
>  Hello world: rank            0  of           32  running on asccmp177
>  Hello world: rank            1  of           32  running on    (32 times)
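
For anyone who wants to reproduce this check: the test is essentially the standard MPI hello world. A minimal sketch of an equivalent Fortran program (not Intel's exact source, just the same idea) is

  program mpitest
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, namelen
    character(len=MPI_MAX_PROCESSOR_NAME) :: hostname
    ! report this rank's number, the total number of ranks, and the host it runs on
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_Get_processor_name(hostname, namelen, ierr)
    write(*,*) 'Hello world: rank ', rank, ' of ', nprocs, ' running on ', trim(hostname)
    call MPI_Finalize(ierr)
  end program mpitest

compiled with mpiifort (e.g. mpiifort mpitest.f90 -o testf90) and launched with the mpirun command above, so the Intel MPI runtime itself seems to be working.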

Does anyone have any suggestions as to what to try next? I am not sure how to debug things from here. I have about 512 nodes that I can use for larger calculations, but they can only be accessed via MPI (the ssh setup works fine as well, by the way). It would be great to figure out what is wrong.
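
The only further check I could think of was to confirm that the MPI-parallel Wien2K binaries are actually linked against the same Intel MPI runtime that my mpirun comes from, along the lines of

  which mpirun                               # should point into the sourced Intel MPI installation
  ldd $WIENROOT/lapw1_mpi | grep -i mpi      # do the resolved MPI libraries come from that same installation?

(the paths are only illustrative), but beyond that I do not know where to look.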

Thanks.

Dr. Paul Fons
Functional Nano-phase-change Research Team
Team Leader
Nanodevice Innovation Research Center (NIRC)
National Institute for Advanced Industrial Science & Technology
METI

AIST Central 4, Higashi 1-1-1
Tsukuba, Ibaraki JAPAN 305-8568

tel. +81-298-61-5636
fax. +81-298-61-2939

email: paul-fons at aist.go.jp

The following lines give the same contact details in Japanese (translated here):

1-1-1 Tsukuba Central Higashi, Tsukuba, Ibaraki 305-8562
National Institute of Advanced Industrial Science and Technology (AIST)
Nanoelectronics Device Research Center
Phase-Change Novel Functional Device Research Team, Team Leader
Paul Fons



