[Wien] MPI Problem

Laurence Marks L-marks at northwestern.edu
Mon Jan 23 00:08:50 CET 2012


A guess: you are linking against the wrong version of BLACS. You need a
-lmkl_blacs_intelmpi_XX
where "XX" is the interface suffix for your system (e.g. lp64). I have seen
this give the same error.

Use http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

For reference, with Open MPI it is _openmpi_ instead of _intelmpi_, and
similarly for SGI MPT.
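
As a rough sketch of what the corrected setting might look like (the exact
library names depend on your MKL version, so take the real line from the
advisor; RP_LIB is, if I recall correctly, the parallel-library entry set by
siteconfig_lapw):

   # SCALAPACK + BLACS layer for Intel MPI (assumes MKL, lp64 interface)
   RP_LIB = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 $(R_LIBS)
   # with Open MPI the BLACS layer changes:
   # RP_LIB = -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 $(R_LIBS)

After changing it, rerun siteconfig and recompile the MPI parallel programs
(lapw0, lapw1, lapw2) so they pick up the new BLACS.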

2012/1/22 Paul Fons <paul-fons at aist.go.jp>:
>
> Hi,
> I have Wien2K running on a cluster of Linux boxes, each with 32 cores and
> connected by 10Gb Ethernet.  I have compiled Wien2K with the 3.174 version
> of the Intel compiler (I learned the hard way that bugs in the newer
> versions of the Intel compiler lead to crashes in Wien2K).  I have also
> installed Intel's MPI.  First, the single-process Wien2K, let's say for the
> TiC case, works fine.  It also works fine when I use a .machines file like
>
> granularity:1
> localhost:1
> localhost:1
> …  (24 times).
>
> This file leads to parallel execution without error.  I can vary the number
> of processes by changing the number of localhost:1 lines in the file, and
> everything still works fine.  When I try to use MPI to communicate with one
> process, it works as well.
>
> 1:localhost:1
>
> starting parallel lapw1 at Mon Jan 23 06:49:16 JST 2012
>
> ->  starting parallel LAPW1 jobs at Mon Jan 23 06:49:16 JST 2012
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 22417
>  LAPW1 END
> [1]  + Done                          ( cd $PWD; $t $exe ${def}_$loop.def; rm
> -f .lock_$lockfile[$p] ) >> .time1_$loop
>      localhost(111) 179.004u 4.635s 0:32.73 561.0%	0+0k 0+26392io 0pf+0w
>    Summary of lapw1para:
>    localhost	 k=111	 user=179.004	 wallclock=32.73
> 179.167u 4.791s 0:35.61 516.5%	0+0k 0+26624io 0pf+0w
>
>
> Changing the .machines file to request more than one MPI process (the same
> form of error occurs for more than two),
>
> 1:localhost:2
>
> leads to a run-time error in the MPI subsystem.
>
> starting parallel lapw1 at Mon Jan 23 06:51:04 JST 2012
> ->  starting parallel LAPW1 jobs at Mon Jan 23 06:51:04 JST 2012
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> [1] 22673
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7ed20c) failed
> MPI_Comm_size(76).: Invalid communicator
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(123): MPI_Comm_size(comm=0x5b, size=0x7ed20c) failed
> MPI_Comm_size(76).: Invalid communicator
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f
> .lock_$lockfile[$p] ) >> .time1_$loop
>      localhost localhost(111) APPLICATION TERMINATED WITH THE EXIT STRING:
> Hangup (signal 1)
> 0.037u 0.036s 0:00.06 100.0%	0+0k 0+0io 0pf+0w
> TiC.scf1_1: No such file or directory.
>    Summary of lapw1para:
>    localhost	 k=0	 user=111	 wallclock=0
> 0.105u 0.168s 0:03.21 8.0%	0+0k 0+216io 0pf+0w
>
>
> I have properly sourced the appropriate runtime environment for the Intel
> tools.  For example, compiling (with mpiifort) and running the f90 MPI test
> program from Intel produces:
>
> mpirun -np 32 /home/paulfons/mpitest/testf90
>  Hello world: rank            0  of           32  running on asccmp177
>  Hello world: rank            1  of           32  running on ...  (32 times)
>
> Does anyone have any suggestions as to what to try next?  I am not sure how
> to debug things from here.  I have about 512 nodes that I can use for larger
> calculations that can only be accessed via MPI (the ssh setup works fine as
> well, by the way).  It would be great to figure out what is wrong.
>
> Thanks.
>
> Dr. Paul Fons
> Functional Nano-phase-change Research Team
> Team Leader
> Nanodevice Innovation Research Center (NIRC)
> National Institute for Advanced Industrial Science & Technology
> METI
>
> AIST Central 4, Higashi 1-1-1
> Tsukuba, Ibaraki JAPAN 305-8568
>
> tel. +81-298-61-5636
> fax. +81-298-61-2939
>
> email: paul-fons at aist.go.jp
>
> The following lines repeat the affiliation in Japanese (translated here):
>
> 1-1-1 Tsukuba Central East, Tsukuba, Ibaraki 305-8562
> National Institute of Advanced Industrial Science and Technology
> Nanoelectronics Device Research Center
> Team Leader, Phase-Change Novel Functional Device Research Team
> Paul Fons
>
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Györgyi

