[Wien] Error in mpi+k point parallelization across multiple nodes

Gavin Abo gsabo at crimson.ua.edu
Wed May 6 18:29:10 CEST 2015


See below for my comments.

> Thanks for all the information and suggestions.
>
> I have tried changing -lmkl_blacs_intelmpi_lp64 to -lmkl_blacs_lp64 
> and recompiling. However, I got the following error message in the 
> screen output:
>
>  LAPW0 END
> [cli_14]: [cli_15]: [cli_6]: aborting job:
> Fatal error in PMPI_Comm_size:
> Invalid communicator, error stack:
> PMPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7f190c) failed
> PMPI_Comm_size(69).: Invalid communicator
> aborting job:
> Fatal error in PMPI_Comm_size:
> Invalid communicator, error stack:
> PMPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7f190c) failed
> PMPI_Comm_size(69).: Invalid communicator
> .......
> [z0-5:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 
> 20. MPI process died?
> [z0-5:mpispawn_0][mtpmi_processops] Error while reading PMI socket. 
> MPI process died?
> [z0-5:mpispawn_0][child_handler] MPI process (rank: 14, pid: 11260) 
> exited with status 1
> [z0-5:mpispawn_0][child_handler] MPI process (rank: 3, pid: 11249) 
> exited with status 1
> [z0-5:mpispawn_0][child_handler] MPI process (rank: 6, pid: 11252) 
> exited with status 1
> .....

This is probably because you are using the wrong BLACS library.  The 
-lmkl_blacs_lp64 library is for MPICH, but MVAPICH2, which you are 
using, is a variant of MPICH3.
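For reference, in the parallel compiler options set through siteconfig, 
the ScaLAPACK/BLACS link line would then typically look something like 
the line below.  This is only a sketch; the exact MKL library names 
depend on your MKL version:

   RP_LIBS = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 $(R_LIBS)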

> Previously I compiled the program with -lmkl_blacs_intelmpi_lp64, and 
> the mpi parallelization on a single node seems to be working. I noticed 
> that during the run the *.error files had finite sizes, but when I 
> re-examined them after the job finished there were no errors written 
> inside (and the files are 0 kB now). Does this indicate that mpi is not 
> running properly at all, even on a single node? I have checked the 
> output, though, and it agrees with the non-mpi results (for some simple 
> cases).

Sounds like it is working fine on a single node.  At least for now, stay 
with -lmkl_blacs_intelmpi_lp64, since that works.

As I asked before, did you give us all of the error information from the 
case.dayfile and from standard output?  It is not entirely clear from 
your previous posts, but it looks to me like you may have only provided 
information from the case.dayfile and the error files (cat *.error), but 
not from the standard output.  Are you still using the PBS script from 
your old post at 
http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html 
?  In that script, the standard output is set to be written to a file 
called wien2k_output.
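If so, after the job finishes you can collect all three pieces of 
information from the case directory with something like the following 
(a sketch; case.dayfile stands for the dayfile of your actual case):

   cat case.dayfile     # per-iteration summary written by run_lapw
   cat *.error          # WIEN2k error files (empty means no error)
   cat wien2k_output    # captured standard output, where the MPI
                        # launcher messages usually end up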

When it runs fine on a single node, does it always use the same node 
(say z1-17) or does it run fine on other nodes (like z1-18)?

> I also tried changing mpirun to mpiexec, as suggested by Prof. Marks, 
> by setting:
> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f 
> _HOSTS_ _EXEC_"
> in parallel_options. In this case, the program does not run, but it 
> also does not terminate (qstat on the cluster just shows 00:00:00 for 
> the time with a running status).

At least for now, stay with mpirun since it works on a single node.
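For reference, the mpirun form of that line would look roughly like the 
following sketch.  The path is taken from your message, and -machinefile 
is the placeholder form WIEN2k uses by default; if your mpirun does not 
accept it, -f _HOSTS_ should also work with the Hydra launcher:

   setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"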

