[Wien] MPI problems

Laurence Marks L-marks at northwestern.edu
Wed Jan 19 11:28:24 CET 2011


Most probably you have a problem in the initial setup; an MPI problem is less likely.

Please first verify that this particular case runs in non-MPI, k-point parallel mode.
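
For reference, a minimal sketch of a .machines file for k-point parallel (non-MPI) execution on a single node might look like the following -- one "1:" line per independent k-point job, using the node name from your setup; adjust the number of lines to the cores you want to use:

    1:node046.cm.cluster
    1:node046.cm.cluster
    1:node046.cm.cluster
    1:node046.cm.cluster
    granularity:1
    extrafine:1

With separate "1:" lines, lapw1/lapw2 run as independent serial jobs on subsets of the k-list; putting several host names on one line (as in your file below) instead requests a single MPI job across those slots.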

If it does, please check the mailing list archives for openmpi. You need to compile correctly, use a recent enough openmpi version, and avoid issues with openmpi not exporting environment variables.
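
To illustrate the last point: by default Open MPI does not forward all environment variables to remote processes, and its mpirun accepts -x to export them explicitly. In WIEN2k the mpirun command line is defined in $WIENROOT/parallel_options, so a sketch (the exact variables to export are an assumption; adapt to your installation) could be:

    # in $WIENROOT/parallel_options (csh syntax)
    # -x exports the named environment variables to all MPI ranks;
    # _NP_, _HOSTS_, _EXEC_ are placeholders the WIEN2k parallel
    # scripts substitute at run time.
    setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
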
On Jan 18, 2011 5:07 PM, "Laurent CHAPUT" <Laurent.Chaput at ijl.nancy-universite.fr> wrote:
> Dear Wien2k users
>
> I am experiencing some problems trying to run an MPI calculation on our cluster. I am using version 10.1 (Release 7/6/2010) with openmpi and the Intel compiler. I end up with errors in the dayfile and in the error file (see below).
> Here is my .machines file :
>
> lapw0:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
> 1:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
> granularity:1
> extrafine:1
>
> I would appreciate any help.
> Thanks in advance,
> L. Chaput
>
>
>> lapw0 -p (23:43:38) starting parallel lapw0 at Tue Jan 18 23:43:38 CET 2011
> -------- .machine0 : 4 processors
> 3.906u 0.165s 0:02.14 189.7% 0+0k 0+0io 24pf+0w
>> lapw1 -p (23:43:40) starting parallel lapw1 at Tue Jan 18 23:43:40 CET 2011
> -> starting parallel LAPW1 jobs at Tue Jan 18 23:43:40 CET 2011
> Tue Jan 18 23:43:40 CET 2011 -> Setting up case bi for parallel execution
> Tue Jan 18 23:43:40 CET 2011 -> of LAPW1
> Tue Jan 18 23:43:40 CET 2011 ->
> running LAPW1 in parallel mode (using .machines)
> Granularity set to 1
> Extrafine set
> Tue Jan 18 23:43:40 CET 2011 -> klist: 116
> Tue Jan 18 23:43:40 CET 2011 -> machines: node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
> Tue Jan 18 23:43:40 CET 2011 -> procs: 1
> Tue Jan 18 23:43:40 CET 2011 -> weigh(old): 1
> Tue Jan 18 23:43:40 CET 2011 -> sumw: 1
> Tue Jan 18 23:43:40 CET 2011 -> granularity: 1
> Tue Jan 18 23:43:40 CET 2011 -> weigh(new): 116
> Tue Jan 18 23:43:40 CET 2011 -> Splitting bi.klist.tmp into junks
> .machinetmp
> 1 number_of_parallel_jobs
> prepare 1 on node046.cm.cluster
> Tue Jan 18 23:43:40 CET 2011 -> Creating klist 1
> waiting for all processes to complete
> Tue Jan 18 23:43:42 CET 2011 -> all processes done.
> Tue Jan 18 23:43:43 CET 2011 -> CPU TIME summary:
> Tue Jan 18 23:43:43 CET 2011 -> ================
> node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster(116) Child id 3 SIGSEGV, contact developers
> Child id 1 SIGSEGV, contact developers
> Child id 2 SIGSEGV, contact developers
> Child id 0 SIGSEGV, contact developers
> 0.080u 0.077s 0:01.13 13.2% 0+0k 0+0io 16pf+0w
> Summary of lapw1para:
> node046.cm.cluster k=0 user=0 wallclock=6960
> 0.122u 0.397s 0:03.22 15.8% 0+0k 0+0io 16pf+0w
>> lapw2 -p (23:43:43) running LAPW2 in parallel mode
> ** LAPW2 crashed!
> 0.029u 0.085s 0:00.12 83.3% 0+0k 0+0io 0pf+0w
> error: command /CALCULS/lchaput/code/wien2k/lapw2para lapw2.def failed
>
>> stop error
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> And this in the error file.
>
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> .machinetmp222: No such file or directory
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 8292600.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 24638 on
> node node046 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [node046:24635] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
> [node046:24635] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> w2k_dispatch_signal(): received: Terminated
> bi.scf1_1: No such file or directory.
> FERMI - Error
> cp: cannot stat `.in.tmp': No such file or directory
> rm: cannot remove `.in.tmp': No such file or directory
> rm: cannot remove `.in.tmp1': No such file or directory

