[Wien] MPI problems
Laurent CHAPUT
Laurent.Chaput at ijl.nancy-universite.fr
Wed Jan 19 00:07:20 CET 2011
Dear Wien2k users,
I am having problems running an MPI calculation on our cluster. I am using version 10.1 (Release 7/6/2010) with Open MPI and the Intel compiler. The run ends with errors in the dayfile and in the error file (see below).
Here is my .machines file:
lapw0:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
1:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
granularity:1
extrafine:1
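As a cross-check of what this file requests, here is a minimal Python sketch that counts the hosts listed on each line. It is not part of WIEN2k: the function name `summarize_machines` is hypothetical, and the parsing assumes only the plain space-separated host lists used above. It also assumes that several hosts on one k-point line mean a single job spread over that many MPI processes, which is consistent with the ".machine0 : 4 processors" and "1 number_of_parallel_jobs" lines in the dayfile.

```python
def summarize_machines(text):
    """Count hosts per line of a .machines file (simplified sketch)."""
    lapw0_procs = 0
    kpoint_jobs = []  # one entry per k-point line: (weight, number of hosts)
    options = {}      # remaining "key:value" lines, e.g. granularity
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, rest = line.partition(":")
        if key == "lapw0":
            lapw0_procs = len(rest.split())
        elif key.isdigit():
            kpoint_jobs.append((int(key), len(rest.split())))
        else:
            options[key] = rest.strip()
    return lapw0_procs, kpoint_jobs, options


machines = """\
lapw0:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
1:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
granularity:1
extrafine:1
"""

lapw0_procs, kpoint_jobs, options = summarize_machines(machines)
print("lapw0 processes:", lapw0_procs)
print("k-point jobs (weight, hosts):", kpoint_jobs)
print("options:", options)
```

Under these assumptions the file asks for 4 processes for lapw0 and one k-point job with weight 1 on 4 hosts, all on node046.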
I would appreciate any help.
Thanks in advance,
L. Chaput
> lapw0 -p (23:43:38) starting parallel lapw0 at Tue Jan 18 23:43:38 CET 2011
-------- .machine0 : 4 processors
3.906u 0.165s 0:02.14 189.7% 0+0k 0+0io 24pf+0w
> lapw1 -p (23:43:40) starting parallel lapw1 at Tue Jan 18 23:43:40 CET 2011
-> starting parallel LAPW1 jobs at Tue Jan 18 23:43:40 CET 2011
Tue Jan 18 23:43:40 CET 2011 -> Setting up case bi for parallel execution
Tue Jan 18 23:43:40 CET 2011 -> of LAPW1
Tue Jan 18 23:43:40 CET 2011 ->
running LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine set
Tue Jan 18 23:43:40 CET 2011 -> klist: 116
Tue Jan 18 23:43:40 CET 2011 -> machines: node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
Tue Jan 18 23:43:40 CET 2011 -> procs: 1
Tue Jan 18 23:43:40 CET 2011 -> weigh(old): 1
Tue Jan 18 23:43:40 CET 2011 -> sumw: 1
Tue Jan 18 23:43:40 CET 2011 -> granularity: 1
Tue Jan 18 23:43:40 CET 2011 -> weigh(new): 116
Tue Jan 18 23:43:40 CET 2011 -> Splitting bi.klist.tmp into junks
.machinetmp
1 number_of_parallel_jobs
prepare 1 on node046.cm.cluster
Tue Jan 18 23:43:40 CET 2011 -> Creating klist 1
waiting for all processes to complete
Tue Jan 18 23:43:42 CET 2011 -> all processes done.
Tue Jan 18 23:43:43 CET 2011 -> CPU TIME summary:
Tue Jan 18 23:43:43 CET 2011 -> ================
node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster(116) Child id 3 SIGSEGV, contact developers
Child id 1 SIGSEGV, contact developers
Child id 2 SIGSEGV, contact developers
Child id 0 SIGSEGV, contact developers
0.080u 0.077s 0:01.13 13.2% 0+0k 0+0io 16pf+0w
Summary of lapw1para:
node046.cm.cluster k=0 user=0 wallclock=6960
0.122u 0.397s 0:03.22 15.8% 0+0k 0+0io 16pf+0w
> lapw2 -p (23:43:43) running LAPW2 in parallel mode
** LAPW2 crashed!
0.029u 0.085s 0:00.12 83.3% 0+0k 0+0io 0pf+0w
error: command /CALCULS/lchaput/code/wien2k/lapw2para lapw2.def failed
> stop error
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
And this is in the error file:
LAPW0 END
LAPW0 END
LAPW0 END
LAPW0 END
.machinetmp222: No such file or directory
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 8292600.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 24638 on
node node046 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[node046:24635] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[node046:24635] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
w2k_dispatch_signal(): received: Terminated
bi.scf1_1: No such file or directory.
FERMI - Error
cp: cannot stat `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory