[Wien] MPI problems

Laurent CHAPUT Laurent.Chaput at ijl.nancy-universite.fr
Wed Jan 19 00:07:20 CET 2011


Dear Wien2k users,

I am experiencing problems trying to run an MPI calculation on our cluster. I am using version 10.1 (Release 7/6/2010) with Open MPI and the Intel compiler. I end up with errors in the dayfile and in the error file (see below).
Here is my .machines file:

lapw0:node046.cm.cluster  node046.cm.cluster  node046.cm.cluster  node046.cm.cluster
1:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
granularity:1
extrafine:1
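
For reference, if I read the user's guide correctly, this file requests a single MPI-parallel lapw1 job over four cores of node046, whereas the k-point parallel alternative (four independent serial lapw1 jobs sharing the k-list, with no MPI inside lapw1) would list one host per line:

1:node046.cm.cluster
1:node046.cm.cluster
1:node046.cm.cluster
1:node046.cm.cluster
granularity:1
extrafine:1

That variant avoids the MPI binaries entirely, so it may help to isolate whether the problem lies in the MPI setup itself.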

I would appreciate any help.
Thanks in advance,
L. Chaput


>   lapw0 -p    (23:43:38) starting parallel lapw0 at Tue Jan 18 23:43:38 CET 2011
-------- .machine0 : 4 processors
3.906u 0.165s 0:02.14 189.7%    0+0k 0+0io 24pf+0w
>   lapw1  -p   (23:43:40) starting parallel lapw1 at Tue Jan 18 23:43:40 CET 2011
->  starting parallel LAPW1 jobs at Tue Jan 18 23:43:40 CET 2011
Tue Jan 18 23:43:40 CET 2011 -> Setting up case bi for parallel execution
Tue Jan 18 23:43:40 CET 2011 -> of LAPW1
Tue Jan 18 23:43:40 CET 2011 ->
running LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine set
Tue Jan 18 23:43:40 CET 2011 -> klist:       116
Tue Jan 18 23:43:40 CET 2011 -> machines:    node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
Tue Jan 18 23:43:40 CET 2011 -> procs:       1
Tue Jan 18 23:43:40 CET 2011 -> weigh(old):  1
Tue Jan 18 23:43:40 CET 2011 -> sumw:        1
Tue Jan 18 23:43:40 CET 2011 -> granularity: 1
Tue Jan 18 23:43:40 CET 2011 -> weigh(new):  116
Tue Jan 18 23:43:40 CET 2011 -> Splitting bi.klist.tmp into junks
.machinetmp
1 number_of_parallel_jobs
prepare 1 on node046.cm.cluster
Tue Jan 18 23:43:40 CET 2011 -> Creating klist 1
waiting for all processes to complete
Tue Jan 18 23:43:42 CET 2011 -> all processes done.
Tue Jan 18 23:43:43 CET 2011 -> CPU TIME summary:
Tue Jan 18 23:43:43 CET 2011 -> ================
     node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster(116)  Child id           3 SIGSEGV, contact developers
 Child id           1 SIGSEGV, contact developers
 Child id           2 SIGSEGV, contact developers
 Child id           0 SIGSEGV, contact developers
0.080u 0.077s 0:01.13 13.2%     0+0k 0+0io 16pf+0w
   Summary of lapw1para:
   node046.cm.cluster    k=0     user=0  wallclock=6960
0.122u 0.397s 0:03.22 15.8%     0+0k 0+0io 16pf+0w
>   lapw2 -p    (23:43:43) running LAPW2 in parallel mode
**  LAPW2 crashed!
0.029u 0.085s 0:00.12 83.3%     0+0k 0+0io 0pf+0w
error: command   /CALCULS/lchaput/code/wien2k/lapw2para lapw2.def   failed

>   stop error
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
And this is in the error file.

LAPW0 END
 LAPW0 END
 LAPW0 END
 LAPW0 END
.machinetmp222: No such file or directory
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
w2k_dispatch_signal(): received: Segmentation fault
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 8292600.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 24638 on
node node046 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[node046:24635] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[node046:24635] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
w2k_dispatch_signal(): received: Terminated
bi.scf1_1: No such file or directory.
FERMI - Error
cp: cannot stat `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp': No such file or directory
rm: cannot remove `.in.tmp1': No such file or directory

