[Wien] Fwd: MPI segmentation fault
Laurence Marks
L-marks at northwestern.edu
Sat Jan 30 23:11:07 CET 2010
OK, looks like you have cleaned up many of the issues. The SIGSEGV is
(I think) now one of two things:
a) memory limitations (how much do you have, 8 GB or 16-24 GB?)
While the process is running, do a "top" and see how much memory is
allocated and whether it is essentially all in use. If you have ganglia
available you can use it to see this readily. Similar information is also
available from "cat /proc/meminfo" or the nmon utility from IBM
(google it, it is easy to compile). I suspect that you are simply
running out of memory by running too many tasks at the same time on one
machine -- you would need to use more machines so the memory usage on
any one of them is smaller.
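For example (these are just standard Linux commands, nothing Wien2k-specific),
on the node where lapw1 is running you could do:

free -m                        # total / used / free memory in MB
grep -i mem /proc/meminfo      # MemTotal, MemFree, ...
top -b -n 1 | head -20         # snapshot of the largest processes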
b) stacksize issue (less likely)
This is an issue with openmpi, see
http://www.open-mpi.org/community/lists/users/2008/09/6491.php . In a
nutshell, the stacksize limit is not an environment variable, so there is
no direct way to get openmpi to set it correctly on the remote nodes except
to use a wrapper. I have a patch for this, but let's try something simpler
first (which I think is OK, but I might have it slightly wrong).
* Create a file called wrap.sh in your search path (e.g. ~/bin or even
$WIENROOT) and put the following in it:
#!/bin/bash
source $HOME/.bashrc
# remove the stack size limit in the shell that runs the MPI executable
ulimit -s unlimited
# write a line so we know we got here
echo "Hello Fhorkul"
# run the command handed over by mpirun (only the first four arguments)
$1 $2 $3 $4
* Do a "chmod a+x wrap.sh" (in the appropriate location, of course)
* Edit parallel_options in $WIENROOT so it reads
setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_
-machinefile _HOSTS_ wrap.sh _EXEC_"
This does the same as what is described in the email linked above: it forces
the Wien2k mpi commands to be executed from within a bash shell so that the
limits and environment are set up. If this works then I can provide details
for a more general patch.
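As a quick sanity check (the test command here is arbitrary), you can run the
wrapper by hand on a node, e.g. "./wrap.sh hostname", which should print
"Hello Fhorkul" followed by the node name. A slightly more general variant of
the same idea -- just a sketch, not the patch mentioned above -- would forward
all arguments instead of only the first four:

#!/bin/bash
# illustrative variant of wrap.sh; passes every argument through to the command
source $HOME/.bashrc
ulimit -s unlimited
exec "$@"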
2010/1/30 Md. Fhokrul Islam <fislam at hotmail.com>:
> Hi Marks,
>
> I have followed your suggestions and have used openmpi 1.4.1 compiled with icc.
> I also have compiled fftw with cc instead of gcc and recompiled Wien2k with the
> mpirun option in parallel_options:
>
> current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_ -x
> LD_LIBRARY_PATH
>
> Although I didn't get a segmentation fault, the job still crashes at lapw1
> with a different error message. I have pasted case.dayfile and case.error
> below along with the ompi_info and stacksize info. I am not even sure where
> to look for the solution. Please let me know if you have any suggestions
> regarding this MPI problem.
>
> Thanks,
> Fhokrul
>
> case.dayfile:
>
> cycle 1 (Sat Jan 30 16:49:55 CET 2010) (200/99 to go)
>
>> lapw0 -p (16:49:55) starting parallel lapw0 at Sat Jan 30 16:49:56
>> CET 2010
> -------- .machine0 : 4 processors
> 1863.235u 21.743s 8:21.32 376.0% 0+0k 0+0io 1068pf+0w
>> lapw1 -c -up -p (16:58:17) starting parallel lapw1 at Sat Jan 30
>> 16:58:18 CET 2010
> -> starting parallel LAPW1 jobs at Sat Jan 30 16:58:18 CET 2010
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> mn117.mpi mn117.mpi mn117.mpi mn117.mpi(1) 1263.782u 28.214s 36:47.58
> 58.5% 0+0k 0+0io 49300pf+0w
> ** LAPW1 crashed!
> 1266.358u 37.286s 36:53.31 58.8% 0+0k 0+0io 49425pf+0w
> error: command /disk/global/home/eishfh/Wien2k_09_2/lapw1cpara -up -c
> uplapw1.def failed
>
> Error file:
>
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 8837 on node mn117.local exited
> on signal 9 (Killed).
>
> stacksize:
>
> [eishfh at milleotto s110]$ ulimit -a
>
> file locks (-x) unlimited
>
>
--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.