[Wien] Fwd: MPI segmentation fault
Md. Fhokrul Islam
fislam at hotmail.com
Sun Jan 31 00:25:58 CET 2010
Marks,
Thanks again for your quick reply. You are probably right that it's a memory problem, since
the system I am using for testing my jobs has very little memory (only 1GB per processor).
I will try to run the job on a better machine (4GB per processor) that is available on our system.
Best,
Fhokrul
> Date: Sat, 30 Jan 2010 16:11:07 -0600
> From: L-marks at northwestern.edu
> To: wien at zeus.theochem.tuwien.ac.at
> Subject: Re: [Wien] Fwd: MPI segmentation fault
>
> OK, looks like you have cleaned up many of the issues. The SIGSEGV is
> (I think) now one of two things:
>
> a) memory limitations (how much do you have, 8 GB or 16-24 GB?)
>
> While the process is running, run "top" and check how much memory is
> allocated and whether that is essentially all of it. If you have ganglia
> available you can use it to see this readily. Similar information is also
> available from "cat /proc/meminfo" or from the nmon utility from IBM
> (google it, it is easy to compile). I suspect that you are simply
> running out of memory because too many tasks are running at the same time
> on one machine. You would need to use more machines so that the memory
> usage on any one of them is smaller.
>
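A minimal sketch of the /proc/meminfo check mentioned above, assuming a Linux node. The function name meminfo_mib is hypothetical and is not part of Wien2k or openmpi; it only converts the kB values reported by the kernel into MiB. Note that MemFree does not count memory used for caches, so it can understate what is really available.

```shell
#!/bin/bash
# Hypothetical helper: print one field of a meminfo-style file in MiB.
# $1 = field name (e.g. MemTotal), $2 = file to read (defaults to /proc/meminfo).
meminfo_mib() {
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1024 }' "${2:-/proc/meminfo}"
}

# Report the totals only on machines that actually provide /proc/meminfo.
if [ -r /proc/meminfo ]; then
    echo "MemTotal: $(meminfo_mib MemTotal) MiB"
    echo "MemFree:  $(meminfo_mib MemFree) MiB"
fi
```

Running this on the compute node while lapw1 is active and comparing MemFree with MemTotal should give roughly the same picture as "top".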
> b) stacksize issue (less likely)
>
> This is an issue with openmpi, see
> http://www.open-mpi.org/community/lists/users/2008/09/6491.php . In a
> nutshell, the stacksize limit is not an environment variable, so
> there is no direct way to set it correctly with openmpi except to use
> a wrapper. I have a patch for this, but let's try something simpler
> first (which I think is OK, but I might have it slightly wrong).
>
> * Create a file called wrap.sh in your search path (e.g. ~/bin or even
> $WIENROOT) and put in it
> #!/bin/bash
> source $HOME/.bashrc
> ulimit -s unlimited
> # write a line so we know we got here
> echo "Hello Fhokrul"
> $1 $2 $3 $4
>
> * Run "chmod a+x wrap.sh" (in the appropriate location, of course)
>
> * Edit parallel_options in $WIENROOT so it reads
> setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ wrap.sh _EXEC_"
>
> This does the same as is described in the email link above, forcing
> the Wien2k mpi commands to be executed from within a bash shell so
> that the parameters are set up. If this works then I can provide details
> for a more general patch.
>
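The wrap.sh above passes on only its first four arguments ($1 $2 $3 $4). A slightly more defensive variant, offered as a sketch rather than as the patch mentioned above, forwards every argument with "$@" and reports the stack limit that is actually in effect. The function name run_with_big_stack is hypothetical, and the "source $HOME/.bashrc" line is left out here; add it back if your environment depends on it.

```shell
#!/bin/bash
# wrap.sh variant (sketch): raise the stack limit, then run the given command.

run_with_big_stack() {
    # Try to lift the soft stack limit; if the hard limit forbids it, carry on.
    ulimit -s unlimited 2>/dev/null || echo "wrap.sh: could not raise stack limit" >&2
    # Report the limit in effect on stderr so the program's own output stays clean.
    echo "wrap.sh: stack limit is $(ulimit -s)" >&2
    # "$@" forwards every argument unchanged, however many there are.
    "$@"
}

# Only act when a command was actually passed on the command line.
if [ "$#" -gt 0 ]; then
    run_with_big_stack "$@"
fi
```

The message on stderr plays the same role as the "Hello" line: if it appears in the job output, the wrapper was reached, and it also shows whether the limit was really raised on that node.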
>
> 2010/1/30 Md. Fhokrul Islam <fislam at hotmail.com>:
> > Hi Marks,
> >
> > I have followed your suggestions and have used openmpi 1.4.1 compiled
> > with icc. I have also compiled fftw with cc instead of gcc and recompiled
> > Wien2k with the following mpirun option in parallel_options:
> >
> > current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_ -x LD_LIBRARY_PATH
> >
> > Although I no longer get a segmentation fault, the job still crashes at
> > lapw1 with a different error message. I have pasted case.dayfile and
> > case.error below, along with the ompi_info and stacksize info. I am not
> > even sure where to look for the solution. Please let me know if you have
> > any suggestions regarding this MPI problem.
> >
> > Thanks,
> > Fhokrul
> >
> > case.dayfile:
> >
> > cycle 1 (Sat Jan 30 16:49:55 CET 2010) (200/99 to go)
> >
> >> lapw0 -p (16:49:55) starting parallel lapw0 at Sat Jan 30 16:49:56 CET 2010
> > -------- .machine0 : 4 processors
> > 1863.235u 21.743s 8:21.32 376.0% 0+0k 0+0io 1068pf+0w
> >> lapw1 -c -up -p (16:58:17) starting parallel lapw1 at Sat Jan 30 16:58:18 CET 2010
> > -> starting parallel LAPW1 jobs at Sat Jan 30 16:58:18 CET 2010
> > running LAPW1 in parallel mode (using .machines)
> > 1 number_of_parallel_jobs
> > mn117.mpi mn117.mpi mn117.mpi mn117.mpi(1) 1263.782u 28.214s 36:47.58 58.5% 0+0k 0+0io 49300pf+0w
> > ** LAPW1 crashed!
> > 1266.358u 37.286s 36:53.31 58.8% 0+0k 0+0io 49425pf+0w
> > error: command /disk/global/home/eishfh/Wien2k_09_2/lapw1cpara -up -c uplapw1.def failed
> >
> > Error file:
> >
> > LAPW0 END
> > LAPW0 END
> > LAPW0 END
> > LAPW0 END
> > --------------------------------------------------------------------------
> > mpirun noticed that process rank 0 with PID 8837 on node mn117.local exited on signal 9 (Killed).
> >
> > stacksize:
> >
> > [eishfh at milleotto s110]$ ulimit -a
> >
> > file locks (-x) unlimited
> >
> >
> --
> Laurence Marks
> Department of Materials Science and Engineering
> MSE Rm 2036 Cook Hall
> 2220 N Campus Drive
> Northwestern University
> Evanston, IL 60208, USA
> Tel: (847) 491-3996 Fax: (847) 491-7820
> email: L-marks at northwestern dot edu
> Web: www.numis.northwestern.edu
> Chair, Commission on Electron Crystallography of IUCR
> www.numis.northwestern.edu/
> Electron crystallography is the branch of science that uses electron
> scattering and imaging to study the structure of matter.