Marks,

Thanks again for your quick reply. You are probably right that it's a memory
problem, since the system I am using for testing my jobs has very little
memory (only 1 GB per processor). I will try to run the job on a better
machine (4 GB per processor) that is available in our system.

Best,
Fhokrul


> Date: Sat, 30 Jan 2010 16:11:07 -0600
> From: L-marks@northwestern.edu
> To: wien@zeus.theochem.tuwien.ac.at
> Subject: Re: [Wien] Fwd: MPI segmentation fault
>
> OK, it looks like you have cleaned up many of the issues. The SIGSEGV is
> (I think) now one of two things:
>
> a) Memory limitations (how much do you have, 8 GB or 16-24 GB?)
>
> While the process is running, do a "top" and see how much memory is
> allocated and whether this is essentially all of it. If you have ganglia
> available, you can use it to see this readily. Similar information is
> also available from "cat /proc/meminfo" or from the nmon utility from IBM
> (google it, it is easy to compile). I suspect that you are simply running
> out of memory by running too many tasks at the same time on one machine --
> you would need to use more machines so that the memory usage on any one
> of them is smaller.
>
> b) Stacksize issue (less likely)
>
> This is an issue with openmpi, see
> http://www.open-mpi.org/community/lists/users/2008/09/6491.php . In a
> nutshell, the stacksize limit is not an environment variable, and there
> is no direct way to set it correctly with openmpi except to use a
> wrapper. I have a patch for this, but let's try something simpler first
> (which I think is OK, but I might have it slightly wrong).
>
> * Create a file called wrap.sh in your search path (e.g. ~/bin or even
>   $WIENROOT) and put in it:
>
>   #!/bin/bash
>   source $HOME/.bashrc
>   ulimit -s unlimited
>   # write a line so we know we got here
>   echo "Hello Fhokrul"
>   $1 $2 $3 $4
>
> * Do a "chmod a+x wrap.sh" (in the appropriate location, of course).
>
> * Edit parallel_options in $WIENROOT so it reads:
>
>   setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ wrap.sh _EXEC_"
>
> This does the same as is described in the email linked above, forcing the
> Wien2k mpi commands to be executed from within a bash shell so that the
> parameters are set up. If this works then I can provide details for a
> more general patch.
>
>
> 2010/1/30 Md. Fhokrul Islam <fislam@hotmail.com>:
> > Hi Marks,
> >
> > I have followed your suggestions and have used openmpi 1.4.1 compiled
> > with icc. I have also compiled fftw with cc instead of gcc and
> > recompiled Wien2k with the mpirun option in parallel_options:
> >
> >   current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_ -x LD_LIBRARY_PATH
> >
> > Although I didn't get a segmentation fault, the job still crashes at
> > lapw1 with a different error message. I have pasted case.dayfile and
> > case.error below along with the ompi_info and stacksize info. I am not
> > even sure where to look for the solution.
> > Please let me know if you have any suggestions regarding this MPI
> > problem.
> >
> > Thanks,
> > Fhokrul
> >
> > case.dayfile:
> >
> >     cycle 1     (Sat Jan 30 16:49:55 CET 2010)  (200/99 to go)
> >
> > >   lapw0 -p    (16:49:55) starting parallel lapw0 at Sat Jan 30 16:49:56 CET 2010
> > -------- .machine0 : 4 processors
> > 1863.235u 21.743s 8:21.32 376.0%  0+0k 0+0io 1068pf+0w
> > >   lapw1 -c -up -p    (16:58:17) starting parallel lapw1 at Sat Jan 30 16:58:18 CET 2010
> > ->  starting parallel LAPW1 jobs at Sat Jan 30 16:58:18 CET 2010
> > running LAPW1 in parallel mode (using .machines)
> > 1 number_of_parallel_jobs
> >     mn117.mpi mn117.mpi mn117.mpi mn117.mpi(1)  1263.782u 28.214s 36:47.58 58.5%  0+0k 0+0io 49300pf+0w
> > **  LAPW1 crashed!
> > 1266.358u 37.286s 36:53.31 58.8%  0+0k 0+0io 49425pf+0w
> > error: command /disk/global/home/eishfh/Wien2k_09_2/lapw1cpara -up -c uplapw1.def failed
> >
> > Error file:
> >
> > LAPW0 END
> > LAPW0 END
> > LAPW0 END
> > LAPW0 END
> > --------------------------------------------------------------------------
> > mpirun noticed that process rank 0 with PID 8837 on node mn117.local
> > exited on signal 9 (Killed).
> >
> > stacksize:
> >
> > [eishfh@milleotto s110]$ ulimit -a
> >
> > file locks                      (-x) unlimited
> >
> >
> --
> Laurence Marks
> Department of Materials Science and Engineering
> MSE Rm 2036 Cook Hall
> 2220 N Campus Drive
> Northwestern University
> Evanston, IL 60208, USA
> Tel: (847) 491-3996 Fax: (847) 491-7820
> email: L-marks at northwestern dot edu
> Web: www.numis.northwestern.edu
> Chair, Commission on Electron Crystallography of IUCR
> www.numis.northwestern.edu/
> Electron crystallography is the branch of science that uses electron
> scattering and imaging to study the structure of matter.
> _______________________________________________
> Wien mailing list
> Wien@zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
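
A minimal sketch of the memory check described in point (a) of the quoted
message, assuming a Linux node with GNU ps; the process name lapw1c_mpi is
only an example here and should be adjusted to whatever binary is actually
running on the node:

    #!/bin/bash
    # Rough check of node memory while the parallel job is running.
    # "lapw1c_mpi" is an example process name, not taken from the thread.
    while true; do
        date
        grep -E 'MemTotal|MemFree|SwapFree' /proc/meminfo
        ps -C lapw1c_mpi -o pid,rss,vsz,comm
        sleep 30
    done

If MemFree collapses to nearly nothing shortly before the crash, the
"exited on signal 9 (Killed)" message in the error file above would be
consistent with the out-of-memory scenario suspected in (a).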
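
The "more general patch" mentioned in point (b) is not shown in this thread;
the following is only a guess at what a more general wrap.sh might look like,
using exec "$@" so that any number of arguments is passed through unchanged:

    #!/bin/bash
    # wrap.sh -- start the real MPI executable from inside a bash shell so
    # that the shell limits (here the stack size) are set on every node.
    source $HOME/.bashrc
    ulimit -s unlimited     # lift the stack size limit for this process tree
    echo "wrap.sh on $(hostname), stacksize: $(ulimit -s)"
    exec "$@"               # hand over to _EXEC_ with all its arguments intact

Compared with the $1 $2 $3 $4 form in the quoted message, exec "$@" handles
any number of arguments, preserves their quoting, and replaces the wrapper
shell rather than leaving it running; the WIEN_MPIRUN line in
parallel_options stays exactly as quoted above.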