Most probably you have a problem in the initial setup; an MPI problem is less likely.
Please verify first that this particular case runs in non-MPI, k-point-parallel mode.
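For such a test, something along these lines in .machines should do it (a minimal sketch, assuming four k-point-parallel jobs on node046; note that four hosts on a single "1:" line request one 4-way MPI job, whereas four separate "1:" lines request four independent non-MPI jobs):

    1:node046.cm.cluster
    1:node046.cm.cluster
    1:node046.cm.cluster
    1:node046.cm.cluster
    granularity:1
    extrafine:1

Without a "lapw0:" line, lapw0 runs serially, and the k-points are distributed over the four jobs via ssh, so MPI never enters.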
If it does, please check the mailing-list history for openmpi. You need to compile correctly, use a recent enough openmpi version, and avoid the known issue of openmpi not exporting environment variables to the remote processes.
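For the last point, one common fix (a sketch, assuming Open MPI's mpirun; _NP_, _HOSTS_ and _EXEC_ are WIEN2k's own placeholders, substituted by the *para scripts) is to export the relevant variables explicitly in $WIENROOT/parallel_options:

    # check which Open MPI the nodes actually pick up
    mpirun --version
    ompi_info | head

    # in $WIENROOT/parallel_options (csh syntax):
    # -x exports the named environment variable to all remote ranks
    setenv WIEN_MPIRUN "mpirun -x PATH -x LD_LIBRARY_PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"

If the Intel/MKL runtime libraries are not found on the remote side, segmentation faults right at LAPW1 startup, as in your dayfile, are a typical symptom.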
On Jan 18, 2011 5:07 PM, "Laurent CHAPUT" <Laurent.Chaput@ijl.nancy-universite.fr> wrote:
> Dear Wien2k users
>
> I am experiencing some problems trying to run an MPI calculation on our cluster. I am using version 10.1 (Release 7/6/2010) with openmpi and the Intel compiler. I end up with errors in the dayfile and in the error file (see below).
> Here is my .machines file:
>
> lapw0:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
> 1:node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
> granularity:1
> extrafine:1
>
> I would appreciate any help.
> Thanks in advance,
> L. Chaput
>
>
>> lapw0 -p (23:43:38) starting parallel lapw0 at Tue Jan 18 23:43:38 CET 2011
> -------- .machine0 : 4 processors
> 3.906u 0.165s 0:02.14 189.7% 0+0k 0+0io 24pf+0w
>> lapw1 -p (23:43:40) starting parallel lapw1 at Tue Jan 18 23:43:40 CET 2011
> -> starting parallel LAPW1 jobs at Tue Jan 18 23:43:40 CET 2011
> Tue Jan 18 23:43:40 CET 2011 -> Setting up case bi for parallel execution
> Tue Jan 18 23:43:40 CET 2011 -> of LAPW1
> Tue Jan 18 23:43:40 CET 2011 ->
> running LAPW1 in parallel mode (using .machines)
> Granularity set to 1
> Extrafine set
> Tue Jan 18 23:43:40 CET 2011 -> klist: 116
> Tue Jan 18 23:43:40 CET 2011 -> machines: node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster
> Tue Jan 18 23:43:40 CET 2011 -> procs: 1
> Tue Jan 18 23:43:40 CET 2011 -> weigh(old): 1
> Tue Jan 18 23:43:40 CET 2011 -> sumw: 1
> Tue Jan 18 23:43:40 CET 2011 -> granularity: 1
> Tue Jan 18 23:43:40 CET 2011 -> weigh(new): 116
> Tue Jan 18 23:43:40 CET 2011 -> Splitting bi.klist.tmp into junks
> .machinetmp
> 1 number_of_parallel_jobs
> prepare 1 on node046.cm.cluster
> Tue Jan 18 23:43:40 CET 2011 -> Creating klist 1
> waiting for all processes to complete
> Tue Jan 18 23:43:42 CET 2011 -> all processes done.
> Tue Jan 18 23:43:43 CET 2011 -> CPU TIME summary:
> Tue Jan 18 23:43:43 CET 2011 -> ================
> node046.cm.cluster node046.cm.cluster node046.cm.cluster node046.cm.cluster(116) Child id 3 SIGSEGV, contact developers
> Child id 1 SIGSEGV, contact developers
> Child id 2 SIGSEGV, contact developers
> Child id 0 SIGSEGV, contact developers
> 0.080u 0.077s 0:01.13 13.2% 0+0k 0+0io 16pf+0w
> Summary of lapw1para:
> node046.cm.cluster k=0 user=0 wallclock=6960
> 0.122u 0.397s 0:03.22 15.8% 0+0k 0+0io 16pf+0w
>> lapw2 -p (23:43:43) running LAPW2 in parallel mode
> ** LAPW2 crashed!
> 0.029u 0.085s 0:00.12 83.3% 0+0k 0+0io 0pf+0w
> error: command /CALCULS/lchaput/code/wien2k/lapw2para lapw2.def failed
>
>> stop error
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> And this in the error file.
>
> LAPW0 END
> LAPW0 END
> LAPW0 END
> LAPW0 END
> .machinetmp222: No such file or directory
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> w2k_dispatch_signal(): received: Segmentation fault
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
> with errorcode 8292600.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 1 with PID 24638 on
> node node046 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [node046:24635] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
> [node046:24635] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> w2k_dispatch_signal(): received: Terminated
> bi.scf1_1: No such file or directory.
> FERMI - Error
> cp: cannot stat `.in.tmp': No such file or directory
> rm: cannot remove `.in.tmp': No such file or directory
> rm: cannot remove `.in.tmp1': No such file or directory