[Wien] lapw2 crash

Peter Blaha pblaha at theochem.tuwien.ac.at
Mon Dec 22 14:08:49 CET 2008


No, WIEN should not use ssh when USE_REMOTE is set to 0.

Try  "x lapw2 -p -up"
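
From the case directory you can also rerun only the failing step by hand and inspect the error files it leaves behind (a sketch; the *.error naming is the usual WIEN2k convention, and uplapw2.error is an assumed name for the spin-up step):

```
x lapw2 -p -up          # rerun only the failing step
grep -l . *.error       # list non-empty error files
cat uplapw2.error       # assumed file name for the spin-up lapw2 step
```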

Check the lapw2para_lapw script.
You may also add -x in the header of lapw2para_lapw. This will give a long
debug trace of the shell script that starts the Fortran jobs.
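
For example (a sketch using a stand-in file, since the real script lives in your WIEN2k installation directory and its header may differ), adding -x to the csh interpreter line turns on command tracing:

```shell
# Hypothetical sketch: enable csh command tracing in lapw2para_lapw by
# adding -x to its interpreter line. The usual header is "#!/bin/csh -f";
# with -x, every command the script executes is echoed as it runs.
# header.demo stands in for the real script here.
printf '#!/bin/csh -f\n' > header.demo
sed -i '1s|^#!/bin/csh -f$|#!/bin/csh -fx|' header.demo   # GNU sed in-place edit
head -1 header.demo
```

Remember to undo the change (or keep a backup copy) once the debugging session is over.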

Actually, lapw2para first calls

lapw2 uplapw2.def N   with the "-fermi" option, i.e. the serial (non-MPI) binary,

and only then does it start the   lapw2_mpi executables.
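
In shell terms the two-stage sequence is roughly the following (a simplified sketch, not the actual csh script, which also handles k-point splitting and error files; the mpirun line follows the WIEN_MPIRUN setting quoted below):

```
lapw2 uplapw2.def 1 -fermi || exit 1   # stage 1: serial -fermi step, no MPI
mpirun lapw2_mpi uplapw2.def 1         # stage 2: parallel lapw2_mpi jobs
```

Since stage 1 involves no MPI at all, a failure there points away from an OpenMPI/Grid Engine integration problem.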



Scott Beardsley schrieb:
> I haven't seen any activity on this thread so I'll ask in a different 
> way... I suspect that this might be related to my integration with grid 
> engine (or openmpi). We are using OpenMPI with tight integration with 
> Grid Engine. This means that mpi-enabled binaries are actually started 
> by grid engine not ssh. I've noticed that when I use qlogin (an 
> interactive session) I can get the run to complete successfully but when 
> I use qsub (a non-interactive session) I get a failure at LAPW2. I'm not 
> sure but I suspect there are some environment variables that are missing 
> when I use qsub or that wien is ssh'ing to the remote node manually. 
> Here is my parallel_options file:
> 
> setenv USE_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun _EXEC_"
> 
> Here is my .machines file:
> 
> lapw0: icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
> 1: icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
> granularity:1
> extrafine:1
> 
> Here is my dayfile:
> 
> 
> Calculating LSDA_2 in /home/sbeards/LSDA_2
> on icompute-3-23.local with PID 17474
> 
>      start 	(Fri Dec 19 14:38:08 PST 2008) with lapw0 (40/99 to go)
> 
>      cycle 1 	(Fri Dec 19 14:38:08 PST 2008) 	(40/99 to go)
> 
>  >   lapw0 -p	(14:38:08) starting parallel lapw0 at Fri Dec 19 14:38:09 
> PST 2008
> Fri Dec 19 14:38:09 PST 2008 -> Setting up case LSDA_2 for parallel 
> execution
> Fri Dec 19 14:38:09 PST 2008 -> of lapw0
> Fri Dec 19 14:38:09 PST 2008 ->
> -------- .machine0 : 16 processors
> icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
> LOADING icompute-3-23.local:.bashrc
> LOADING icompute-3-40.local:.bashrc
> LOADING icompute-1-36.local:.bashrc
> LOADING icompute-1-35.local:.bashrc
> Fri Dec 19 14:38:19 PST 2008 -> all processes done.
> Fri Dec 19 14:38:20 PST 2008 -> CPU TIME summary:
> Fri Dec 19 14:38:20 PST 2008 -> ================
> 0.064u 0.351s 0:11.00 3.7%	0+0k 0+0io 15pf+0w
>  >   lapw1  -up -p  	(14:38:20) starting parallel lapw1 at Fri Dec 19 
> 14:38:21 PST 2008
> ->  starting parallel LAPW1 jobs at Fri Dec 19 14:38:21 PST 2008
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
>       icompute-1-35 icompute-1-35 icompute-1-35 icompute-1-35 
> icompute-1-36 icompute-1-36 icompute-1-36 icompute-1-36 icompute-3-23 
> icompute-3-23 icompute-3-23 icompute-3-23 icompute-3-40 icompute-3-40 
> icompute-3-40 icompute-3-40(59) LOADING icompute-3-23.local:.bashrc
> LOADING icompute-3-40.local:.bashrc
> LOADING icompute-1-35.local:.bashrc
> LOADING icompute-1-36.local:.bashrc
> Using   16 processors
> scalapack processors array (row,col):   4   4
> 0.030u 0.036s 0:28.39 0.2%	0+0k 0+0io 0pf+0w
>     Summary of lapw1para:
>     icompute-1-35	 k=0	 user=0	 wallclock=0
> 0.169u 0.832s 0:33.48 2.9%	0+0k 0+0io 5pf+0w
>  >   lapw1  -dn -p  	(14:38:54) starting parallel lapw1 at Fri Dec 19 
> 14:38:55 PST 2008
> ->  starting parallel LAPW1 jobs at Fri Dec 19 14:38:55 PST 2008
> running LAPW1 in parallel mode (using .machines.help)
> 1 number_of_parallel_jobs
>       icompute-1-35 icompute-1-35 icompute-1-35 icompute-1-35 
> icompute-1-36 icompute-1-36 icompute-1-36 icompute-1-36 icompute-3-23 
> icompute-3-23 icompute-3-23 icompute-3-23 icompute-3-40 icompute-3-40 
> icompute-3-40 icompute-3-40(59) LOADING icompute-3-23.local:.bashrc
> LOADING icompute-3-40.local:.bashrc
> LOADING icompute-1-35.local:.bashrc
> LOADING icompute-1-36.local:.bashrc
> Using   16 processors
> scalapack processors array (row,col):   4   4
> 0.035u 0.039s 0:24.91 0.2%	0+0k 0+0io 0pf+0w
>     Summary of lapw1para:
>     icompute-1-35	 k=0	 user=0	 wallclock=0
> 0.190u 0.882s 0:30.93 3.4%	0+0k 0+0io 0pf+0w
>  >   lapw2 -up -p 	(14:39:26) running LAPW2 in parallel mode
> **  LAPW2 crashed!
> 0.147u 0.395s 0:08.87 5.9%	0+0k 0+0io 1pf+0w
> error: command   /share/apps/wien-2k_08/lapw2para -up uplapw2.def   failed
> 
>  >   stop error
> 
> Any pointers would be greatly appreciated. For example... can I just run 
> lapw2 manually? Is there a verbose logging option?
> 
> Scott
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------

