[Wien] lapw2 crash
Peter Blaha
pblaha at theochem.tuwien.ac.at
Mon Dec 22 14:08:49 CET 2008
No, WIEN should not use ssh when USE_REMOTE is set to 0.
Try running "x lapw2 -p -up" by hand.
Check the lapw2para_lapw script.
You may also add -x to the header line of lapw2para_lapw. This produces long
debug output from the shell script that starts the Fortran jobs.
Actually, lapw2para first calls
lapw2 uplapw2.def N with the "-fermi" option, i.e. not an MPI binary,
and only then does it start the lapw2_mpi executables.
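
The -x suggestion above can be tried on a private copy instead of the installed
script. What follows is only a sketch: it assumes the first line of
lapw2para_lapw is a csh shebang such as "#!/bin/csh -f" (check your copy), and
enable_trace is a made-up helper name, not part of WIEN2k.

```shell
#!/bin/sh
# Sketch: write a copy of a csh script with tracing (-x) enabled.
# Assumption: line 1 of the source is a csh shebang, e.g. "#!/bin/csh -f".
# enable_trace is a hypothetical helper, not a WIEN2k command.

enable_trace() {
    src=$1   # original script, left untouched
    dst=$2   # traced copy to create
    # Only line 1 is rewritten:
    #   "#!/bin/csh -f" becomes "#!/bin/csh -xf"
    #   "#!/bin/csh"    becomes "#!/bin/csh -x"
    # Any other first line is copied unchanged.
    sed -e '1s|^\(#!.*csh\) -f$|\1 -xf|' \
        -e '1s|^\(#!.*csh\)$|\1 -x|' "$src" > "$dst"
}
```

For example, assuming $WIENROOT points at the WIEN installation directory:
enable_trace $WIENROOT/lapw2para_lapw ./lapw2para_debug, then
chmod +x ./lapw2para_debug. The copy could then be started with the same
arguments the dayfile shows (-up uplapw2.def); whether it behaves identically
when run from outside the installation directory has not been checked here.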
Scott Beardsley wrote:
> I haven't seen any activity on this thread so I'll ask in a different
> way... I suspect that this might be related to my integration with grid
> engine (or openmpi). We are using OpenMPI with tight integration with
> Grid Engine. This means that mpi-enabled binaries are actually started
> by grid engine not ssh. I've noticed that when I use qlogin (an
> interactive session) I can get the run to complete successfully but when
> I use qsub (a non-interactive session) I get a failure at LAPW2. I'm not
> sure but I suspect there are some environment variables that are missing
> when I use qsub or that wien is ssh'ing to the remote node manually.
> Here is my parallel_options file:
>
> setenv USE_REMOTE 0
> setenv WIEN_GRANULARITY 1
> setenv WIEN_MPIRUN "mpirun _EXEC_"
>
> Here is my .machines file:
>
> lapw0: icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
> 1: icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
> granularity:1
> extrafine:1
>
> Here is my dayfile:
>
>
> Calculating LSDA_2 in /home/sbeards/LSDA_2
> on icompute-3-23.local with PID 17474
>
> start (Fri Dec 19 14:38:08 PST 2008) with lapw0 (40/99 to go)
>
> cycle 1 (Fri Dec 19 14:38:08 PST 2008) (40/99 to go)
>
> > lapw0 -p (14:38:08) starting parallel lapw0 at Fri Dec 19 14:38:09
> PST 2008
> Fri Dec 19 14:38:09 PST 2008 -> Setting up case LSDA_2 for parallel
> execution
> Fri Dec 19 14:38:09 PST 2008 -> of lapw0
> Fri Dec 19 14:38:09 PST 2008 ->
> -------- .machine0 : 16 processors
> icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
> LOADING icompute-3-23.local:.bashrc
> LOADING icompute-3-40.local:.bashrc
> LOADING icompute-1-36.local:.bashrc
> LOADING icompute-1-35.local:.bashrc
> Fri Dec 19 14:38:19 PST 2008 -> all processes done.
> Fri Dec 19 14:38:20 PST 2008 -> CPU TIME summary:
> Fri Dec 19 14:38:20 PST 2008 -> ================
> 0.064u 0.351s 0:11.00 3.7% 0+0k 0+0io 15pf+0w
> > lapw1 -up -p (14:38:20) starting parallel lapw1 at Fri Dec 19
> 14:38:21 PST 2008
> -> starting parallel LAPW1 jobs at Fri Dec 19 14:38:21 PST 2008
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
> icompute-1-35 icompute-1-35 icompute-1-35 icompute-1-35
> icompute-1-36 icompute-1-36 icompute-1-36 icompute-1-36 icompute-3-23
> icompute-3-23 icompute-3-23 icompute-3-23 icompute-3-40 icompute-3-40
> icompute-3-40 icompute-3-40(59) LOADING icompute-3-23.local:.bashrc
> LOADING icompute-3-40.local:.bashrc
> LOADING icompute-1-35.local:.bashrc
> LOADING icompute-1-36.local:.bashrc
> Using 16 processors
> scalapack processors array (row,col): 4 4
> 0.030u 0.036s 0:28.39 0.2% 0+0k 0+0io 0pf+0w
> Summary of lapw1para:
> icompute-1-35 k=0 user=0 wallclock=0
> 0.169u 0.832s 0:33.48 2.9% 0+0k 0+0io 5pf+0w
> > lapw1 -dn -p (14:38:54) starting parallel lapw1 at Fri Dec 19
> 14:38:55 PST 2008
> -> starting parallel LAPW1 jobs at Fri Dec 19 14:38:55 PST 2008
> running LAPW1 in parallel mode (using .machines.help)
> 1 number_of_parallel_jobs
> icompute-1-35 icompute-1-35 icompute-1-35 icompute-1-35
> icompute-1-36 icompute-1-36 icompute-1-36 icompute-1-36 icompute-3-23
> icompute-3-23 icompute-3-23 icompute-3-23 icompute-3-40 icompute-3-40
> icompute-3-40 icompute-3-40(59) LOADING icompute-3-23.local:.bashrc
> LOADING icompute-3-40.local:.bashrc
> LOADING icompute-1-35.local:.bashrc
> LOADING icompute-1-36.local:.bashrc
> Using 16 processors
> scalapack processors array (row,col): 4 4
> 0.035u 0.039s 0:24.91 0.2% 0+0k 0+0io 0pf+0w
> Summary of lapw1para:
> icompute-1-35 k=0 user=0 wallclock=0
> 0.190u 0.882s 0:30.93 3.4% 0+0k 0+0io 0pf+0w
> > lapw2 -up -p (14:39:26) running LAPW2 in parallel mode
> ** LAPW2 crashed!
> 0.147u 0.395s 0:08.87 5.9% 0+0k 0+0io 1pf+0w
> error: command /share/apps/wien-2k_08/lapw2para -up uplapw2.def failed
>
> > stop error
>
> Any pointers would be greatly appreciated. For example... can I just run
> lapw2 manually? Is there a verbose logging option?
>
> Scott
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------