[Wien] lapw2 crash

Scott Beardsley scott at cse.ucdavis.edu
Fri Dec 19 23:43:16 CET 2008


I haven't seen any activity on this thread, so I'll ask in a different 
way. I suspect the crash is related to my integration with Grid Engine 
(or OpenMPI). We use OpenMPI with tight Grid Engine integration, which 
means that MPI-enabled binaries are started by Grid Engine, not by ssh. 
When I use qlogin (an interactive session) the run completes 
successfully, but when I use qsub (a non-interactive session) it fails 
at LAPW2. I'm not sure, but I suspect that either some environment 
variables are missing under qsub or WIEN2k is ssh'ing to the remote 
nodes itself. 
Here is my parallel_options file:

setenv USE_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun _EXEC_"

Here is my .machines file:

lapw0: icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
1: icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
granularity:1
extrafine:1
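
As a sanity check on the host list, here is a minimal Python sketch that sums the host:count entries on each line of the .machines file above. It is not part of WIEN2k, and the function name count_slots is my own placeholder. It assumes host lines start with either "lapw0" or a numeric weight followed by a colon, which matches this file but may not cover every .machines syntax.

```python
MACHINES = """\
lapw0: icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
1: icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
granularity:1
extrafine:1
"""


def count_slots(machines_text):
    """Return a list of (label, total_slots), one entry per host line.

    Assumption: a host line looks like 'lapw0: host:n ...' or
    '<weight>: host:n ...'. Option lines such as 'granularity:1'
    are skipped.
    """
    result = []
    for line in machines_text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if not sep or not (label == "lapw0" or label.isdigit()):
            continue  # blank line, comment, or option line
        slots = 0
        for entry in rest.split():
            _host, has_count, count = entry.partition(":")
            # An entry without an explicit ':n' is assumed to mean one slot.
            slots += int(count) if has_count else 1
        result.append((label, slots))
    return result


if __name__ == "__main__":
    for label, slots in count_slots(MACHINES):
        print(label, slots)
```

For the file above this gives 16 slots for the lapw0 line and 16 for the k-point line, which agrees with the "16 processors" reported in my dayfile.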

Here is my dayfile:


Calculating LSDA_2 in /home/sbeards/LSDA_2
on icompute-3-23.local with PID 17474

     start 	(Fri Dec 19 14:38:08 PST 2008) with lapw0 (40/99 to go)

     cycle 1 	(Fri Dec 19 14:38:08 PST 2008) 	(40/99 to go)

 >   lapw0 -p	(14:38:08) starting parallel lapw0 at Fri Dec 19 14:38:09 PST 2008
Fri Dec 19 14:38:09 PST 2008 -> Setting up case LSDA_2 for parallel execution
Fri Dec 19 14:38:09 PST 2008 -> of lapw0
Fri Dec 19 14:38:09 PST 2008 ->
-------- .machine0 : 16 processors
icompute-1-35:4 icompute-1-36:4 icompute-3-23:4 icompute-3-40:4
LOADING icompute-3-23.local:.bashrc
LOADING icompute-3-40.local:.bashrc
LOADING icompute-1-36.local:.bashrc
LOADING icompute-1-35.local:.bashrc
Fri Dec 19 14:38:19 PST 2008 -> all processes done.
Fri Dec 19 14:38:20 PST 2008 -> CPU TIME summary:
Fri Dec 19 14:38:20 PST 2008 -> ================
0.064u 0.351s 0:11.00 3.7%	0+0k 0+0io 15pf+0w
 >   lapw1  -up -p  	(14:38:20) starting parallel lapw1 at Fri Dec 19 14:38:21 PST 2008
->  starting parallel LAPW1 jobs at Fri Dec 19 14:38:21 PST 2008
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
      icompute-1-35 icompute-1-35 icompute-1-35 icompute-1-35 icompute-1-36 icompute-1-36 icompute-1-36 icompute-1-36 icompute-3-23 icompute-3-23 icompute-3-23 icompute-3-23 icompute-3-40 icompute-3-40 icompute-3-40 icompute-3-40(59) LOADING icompute-3-23.local:.bashrc
LOADING icompute-3-40.local:.bashrc
LOADING icompute-1-35.local:.bashrc
LOADING icompute-1-36.local:.bashrc
Using   16 processors
scalapack processors array (row,col):   4   4
0.030u 0.036s 0:28.39 0.2%	0+0k 0+0io 0pf+0w
    Summary of lapw1para:
    icompute-1-35	 k=0	 user=0	 wallclock=0
0.169u 0.832s 0:33.48 2.9%	0+0k 0+0io 5pf+0w
 >   lapw1  -dn -p  	(14:38:54) starting parallel lapw1 at Fri Dec 19 14:38:55 PST 2008
->  starting parallel LAPW1 jobs at Fri Dec 19 14:38:55 PST 2008
running LAPW1 in parallel mode (using .machines.help)
1 number_of_parallel_jobs
      icompute-1-35 icompute-1-35 icompute-1-35 icompute-1-35 icompute-1-36 icompute-1-36 icompute-1-36 icompute-1-36 icompute-3-23 icompute-3-23 icompute-3-23 icompute-3-23 icompute-3-40 icompute-3-40 icompute-3-40 icompute-3-40(59) LOADING icompute-3-23.local:.bashrc
LOADING icompute-3-40.local:.bashrc
LOADING icompute-1-35.local:.bashrc
LOADING icompute-1-36.local:.bashrc
Using   16 processors
scalapack processors array (row,col):   4   4
0.035u 0.039s 0:24.91 0.2%	0+0k 0+0io 0pf+0w
    Summary of lapw1para:
    icompute-1-35	 k=0	 user=0	 wallclock=0
0.190u 0.882s 0:30.93 3.4%	0+0k 0+0io 0pf+0w
 >   lapw2 -up -p 	(14:39:26) running LAPW2 in parallel mode
**  LAPW2 crashed!
0.147u 0.395s 0:08.87 5.9%	0+0k 0+0io 1pf+0w
error: command   /share/apps/wien-2k_08/lapw2para -up uplapw2.def   failed

 >   stop error

Any pointers would be greatly appreciated. For example, can I just run 
lapw2 manually, and is there a verbose logging option?
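
To test the environment-variable suspicion myself, I could dump the output of env once from a qlogin session and once from inside a qsub job script, then compare the two. Below is a minimal Python sketch of that comparison. The file names env.qlogin and env.qsub and the function names are my own placeholders, not anything from WIEN2k or Grid Engine. It also assumes every variable fits on one NAME=value line, so multi-line values would be parsed incorrectly.

```python
import sys


def parse_env(text):
    """Parse `env` output (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(interactive, batch):
    """Compare two environment dicts.

    Returns (missing, changed): names set only in the interactive
    session, and names set in both but with different values.
    """
    missing = sorted(set(interactive) - set(batch))
    changed = sorted(
        name
        for name in set(interactive) & set(batch)
        if interactive[name] != batch[name]
    )
    return missing, changed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python envdiff.py env.qlogin env.qsub
    with open(sys.argv[1]) as f:
        qlogin_env = parse_env(f.read())
    with open(sys.argv[2]) as f:
        qsub_env = parse_env(f.read())
    missing, changed = diff_env(qlogin_env, qsub_env)
    for name in missing:
        print("missing under qsub:", name)
    for name in changed:
        print("differs:", name)
        print("  qlogin:", qlogin_env[name])
        print("  qsub:  ", qsub_env[name])
```

Variables such as PATH or LD_LIBRARY_PATH showing up as missing or different under qsub would support the environment theory. An identical environment would point more toward the ssh explanation.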

Scott

