[Wien] lapw1_mpi - Segmentation fault

Brian R Smith brian at cypher.acomp.usf.edu
Thu May 26 18:34:52 CEST 2005


Hi all:

I've just built Wien2k05 on our Sun v880 using the following build
parameters:

FC = /opt/SUNWspro/bin/f90
MPF = /usr/local/v9b/mpi/bin/mpif90
CC = /opt/SUNWspro/bin/cc64
FOPT =  -Bstatic -fast -dalign -free -xarch=v9b
FPOPT = -Bstatic -fast -dalign -free -xarch=v9b
DParallel = '-DParallel'
FGEN = $(PARALLEL)
LDFLAGS = -L../SRC_lib -xarch=v9b -lmvec
R_LIBS = -xarch=v9b -xlic_lib=sunperf -lmvec
C_LIBS = $(R_LIBS)
RP_LIBS = $(R_LIBS) -L /usr/local/SCALAPACK -L /usr/local/BLACS/LIB -lscalapack -lblacsF77init -lblacs -L/opt/SUNWspro/prod/lib/v9b -xlic_lib=sunperf -lf77compat

I have built Wien2k for fine-grained parallel execution with MPICH and
SCALAPACK.  All serial cases work well, and initial testing showed that
parallel lapw1 worked for the few test cases I had tried.  Later, I was
told to run the Wien2k benchmark package from
http://www.wien2k.at/reg_user/benchmark/ in serial and in parallel.  It
is nothing more than a few input files meant to be run with lapw1.  In
serial, all is well, but in parallel, lapw1_mpi seg-faults.  Since
running this benchmark, all of my other cases crash as well.  I do not
know what state has changed since it last worked, but it is now broken
on every problem set.

I'll try to provide as much information as possible (apologies if this
is quite long):

(brs at hornet)--(~/test_case)
 $ cat test_case.struct
GaN-w
H   LATTICE,NONEQUIV.ATOMS: 16156_P3m1
MODE OF CALC=RELA unit=bohr
 12.052678 12.052678 19.596468 90.000000 90.000000120.000000
ATOM  -1: X=0.00000000 Y=0.00000000 Z=0.81250000
          MULT= 1          ISPLIT= 4
Ga         NPT=  781  R0=0.00010000 RMT=    1.9500   Z: 31.0
LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
                     0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
ATOM  -2: X=0.50000000 Y=0.00000000 Z=0.81250000
          MULT= 3          ISPLIT= 8
      -2: X=0.00000000 Y=0.50000000 Z=0.81250000
      -2: X=0.50000000 Y=0.50000000 Z=0.81250000
Ga         NPT=  781  R0=0.00010000 RMT=    1.9500   Z: 31.0
LOCAL ROT MATRIX:    0.0000000-0.5000000 0.8660254
                     0.0000000-0.8660254-0.5000000
                     1.0000000 0.0000000 0.0000000
ATOM  -3: X=0.00000000 Y=0.00000000 Z=0.31250000
          MULT= 1          ISPLIT= 4
Ga         NPT=  781  R0=0.00010000 RMT=    1.9500   Z: 31.0
LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
                     0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
ATOM  -4: X=0.50000000 Y=0.00000000 Z=0.31250000
          MULT= 3          ISPLIT= 8
      -4: X=0.00000000 Y=0.50000000 Z=0.31250000
      -4: X=0.50000000 Y=0.50000000 Z=0.31250000
Ga         NPT=  781  R0=0.00010000 RMT=    1.9500   Z: 31.0
LOCAL ROT MATRIX:    0.0000000-0.5000000 0.8660254
                     0.0000000-0.8660254-0.5000000
                     1.0000000 0.0000000 0.0000000
ATOM  -5: X=0.33333333 Y=0.16666667 Z=0.06250000
          MULT= 3          ISPLIT= 8
      -5: X=0.83333333 Y=0.16666666 Z=0.06250000
      -5: X=0.83333333 Y=0.66666667 Z=0.06250000
Ga         NPT=  781  R0=0.00010000 RMT=    1.9500   Z: 31.0
LOCAL ROT MATRIX:    0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
                     1.0000000 0.0000000 0.0000000
ATOM  -6: X=0.33333333 Y=0.66666667 Z=0.06250000
          MULT= 1          ISPLIT= 4
Ga         NPT=  781  R0=0.00010000 RMT=    1.9500   Z: 31.0
LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
                     0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
ATOM  -7: X=0.33333333 Y=0.16666667 Z=0.56250000
          MULT= 3          ISPLIT= 8
      -7: X=0.83333333 Y=0.16666666 Z=0.56250000
      -7: X=0.83333333 Y=0.66666667 Z=0.56250000
Ga         NPT=  781  R0=0.00010000 RMT=    1.9500   Z: 31.0
LOCAL ROT MATRIX:    0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
                     1.0000000 0.0000000 0.0000000
ATOM  -8: X=0.33333333 Y=0.66666667 Z=0.56250000
          MULT= 1          ISPLIT= 4
Ga         NPT=  781  R0=0.00010000 RMT=    1.9500   Z: 31.0
LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
                     0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
ATOM  -9: X=0.00000000 Y=0.00000000 Z=0.00000000
          MULT= 1          ISPLIT= 4
N ch       NPT=  781  R0=0.00010000 RMT=    1.6500   Z:  7.0
LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
                     0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
ATOM -10: X=0.50000000 Y=0.00000000 Z=0.00000000
          MULT= 3          ISPLIT= 8
     -10: X=0.00000000 Y=0.50000000 Z=0.00000000
     -10: X=0.50000000 Y=0.50000000 Z=0.00000000
N          NPT=  781  R0=0.00010000 RMT=    1.6500   Z:  7.0
LOCAL ROT MATRIX:    0.0000000-0.5000000 0.8660254
                     0.0000000-0.8660254-0.5000000
                     1.0000000 0.0000000 0.0000000
ATOM -11: X=0.00000000 Y=0.00000000 Z=0.50000000
          MULT= 1          ISPLIT= 4
N          NPT=  781  R0=0.00010000 RMT=    1.6500   Z:  7.0
LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
                     0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
ATOM -12: X=0.50000000 Y=0.00000000 Z=0.50000000
          MULT= 3          ISPLIT= 8
     -12: X=0.00000000 Y=0.50000000 Z=0.50000000
     -12: X=0.50000000 Y=0.50000000 Z=0.50000000
N          NPT=  781  R0=0.00010000 RMT=    1.6500   Z:  7.0
LOCAL ROT MATRIX:    0.0000000-0.5000000 0.8660254
                     0.0000000-0.8660254-0.5000000
                     1.0000000 0.0000000 0.0000000
ATOM -13: X=0.33333333 Y=0.16666667 Z=0.25000000
          MULT= 3          ISPLIT= 8
     -13: X=0.83333333 Y=0.16666666 Z=0.25000000
     -13: X=0.83333333 Y=0.66666667 Z=0.25000000
N          NPT=  781  R0=0.00010000 RMT=    1.6500   Z:  7.0
LOCAL ROT MATRIX:    0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
                     1.0000000 0.0000000 0.0000000
ATOM -14: X=0.33333333 Y=0.16666667 Z=0.75000000
          MULT= 3          ISPLIT= 8
     -14: X=0.83333333 Y=0.16666666 Z=0.75000000
     -14: X=0.83333333 Y=0.66666667 Z=0.75000000
N          NPT=  781  R0=0.00010000 RMT=    1.6500   Z:  7.0
LOCAL ROT MATRIX:    0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
                     1.0000000 0.0000000 0.0000000
ATOM -15: X=0.33333333 Y=0.66666667 Z=0.25000000
          MULT= 1          ISPLIT= 4
N          NPT=  781  R0=0.00010000 RMT=    1.6500   Z:  7.0
LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
                     0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
ATOM -16: X=0.33333333 Y=0.66666667 Z=0.75000000
          MULT= 1          ISPLIT= 4
N          NPT=  781  R0=0.00010000 RMT=    1.6500   Z:  7.0
LOCAL ROT MATRIX:    1.0000000 0.0000000 0.0000000
                     0.0000000 1.0000000 0.0000000
                     0.0000000 0.0000000 1.0000000
   6      NUMBER OF SYMMETRY OPERATIONS
 1 0 0 0.0000000
 0 1 0 0.0000000
 0 0 1 0.0000000
       1
 0-1 0 0.0000000
 1-1 0 0.0000000
 0 0 1 0.0000000
       2
-1 1 0 0.0000000
-1 0 0 0.0000000
 0 0 1 0.0000000
       3
 0-1 0 0.0000000
-1 0 0 0.0000000
 0 0 1 0.0000000
       4
-1 1 0 0.0000000
 0 1 0 0.0000000
 0 0 1 0.0000000
       5
 1 0 0 0.0000000
 1-1 0 0.0000000
 0 0 1 0.0000000
       6

(brs at hornet)--(~/test_case)
 $ cat lapw1.error
**  Error in Parallel LAPW1
**  LAPW1 STOPPED at Thu May 26 12:09:37 EDT 2005
**  check ERROR FILES!
Error in LAPW1

(brs at hornet)--(~/test_case)
 $ cat lapw1.def
 4,'test_case.klist',          'unknown','formatted',0
 5,'test_case.in1c',   'old',    'formatted',0
 6,'test_case.output1','unknown','formatted',0
10,'test_case.vector', 'unknown','unformatted',9000
11,'test_case.energy', 'unknown','formatted',0
18,'test_case.vsp',       'old',    'formatted',0
19,'test_case.vns',       'unknown','formatted',0
20,'test_case.struct',         'old',    'formatted',0
21,'test_case.scf1',   'unknown','formatted',0
55,'test_case.vec',            'unknown','formatted',0
71,'test_case.nsh',    'unknown','formatted',0
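
One cause that can be ruled out quickly is a missing input file. The following is a minimal Python sketch, not part of Wien2k: `parse_def` and `missing_old_files` are hypothetical helper names, and the sketch assumes the third field of each .def line is the status later handed to the Fortran OPEN statement, so that files marked 'old' must already exist. The same check can be pointed at lapw1_1.def, which is the file the parallel run actually reads.

```python
import os

# A few lines copied from the lapw1.def listing above.
SAMPLE_DEF = """\
 4,'test_case.klist',          'unknown','formatted',0
 5,'test_case.in1c',   'old',    'formatted',0
20,'test_case.struct',         'old',    'formatted',0
"""


def parse_def(text):
    """Return (unit, filename, status, form, recl) tuples from .def text."""
    entries = []
    for line in text.splitlines():
        fields = [field.strip().strip("'") for field in line.split(",")]
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        unit, name, status, form, recl = fields
        entries.append((int(unit), name, status, form, int(recl)))
    return entries


def missing_old_files(entries, directory="."):
    """List files declared 'old' that do not exist in the given directory."""
    missing = []
    for _unit, name, status, _form, _recl in entries:
        path = os.path.join(directory, name)
        if status.lower() == "old" and not os.path.exists(path):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_old_files(parse_def(SAMPLE_DEF)):
        print("missing input file:", name)
```

An empty result only shows that the required inputs are present; it says nothing about whether their contents are valid.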

(brs at hornet)--(~/test_case)
 $ cat lapw1_1.error
Error in LAPW1

(brs at hornet)--(~/test_case)
 $ cat test_case.klist_1
         1    0    0    0   15  1.0 -7.0  1.5       100 k, div: (  5  5  3)
END

The default error report is, IMHO, quite inadequate, so I traced through
the shell scripts to determine the exact cause of the problem:

I've added a -v to the #!/bin/csh directive in the following shell
scripts:

x, lapw1cpara

I've outlined the problem areas with *** and removed extraneous code for
brevity:

$ x lapw1 -c -p 

unalias rm

set running = ".running.$$.`hostname`.`date +%d%m%H%M%S`"
echo $$ > $running
onintr clear
alias error 'echo ">>> ($name) !* -> exit"; goto error'

set name = $0
set bin = /usr/local/wien2k-mpi
setenv WIENROOT "/usr/local/wien2k-mpi"
if ! ( -d $bin ) set bin = .
set name = $name:t

.
.
.

setenv USE_REMOTE 0
setenv WIEN_GRANULARITY 1
******************************************************************
setenv WIEN_MPIRUN "/usr/local/v9b/mpi/bin/mpirun -np _NP_ _EXEC_"
******************************************************************
endif

.
.
.

***********************************************
echo "** " Error in Parallel LAPW1 > $def.error
***********************************************

testinput .machines single
echo "starting parallel lapw1 at `date`"
starting parallel lapw1 at Thu May 26 12:27:45 EDT 2005
echo "starting parallel lapw1 at `date`" >> $log

.
.
.

set ttt= ( `echo $mpirun | sed -e "s^_NP_^$number_per_job[$p]^" -e "s^_EXEC_^$WIENROOT/${exe}_mpi ${def}_$loop.def^" -e "s^_HOSTS_^.machine[$p]^"` )
( cd $PWD ; $t $ttt ; rm -f .lock_$lockfile[$p] ) >> .time1_$loop &
[1] 1447
echo $t $ttt
***********************************************************
time /usr/local/v9b/mpi/bin/mpirun -np 4 /usr/local/wien2k-mpi/lapw1c_mpi lapw1_1.def
***********************************************************
endif
jobs -l > .lapw1${cmplx}para.$$.`hostname`
endif
@ p ++

.
.
.

while ( $p < = $proc )
sleep $sleepy
*********************************************************************
Segmentation Fault - core dumped
[1]  + Done                 ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
*********************************************************************

goto kloop
set p = 1
if ( $?residue && $?resok ) set p = 2
while ( $p < = $proc )

.
.
.

echo "** " LAPW1 crashed!
**  LAPW1 crashed!
echo "** " LAPW1 STOPPED at `date` >> $log
echo "** " check ERROR FILES! >> $log
echo "-----------------------------------------------------------------" >> $log
echo "** " Error in Parallel LAPW1 > $def.error
echo "** " LAPW1 STOPPED at `date` >> $def.error
echo "** " check ERROR FILES! >> $def.error
cat ${def}_*.error >> $def.error
echo "ERROR" > .lapw1para
rm $tmp $tmp2 > & /dev/null
rm .lapw1${cmplx}para.$$.`hostname` > & /dev/null
exit 1
19.0u 1.0s 0:09 205% 0+0k 0+0io 0pf+0w

clear:

if ( $?qtl || $?band || $?fermi || $?efg ) then
if ( -f $running ) rm $running
exit ( 0 )

+++++++++++++++++++++++++++++++++++++++++++++++++
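
The placeholder expansion that the sed call in the trace performs on WIEN_MPIRUN can be sketched in Python as a sanity check. This is only an illustration: `expand_mpirun` and its parameters are hypothetical names, not part of Wien2k, and the sketch assumes each placeholder is replaced once, as sed does without the g flag.

```python
def expand_mpirun(template, nproc, executable, def_file, hosts_file=None):
    """Fill the _NP_, _EXEC_ and (optionally) _HOSTS_ placeholders once each."""
    command = template.replace("_NP_", str(nproc), 1)
    command = command.replace("_EXEC_", f"{executable} {def_file}", 1)
    if hosts_file is not None:
        command = command.replace("_HOSTS_", hosts_file, 1)
    return command


if __name__ == "__main__":
    # Template and arguments taken from the trace above.
    template = "/usr/local/v9b/mpi/bin/mpirun -np _NP_ _EXEC_"
    print(expand_mpirun(template, 4,
                        "/usr/local/wien2k-mpi/lapw1c_mpi", "lapw1_1.def"))
```

With the values from the trace, the result is the same command line that the script echoes, which suggests the template expansion is not where things go wrong.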

One last test to verify my claim that lapw1_mpi is crashing:

$ /usr/local/v9b/mpi/bin/mpirun -np 4 /usr/local/wien2k-mpi/lapw1c_mpi lapw1_1.def
 Using  4  processors, My ID =  1
 Using  4  processors, My ID =  0
 Using  4  processors, My ID =  2
 Using  4  processors, My ID =  3
Segmentation Fault

Any help at all would be greatly appreciated.

-Brian

-- 
--
Brian R. Smith
Systems Engineer
Research Computing Core Facility, USF
Phone: 1(813)974-1467   Cell: 1(813)230-3441
Address: 4202 E Fowler Ave LIB 613
         Tampa, FL 33620
Web: http://rccf.acomp.usf.edu



