[Wien] lapw1_mpi - Segmentation fault
Brian R Smith
brian at cypher.acomp.usf.edu
Thu May 26 18:34:52 CEST 2005
Hi all:
I've just built Wien2k05 on our Sun v880 using the following build
parameters:
FC = /opt/SUNWspro/bin/f90
MPF = /usr/local/v9b/mpi/bin/mpif90
CC = /opt/SUNWspro/bin/cc64
FOPT = -Bstatic -fast -dalign -free -xarch=v9b
FPOPT = -Bstatic -fast -dalign -free -xarch=v9b
DParallel = '-DParallel'
FGEN = $(PARALLEL)
LDFLAGS = -L../SRC_lib -xarch=v9b -lmvec
R_LIBS = -xarch=v9b -xlic_lib=sunperf -lmvec
C_LIBS = $(R_LIBS)
RP_LIBS = $(R_LIBS) -L /usr/local/SCALAPACK -L /usr/local/BLACS/LIB -lscalapack -lblacsF77init -lblacs -L/opt/SUNWspro/prod/lib/v9b -xlic_lib=sunperf -lf77compat
I have built Wien2k for fine-grained parallel with MPICH and SCALAPACK.
All serial cases work well, and in initial testing parallel lapw1 worked
on the few test cases I tried. Later, I was told to run the Wien2k
benchmark package from http://www.wien2k.at/reg_user/benchmark/ in
serial and in parallel. It's nothing more than a few input files meant
to be run with lapw1. In serial all is well, but in parallel lapw1_mpi
seg-faults. Since running this benchmark, all of my other cases crash
as well. I do not know what state has changed since it last worked, but
it is now completely broken on all problem sets.
I'll try to provide as much information as possible (apologies if this
is quite long):
(brs at hornet)--(~/test_case)
$ cat test_case.struct
GaN-w
H LATTICE,NONEQUIV.ATOMS: 16156_P3m1
MODE OF CALC=RELA unit=bohr
12.052678 12.052678 19.596468 90.000000 90.000000120.000000
ATOM -1: X=0.00000000 Y=0.00000000 Z=0.81250000
MULT= 1 ISPLIT= 4
Ga NPT= 781 R0=0.00010000 RMT= 1.9500 Z: 31.0
LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
ATOM -2: X=0.50000000 Y=0.00000000 Z=0.81250000
MULT= 3 ISPLIT= 8
-2: X=0.00000000 Y=0.50000000 Z=0.81250000
-2: X=0.50000000 Y=0.50000000 Z=0.81250000
Ga NPT= 781 R0=0.00010000 RMT= 1.9500 Z: 31.0
LOCAL ROT MATRIX: 0.0000000-0.5000000 0.8660254
0.0000000-0.8660254-0.5000000
1.0000000 0.0000000 0.0000000
ATOM -3: X=0.00000000 Y=0.00000000 Z=0.31250000
MULT= 1 ISPLIT= 4
Ga NPT= 781 R0=0.00010000 RMT= 1.9500 Z: 31.0
LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
ATOM -4: X=0.50000000 Y=0.00000000 Z=0.31250000
MULT= 3 ISPLIT= 8
-4: X=0.00000000 Y=0.50000000 Z=0.31250000
-4: X=0.50000000 Y=0.50000000 Z=0.31250000
Ga NPT= 781 R0=0.00010000 RMT= 1.9500 Z: 31.0
LOCAL ROT MATRIX: 0.0000000-0.5000000 0.8660254
0.0000000-0.8660254-0.5000000
1.0000000 0.0000000 0.0000000
ATOM -5: X=0.33333333 Y=0.16666667 Z=0.06250000
MULT= 3 ISPLIT= 8
-5: X=0.83333333 Y=0.16666666 Z=0.06250000
-5: X=0.83333333 Y=0.66666667 Z=0.06250000
Ga NPT= 781 R0=0.00010000 RMT= 1.9500 Z: 31.0
LOCAL ROT MATRIX: 0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
1.0000000 0.0000000 0.0000000
ATOM -6: X=0.33333333 Y=0.66666667 Z=0.06250000
MULT= 1 ISPLIT= 4
Ga NPT= 781 R0=0.00010000 RMT= 1.9500 Z: 31.0
LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
ATOM -7: X=0.33333333 Y=0.16666667 Z=0.56250000
MULT= 3 ISPLIT= 8
-7: X=0.83333333 Y=0.16666666 Z=0.56250000
-7: X=0.83333333 Y=0.66666667 Z=0.56250000
Ga NPT= 781 R0=0.00010000 RMT= 1.9500 Z: 31.0
LOCAL ROT MATRIX: 0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
1.0000000 0.0000000 0.0000000
ATOM -8: X=0.33333333 Y=0.66666667 Z=0.56250000
MULT= 1 ISPLIT= 4
Ga NPT= 781 R0=0.00010000 RMT= 1.9500 Z: 31.0
LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
ATOM -9: X=0.00000000 Y=0.00000000 Z=0.00000000
MULT= 1 ISPLIT= 4
N ch NPT= 781 R0=0.00010000 RMT= 1.6500 Z: 7.0
LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
ATOM -10: X=0.50000000 Y=0.00000000 Z=0.00000000
MULT= 3 ISPLIT= 8
-10: X=0.00000000 Y=0.50000000 Z=0.00000000
-10: X=0.50000000 Y=0.50000000 Z=0.00000000
N NPT= 781 R0=0.00010000 RMT= 1.6500 Z: 7.0
LOCAL ROT MATRIX: 0.0000000-0.5000000 0.8660254
0.0000000-0.8660254-0.5000000
1.0000000 0.0000000 0.0000000
ATOM -11: X=0.00000000 Y=0.00000000 Z=0.50000000
MULT= 1 ISPLIT= 4
N NPT= 781 R0=0.00010000 RMT= 1.6500 Z: 7.0
LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
ATOM -12: X=0.50000000 Y=0.00000000 Z=0.50000000
MULT= 3 ISPLIT= 8
-12: X=0.00000000 Y=0.50000000 Z=0.50000000
-12: X=0.50000000 Y=0.50000000 Z=0.50000000
N NPT= 781 R0=0.00010000 RMT= 1.6500 Z: 7.0
LOCAL ROT MATRIX: 0.0000000-0.5000000 0.8660254
0.0000000-0.8660254-0.5000000
1.0000000 0.0000000 0.0000000
ATOM -13: X=0.33333333 Y=0.16666667 Z=0.25000000
MULT= 3 ISPLIT= 8
-13: X=0.83333333 Y=0.16666666 Z=0.25000000
-13: X=0.83333333 Y=0.66666667 Z=0.25000000
N NPT= 781 R0=0.00010000 RMT= 1.6500 Z: 7.0
LOCAL ROT MATRIX: 0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
1.0000000 0.0000000 0.0000000
ATOM -14: X=0.33333333 Y=0.16666667 Z=0.75000000
MULT= 3 ISPLIT= 8
-14: X=0.83333333 Y=0.16666666 Z=0.75000000
-14: X=0.83333333 Y=0.66666667 Z=0.75000000
N NPT= 781 R0=0.00010000 RMT= 1.6500 Z: 7.0
LOCAL ROT MATRIX: 0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
1.0000000 0.0000000 0.0000000
ATOM -15: X=0.33333333 Y=0.66666667 Z=0.25000000
MULT= 1 ISPLIT= 4
N NPT= 781 R0=0.00010000 RMT= 1.6500 Z: 7.0
LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
ATOM -16: X=0.33333333 Y=0.66666667 Z=0.75000000
MULT= 1 ISPLIT= 4
N NPT= 781 R0=0.00010000 RMT= 1.6500 Z: 7.0
LOCAL ROT MATRIX: 1.0000000 0.0000000 0.0000000
0.0000000 1.0000000 0.0000000
0.0000000 0.0000000 1.0000000
6 NUMBER OF SYMMETRY OPERATIONS
1 0 0 0.0000000
0 1 0 0.0000000
0 0 1 0.0000000
1
0-1 0 0.0000000
1-1 0 0.0000000
0 0 1 0.0000000
2
-1 1 0 0.0000000
-1 0 0 0.0000000
0 0 1 0.0000000
3
0-1 0 0.0000000
-1 0 0 0.0000000
0 0 1 0.0000000
4
-1 1 0 0.0000000
0 1 0 0.0000000
0 0 1 0.0000000
5
1 0 0 0.0000000
1-1 0 0.0000000
0 0 1 0.0000000
6
(brs at hornet)--(~/test_case)
$ cat lapw1.error
** Error in Parallel LAPW1
** LAPW1 STOPPED at Thu May 26 12:09:37 EDT 2005
** check ERROR FILES!
Error in LAPW1
(brs at hornet)--(~/test_case)
$ cat lapw1.def
4,'test_case.klist', 'unknown','formatted',0
5,'test_case.in1c', 'old', 'formatted',0
6,'test_case.output1','unknown','formatted',0
10,'test_case.vector', 'unknown','unformatted',9000
11,'test_case.energy', 'unknown','formatted',0
18,'test_case.vsp', 'old', 'formatted',0
19,'test_case.vns', 'unknown','formatted',0
20,'test_case.struct', 'old', 'formatted',0
21,'test_case.scf1', 'unknown','formatted',0
55,'test_case.vec', 'unknown','formatted',0
71,'test_case.nsh', 'unknown','formatted',0
(brs at hornet)--(~/test_case)
$ cat lapw1_1.error
Error in LAPW1
(brs at hornet)--(~/test_case)
$ cat test_case.klist_1
1 0 0 0 15 1.0 -7.0 1.5 100 k, div: ( 5 5 3)
END
Now, the default error report is IMHO quite inadequate, so I took the
liberty of tracing through the many shell scripts to find out exactly
where the run fails:
I've added a -v to the #!/bin/csh directive in the following shell
scripts:
x, lapw1cpara
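One way to make that edit, as a minimal sketch in POSIX sh: it works on a copy of the script and assumes the interpreter line is either a bare `#!/bin/csh` or `#!/bin/csh` followed by a single flag group such as `-f`. The helper name `add_verbose` and any file paths passed to it are hypothetical, not part of Wien2k. The `v` is merged into an existing flag group because many systems pass at most one argument from the `#!` line.

```shell
#!/bin/sh
# Sketch: write a copy of a csh script whose interpreter line has -v added.
# add_verbose is a hypothetical helper name, not part of Wien2k.
add_verbose() {
    src=$1   # original script, left untouched
    dst=$2   # copy with the verbose flag added
    # Only line 1 is edited:
    #   "#!/bin/csh -f" becomes "#!/bin/csh -fv"
    #   "#!/bin/csh"    becomes "#!/bin/csh -v"
    sed -e '1s|^\(#!/bin/csh[[:space:]][[:space:]]*-[A-Za-z]*\)[[:space:]]*$|\1v|' \
        -e '1s|^#!/bin/csh[[:space:]]*$|#!/bin/csh -v|' \
        "$src" > "$dst"
}
```

The helper is not idempotent: running it on a file that already carries `-v` appends another `v`, so it should be applied once to a fresh copy.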
I've outlined the problem areas with *** and removed extraneous code for
brevity:
$ x lapw1 -c -p
unalias rm
set running = ".running.$$.`hostname`.`date +%d%m%H%M%S`"
echo $$ > $running
onintr clear
alias error 'echo ">>> ($name) !* -> exit"; goto error'
set name = $0
set bin = /usr/local/wien2k-mpi
setenv WIENROOT "/usr/local/wien2k-mpi"
if ! ( -d $bin ) set bin = .
set name = $name:t
.
.
.
setenv USE_REMOTE 0
setenv WIEN_GRANULARITY 1
******************************************************************
setenv WIEN_MPIRUN "/usr/local/v9b/mpi/bin/mpirun -np _NP_ _EXEC_"
******************************************************************
endif
.
.
.
***********************************************
echo "** " Error in Parallel LAPW1 > $def.error
***********************************************
testinput .machines single
echo "starting parallel lapw1 at `date`"
starting parallel lapw1 at Thu May 26 12:27:45 EDT 2005
echo "starting parallel lapw1 at `date`" >> $log
.
.
.
set ttt= ( `echo $mpirun | sed -e "s^_NP_^$number_per_job[$p]^" -e "s^_EXEC_^$WIENROOT/${exe}_mpi ${def}_$loop.def^" -e "s^_HOSTS_^.machine[$p]^"` )
( cd $PWD ; $t $ttt ; rm -f .lock_$lockfile[$p] ) >> .time1_$loop &
[1] 1447
echo $t $ttt
***********************************************************
time /usr/local/v9b/mpi/bin/mpirun -np 4 /usr/local/wien2k-mpi/lapw1c_mpi lapw1_1.def
***********************************************************
endif
jobs -l > .lapw1${cmplx}para.$$.`hostname`
endif
@ p ++
.
.
.
while ( $p < = $proc )
sleep $sleepy
*********************************************************************
Segmentation Fault - core dumped
[1] + Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
*********************************************************************
goto kloop
set p = 1
if ( $?residue && $?resok ) set p = 2
while ( $p < = $proc )
.
.
.
echo "** " LAPW1 crashed!
** LAPW1 crashed!
echo "** " LAPW1 STOPPED at `date` >> $log
echo "** " check ERROR FILES! >> $log
echo "-----------------------------------------------------------------" >> $log
echo "** " Error in Parallel LAPW1 > $def.error
echo "** " LAPW1 STOPPED at `date` >> $def.error
echo "** " check ERROR FILES! >> $def.error
cat ${def}_*.error >> $def.error
echo "ERROR" > .lapw1para
rm $tmp $tmp2 > & /dev/null
rm .lapw1${cmplx}para.$$.`hostname` > & /dev/null
exit 1
19.0u 1.0s 0:09 205% 0+0k 0+0io 0pf+0w
clear:
if ( $?qtl || $?band || $?fermi || $?efg ) then
if ( -f $running ) rm $running
exit ( 0 )
+++++++++++++++++++++++++++++++++++++++++++++++++
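The template expansion seen in the trace above can be reproduced outside lapw1cpara to check the command line in isolation. This is a minimal sketch in POSIX sh, not the Wien2k code itself: `expand_mpirun` is a hypothetical helper, and the `_HOSTS_` substitution is left out because the WIEN_MPIRUN template shown above does not contain that placeholder.

```shell
#!/bin/sh
# Sketch: expand a WIEN_MPIRUN-style template with the same sed
# substitutions as in the trace ("^" is used as the sed delimiter).
# expand_mpirun is a hypothetical helper name, not part of Wien2k.
expand_mpirun() {
    template=$1   # e.g. "/usr/local/v9b/mpi/bin/mpirun -np _NP_ _EXEC_"
    np=$2         # number of processes for this job
    exe=$3        # full path of the MPI executable
    def=$4        # .def file passed to the executable
    printf '%s\n' "$template" |
        sed -e "s^_NP_^$np^" -e "s^_EXEC_^$exe $def^"
}

# Same values as in the trace
expand_mpirun "/usr/local/v9b/mpi/bin/mpirun -np _NP_ _EXEC_" 4 \
    /usr/local/wien2k-mpi/lapw1c_mpi lapw1_1.def
```

With these values the helper prints the same mpirun command that the trace shows after `echo $t $ttt`, minus the leading `time`.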
One last test to verify my claim that lapw1_mpi is crashing:
$ /usr/local/v9b/mpi/bin/mpirun -np 4 /usr/local/wien2k-mpi/lapw1c_mpi lapw1_1.def
Using 4 processors, My ID = 1
Using 4 processors, My ID = 0
Using 4 processors, My ID = 2
Using 4 processors, My ID = 3
Segmentation Fault
Any help at all would be greatly appreciated.
-Brian
--
Brian R. Smith
Systems Engineer
Research Computing Core Facility, USF
Phone: 1(813)974-1467 Cell: 1(813)230-3441
Address: 4202 E Fowler Ave LIB 613
Tampa, FL 33620
Web: http://rccf.acomp.usf.edu