[Wien] lapw2 crash

Scott Beardsley scott at cse.ucdavis.edu
Tue Dec 23 00:12:01 CET 2008


Peter Blaha wrote:
> No, WIEN should not do ssh when USE_REMOTE 0
> 
> Try  "x lapw2 -p -up"

This fails as well (see the snip below from the dayfile):

...
if ( 1 > 1 ) echo sleeping for 1 seconds
sleep 1
mpirun -np 16 -machinefile .machines.sge /share/apps/wien-2k_08/lapw2_mpi uplapw2_1.def 1
hostname
jobs -l
endif
@ p ++
end
while ( 2 < = 1 )
end
while ( 1 < 1 )
if ( 1 > 0 ) echo
echo
if ( 1 > 0 ) echo waiting for processes:
echo waiting for processes:
wait
[icompute-4-1.local:27334] *** An error occurred in MPI_Comm_split
[icompute-1-30.local:11280] *** An error occurred in MPI_Comm_split
[icompute-1-19.local:06229] *** An error occurred in MPI_Comm_split
[icompute-1-19.local:06229] *** on communicator MPI_COMM_WORLD
[icompute-1-19.local:06229] *** MPI_ERR_ARG: invalid argument of some other kind
[icompute-1-19.local:06229] *** MPI_ERRORS_ARE_FATAL (goodbye)
[icompute-1-35.local:08913] *** An error occurred in MPI_Comm_split
[icompute-1-35.local:08913] *** on communicator MPI_COMM_WORLD
[icompute-1-35.local:08913] *** MPI_ERR_ARG: invalid argument of some other kind
[icompute-1-35.local:08913] *** MPI_ERRORS_ARE_FATAL (goodbye)
[icompute-4-1.local:27335] *** An error occurred in MPI_Comm_split
[icompute-4-1.local:27335] *** on communicator MPI_COMM_WORLD

...

[icompute-1-35.local:08915] *** MPI_ERRORS_ARE_FATAL (goodbye)
[icompute-4-1.local:27210] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[icompute-4-1.local:27210] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 791
[icompute-4-1.local:27210] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[icompute-4-1.local:27210] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[icompute-4-1.local:27210] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_gridengine_module.c at line 826
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. 
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
[icompute-4-1.local:27331] OOB: Connection to HNP lost
[icompute-1-35.local:08911] OOB: Connection to HNP lost
[icompute-1-30.local:11279] OOB: Connection to HNP lost
rm -f .lock_icompute-0-101
sleep 1
set i = 1
while ( 1 < = 1 )
if ( ! -z uplapw2_1.error ) goto error
goto error
...

> Check the lapw2para_lapw script.
> You may also add -x in the header of lapw2para_lapw. This will give a long
> debug output of the shell script starting the fortran jobs.

I added -x and some debug commands (including an mpirun to test the 
interconnect). The debug output points at lapw2_mpi itself: for some 
reason it is terminating abnormally.
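
Since the failure happens in MPI_Comm_split itself, one quick way to tell a
broken Open MPI / interconnect setup apart from a problem in lapw2_mpi is a
standalone test that makes the same call over the same machinefile. The
sketch below is only illustrative (file name, build command and process
count are my assumptions, nothing WIEN-specific):

/* comm_split_test.c - minimal MPI_Comm_split sanity check (illustrative,
 * not part of WIEN2k). If this dies with the same MPI_ERR_ARG, the MPI
 * installation/interconnect is at fault rather than lapw2_mpi.
 *
 * Build and run (paths and counts are assumptions):
 *   mpicc comm_split_test.c -o comm_split_test
 *   mpirun -np 16 -machinefile .machines.sge ./comm_split_test
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm sub;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Split COMM_WORLD into two halves, as a stand-in for the
     * MPI_Comm_split call that fails in the dayfile output above. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &sub);

    printf("rank %d of %d on %s: MPI_Comm_split OK\n", rank, size, host);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}

If every rank prints the OK line, MPI itself is fine and the problem is more
likely inside lapw2_mpi or the way lapw2para invokes it.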

> Actually, lapw2para calls first
> 
> lapw2 uplapw2.def N   with the "-fermi" option, i.e. no mpi-binary
> 
> and only then it would start the   lapw2_mpi executables.

This part seems to complete successfully (see the snip from the dayfile below):

...
if ( LSDA_2 ==  ) then
if ( 1 > 0 ) echo Setting up case LSDA_2 for parallel execution
echo Setting up case LSDA_2 for parallel execution
if ( 1 > 0 ) echo of LAPW2
echo of LAPW2
if ( 1 > 0 ) echo
echo
set fermi = `head -1 $case.in2$cmplx$eece|cut -c-5`
head -1 LSDA_2.in2
cut -c-5
if ( TOT == QTL ) then
if ( TOT == EFG ) then
if ( TOT == FERMI ) then
cp LSDA_2.in2 .in.tmp
echo FERMI
set len = `wc .in.tmp`
wc .in.tmp
@ len --
tail -6 LSDA_2.in2
cp .in.tmp1 LSDA_2.in2
echo ->  starting Fermi on icompute-4-1.local at `date`
date
touch LSDA_2.weighup_ LSDA_2.clmvalup_1 LSDA_2.vrespvalup_1 LSDA_2.helpup_1 LSDA_2.scf2up_1
rm LSDA_2.weighup_ LSDA_2.weighup_1 LSDA_2.clmvalup_1 LSDA_2.vrespvalup_1 LSDA_2.helpup_1 LSDA_2.scf2up_1
lapw2 uplapw2.def 1
  STOP LAPW2 - FERMI; weighs written
  STOP
cp .in.tmp LSDA_2.in2
rm .in.tmp .in.tmp1
if ( TOT == FERMI ) then
if ( ! -z uplapw2.error ) goto error
if ( 1 > 0 ) echo
echo
if ( 1 > 0 ) echo -n creating uplapw2_*.def:
echo -n creating uplapw2_*.def:
...

