[Wien] Problem when running MPI-parallel version of LAPW0

Rémi Arras remi.arras at cemes.fr
Thu Oct 23 11:04:19 CEST 2014


Thank you everybody for your answers.
For the .machines file, we already have a script and the file is generated correctly.
We will check the library links again and test another version of the
fftw3 library. I will keep you informed once the problem is solved.
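
For reference, rebuilding fftw-3.3.4 with the same Intel compiler and Intel MPI
used for Wien2k would look roughly like the following sketch (the compiler
wrapper names and the install prefix are assumptions and have to match the
local modules):

    ./configure CC=icc F77=ifort MPICC=mpiicc --enable-mpi \
                --prefix=$HOME/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI
    make
    make install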

Best regards,
Rémi Arras

On 22/10/2014 14:22, Peter Blaha wrote:
> Usually the "crucial" point for lapw0 is the fftw3 library.
>
> I noticed you have fftw-3.3.4, which I have never tested. Since fftw already
> broke compatibility between fftw2 and fftw3, maybe they have changed something
> again ...
>
> Besides that, I assume you have installed fftw using the same ifort and
> MPI versions ...
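>
> A quick, hedged way to check this, assuming the standard WIEN2k layout and
> that fftw was built as a shared library, is to look at what the parallel
> binary is actually linked against:
>
>    ldd $WIENROOT/lapw0_mpi | grep -i -E 'fftw|mpi'
>
> If libfftw3_mpi resolves to a build made with a different MPI than the one
> loaded at run time, a segmentation fault right at startup would not be
> surprising.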
>
>
>
> On 10/22/2014 01:29 PM, Rémi Arras wrote:
>> Dear Pr. Blaha, Dear Wien2k users,
>>
>> We tried to install the latest version of Wien2k (14.1) on a supercomputer
>> and we are facing some trouble with the MPI parallel version.
>>
>> 1) lapw0 runs correctly in sequential mode, but crashes systematically
>> when the parallel option is activated (independently of the number of
>> cores we use):
>>
>>> lapw0 -p    (16:08:13) starting parallel lapw0 at lun. sept. 29 16:08:13
>> CEST 2014
>> -------- .machine0 : 4 processors
>> Child id 1 SIGSEGV
>> Child id 2 SIGSEGV
>> Child id 3 SIGSEGV
>> Child id 0 SIGSEGV
>> **  lapw0 crashed!
>> 0.029u 0.036s 0:50.91 0.0% 0+0k 5248+104io 17pf+0w
>> error: command /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c
>> lapw0.def failed
>>> stop error
>>
>> w2k_dispatch_signal(): received: Segmentation fault
>> w2k_dispatch_signal(): received: Segmentation fault
>> Child with myid of 1 has an error
>> 'Unknown' - SIGSEGV
>> Child id 1 SIGSEGV
>> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 1
>> **  lapw0 crashed!
>> cat: No match.
>> 0.027u 0.034s 1:33.13 0.0% 0+0k 5200+96io 16pf+0w
>> error: command /eos3/p1229/remir/INSTALLATION_WIEN/14.1/lapw0para -up -c
>> lapw0.def failed
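>>
>> For context, the ".machine0 : 4 processors" line above is derived from our
>> .machines file, which is generated by a script. A minimal .machines file for
>> such a 4-core MPI run on a single node might look roughly like this sketch
>> ("node1" is only a placeholder):
>>
>> lapw0: node1:4
>> 1: node1:4
>> granularity:1
>> extrafine:1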
>>
>>
>> 2) lapw2 also crashes sometimes when MPI parallelization is used.
>> Sequential or k-point-parallel runs are fine, and contrary to lapw0, the error
>> does not occur in all cases (we did not notice any problem when testing
>> the MPI benchmark with lapw1):
>>
>> w2k_dispatch_signal(): received: Segmentation fault application called
>> MPI_Abort(MPI_COMM_WORLD, 768) - process 0
>>
>> Our system is a Bullx DLC cluster (Red Hat Linux + Intel Ivy Bridge) and
>> we use the compiler (+ MKL) intel/14.0.2.144 and intelmpi/4.1.3.049.
>> The batch scheduler is SLURM.
>>
>> Here are the settings and the options we used for the installation :
>>
>> OPTIONS:
>> current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback
>> current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
>> -Dmkl_scalapack -traceback -xAVX
>> current:FFTW_OPT:-DFFTW3
>> -I/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/include
>> current:FFTW_LIBS:-lfftw3_mpi -lfftw3
>> -L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
>> current:LDFLAGS:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
>> current:DPARALLEL:'-DParallel'
>> current:R_LIBS:-lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread
>> -lmkl_core -openmp -lpthread
>> current:RP_LIBS:-mkl=cluster -lfftw3_mpi -lfftw3
>> -L/users/p1229/remir/INSTALLATION_WIEN/fftw-3.3.4-Intel_MPI/lib
>> current:MPIRUN:mpirun -np _NP_ _EXEC_
>> current:MKL_TARGET_ARCH:intel64
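>>
>> To check whether the fftw3_mpi library from FFTW_LIBS above works at all with
>> this compiler/MPI combination, independently of Wien2k, a small stand-alone
>> test can be used. The following is only a minimal sketch (file name, compile
>> line and grid size are assumptions; the include/lib paths must point to the
>> fftw-3.3.4-Intel_MPI installation listed above):
>>
>> /* fftw_mpi_smoke.c -- minimal FFTW3-MPI smoke test
>>    compile: mpiicc fftw_mpi_smoke.c -I$FFTW/include -L$FFTW/lib \
>>             -lfftw3_mpi -lfftw3 -lm
>>    run:     mpirun -np 4 ./a.out                                  */
>> #include <mpi.h>
>> #include <fftw3-mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>     const ptrdiff_t N0 = 64, N1 = 64;       /* global 64x64 complex grid */
>>     ptrdiff_t local_n0, local_0_start, alloc_local, i, j;
>>     fftw_complex *data;
>>     fftw_plan plan;
>>     int rank;
>>
>>     MPI_Init(&argc, &argv);
>>     fftw_mpi_init();
>>
>>     /* size of the slab owned by this MPI rank */
>>     alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
>>                                          &local_n0, &local_0_start);
>>
>>     data = fftw_alloc_complex(alloc_local);
>>     plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
>>                                 FFTW_FORWARD, FFTW_ESTIMATE);
>>
>>     /* fill the local slab with something simple and transform it */
>>     for (i = 0; i < local_n0; ++i)
>>         for (j = 0; j < N1; ++j) {
>>             data[i*N1 + j][0] = (double)(local_0_start + i);
>>             data[i*N1 + j][1] = 0.0;
>>         }
>>     fftw_execute(plan);
>>
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     if (rank == 0) printf("fftw3-mpi transform finished without crashing\n");
>>
>>     fftw_destroy_plan(plan);
>>     fftw_free(data);
>>     fftw_mpi_cleanup();
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> If this small test already produces a SIGSEGV with 4 processes, the problem
>> is in the fftw/MPI installation rather than in Wien2k itself.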
>>
>> PARALLEL_OPTIONS:
>> setenv TASKSET "no"
>> setenv USE_REMOTE 1
>> setenv MPI_REMOTE 1
>> setenv WIEN_GRANULARITY 1
>> setenv WIEN_MPIRUN "mpirun -np _NP_ _EXEC_"
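>>
>> For what it is worth, the _NP_ and _EXEC_ placeholders in WIEN_MPIRUN are
>> substituted by the *para scripts at run time, so with the ".machine0 : 4
>> processors" shown above the lapw0 step should end up issuing something
>> roughly like
>>
>> mpirun -np 4 $WIENROOT/lapw0_mpi lapw0.def
>>
>> i.e. the segmentation faults are raised inside lapw0_mpi itself rather than
>> in the script layer.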
>>
>> Any suggestions which could help us to solve this problem would be
>> greatly appreciated.
>>
>> Best regards,
>> Rémi Arras
>>
>>
>> _______________________________________________
>> Wien mailing list
>> Wien at zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> SEARCH the MAILING-LIST at: 
>> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>>
>



