[Wien] Spin-orbit coupling crash
Gavin Abo
gsabo at crimson.ua.edu
Tue Oct 1 14:08:57 CEST 2019
Which 2016 ifort are you using? Check in a terminal with: ifort -v
Update 3 (16.0.3.210) in particular is known to cause problems [1,2].
Below, I see libmkl_blacs_inte in the traceback, which likely indicates
that you are using Intel MPI (impi). You might need Intel 2019 Update 5,
which has the memory leak fix [3,4].
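
To check both points from a terminal, something like the following might help. It is only a sketch: the helper name is_flagged_ifort is made up for illustration, the version test is a plain substring match, and the fallback path for WIENROOT is just copied from the mpirun line in the quoted output, so adjust it to your installation.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a version string mentions 16.0.3
# (2016 Update 3), the release flagged in [1,2]. Plain substring match.
is_flagged_ifort() {
    case "$1" in
        *16.0.3*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report the compiler version if ifort is on the PATH.
if command -v ifort >/dev/null 2>&1; then
    version_line=$(ifort -v 2>&1)
    echo "$version_line"
    if is_flagged_ifort "$version_line"; then
        echo "This looks like 2016 Update 3; consider another compiler version."
    fi
else
    echo "ifort not found on PATH"
fi

# Assumption: WIENROOT points at the WIEN2k installation; the fallback path
# is only copied from the mpirun line in the quoted output.
WIENROOT="${WIENROOT:-/home/WIEN2k_19_2}"

# List the MPI and BLACS shared libraries that lapwso_mpi is linked against.
if [ -x "$WIENROOT/lapwso_mpi" ]; then
    ldd "$WIENROOT/lapwso_mpi" | grep -i -E 'mpi|blacs' || echo "no MPI or BLACS libraries listed"
else
    echo "lapwso_mpi not found in $WIENROOT"
fi
```

The ldd output should show which libmpi.so the binary resolves to at run time, which is what tells you whether Intel MPI is really the library in use.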
The "process interrupted (SIGINT)" message might point to the main cause.
A SIGINT can happen if you pressed Ctrl-C [5]. I cannot remember for
certain, but it might also happen if the job hit the walltime limit [6]
or if the job stopped after you closed the terminal window [7].
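
To rule out the walltime limit, you could compare the requested and used walltime from inside the Torque job. This is a rough sketch, not a tested recipe for your cluster: walltime_to_seconds is a made-up helper name, and it assumes the walltime values are printed as HH:MM:SS.

```shell
#!/bin/sh
# Hypothetical helper: convert an HH:MM:SS walltime string to seconds,
# so that a limit and a used time can be compared numerically.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'
}

# Inside a Torque job, PBS_JOBID is set by the resource manager.
if [ -n "${PBS_JOBID:-}" ] && command -v qstat >/dev/null 2>&1; then
    # Prints Resource_List.walltime (the limit) and, for a running job,
    # resources_used.walltime.
    qstat -f "$PBS_JOBID" | grep -i walltime
else
    echo "not inside a Torque job, or qstat not found"
fi

# If the run was started interactively rather than through the queue,
# nohup keeps it alive after the terminal is closed, for example:
#   nohup run_lapw -p -so > run.log 2>&1 &
```

If the used walltime is close to the limit at the moment lapwso stops, the scheduler killing the job becomes the more likely explanation than a compiler or MPI problem.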
[1] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15459.html
[2] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17284.html
[3] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg19050.html
[4] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg18798.html
[5] http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-November/011824.html
[6] https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2014-January/064357.html
[7] https://stackoverflow.com/questions/38840656/nohup-command-in-submitting-jobs-to-cluster
On 10/1/2019 2:31 AM, Luigi Maduro - TNW wrote:
>
> Dear WIEN2k users,
>
> I am trying to carry out a calculation on a supercell of MoS2 with
> spin-orbit coupling in parallel mode using the WIEN2k_19.1 version.
> The calculation runs fine for lapw0 and lapw1, however when it reaches
> lapwso the calculation crashes and gives the following error:
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> LAPW0 END
> [1] Done mpirun -np 120 -machinefile .machine0 /home/WIEN2k_19_2/lapw0_mpi lapw0.def >> .time00
> LAPW1 END
> LAPW1 END
> [4] Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> [6] + Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> [5] + Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> [3] + Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> [2] + Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> [1] + Done ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>
> forrtl: severe (39): error during read, unit 9, file /home/Data/MoS2_SO/MoS2_SO.vector_1
> Image              PC                Routine  Line     Source
> lapwso_mpi         000000000046BC13  Unknown  Unknown  Unknown
> lapwso_mpi         0000000000490934  Unknown  Unknown  Unknown
> lapwso_mpi         0000000000429158  kptin_   60       kptin.F
> lapwso_mpi         000000000042F7EE  MAIN__   570      lapwso.F
> lapwso_mpi         0000000000405C5E  Unknown  Unknown  Unknown
> libc.so.6          00002B04C2A12B35  Unknown  Unknown  Unknown
> lapwso_mpi         0000000000405B69  Unknown  Unknown  Unknown
>
> forrtl: error (69): process interrupted (SIGINT)
> Image              PC                Routine  Line     Source
> lapwso_mpi         0000000000523F95  Unknown  Unknown  Unknown
> lapwso_mpi         0000000000521BB7  Unknown  Unknown  Unknown
> lapwso_mpi         00000000004D8084  Unknown  Unknown  Unknown
> lapwso_mpi         00000000004D7E96  Unknown  Unknown  Unknown
> lapwso_mpi         000000000046C929  Unknown  Unknown  Unknown
> lapwso_mpi         000000000047140E  Unknown  Unknown  Unknown
> libpthread.so.0    00002B2A5349B370  Unknown  Unknown  Unknown
> libmpi.so.12       00002B2A58D16455  Unknown  Unknown  Unknown
> libmpi.so.12       00002B2A58F52D74  Unknown  Unknown  Unknown
> libmkl_blacs_inte  00002B2A547FC015  Unknown  Unknown  Unknown
> libmkl_blacs_inte  00002B2A547FF9A9  Unknown  Unknown  Unknown
> libmkl_blacs_inte  00002B2A547DDF96  Unknown  Unknown  Unknown
> lapwso_mpi         0000000000429FFB  kptin_   108      kptin.F
> lapwso_mpi         000000000042F7EE  MAIN__   570      lapwso.F
> lapwso_mpi         0000000000405C5E  Unknown  Unknown  Unknown
> libc.so.6          00002B2A595F5B35  Unknown  Unknown  Unknown
> lapwso_mpi         0000000000405B69  Unknown  Unknown  Unknown
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> I have used the intel_xe_2016 compiler to compile WIEN2k_19.1. I am
> using a Beowulf style cluster where each individual node is a shared
> memory machine and runs CentOS 7. A scheduler (Maui) and a resource
> manager (Torque) are both running on the master node. I have written a
> script to create a .machines file on the fly, and for this calculation
> it looks like this:
>
> 1:n05-07:20
> 1:n05-08:20
> 1:n05-09:20
> 1:n05-10:20
> 1:n05-11:20
> 1:n05-12:20
> lapw0:n05-07:20 n05-08:20 n05-09:20 n05-10:20 n05-11:20 n05-12:20
> dstart:n05-07:20 n05-08:20 n05-09:20 n05-10:20 n05-11:20 n05-12:20
> nlvdw:n05-07:20 n05-08:20 n05-09:20 n05-10:20 n05-11:20 n05-12:20
>
> Any suggestions for finding/fixing the cause of the crash are highly
> appreciated. :)
>
> Kind regards,
>
> Luigi Maduro
>
> PhD candidate
> Kavli Institute of Nanoscience
>
> Department of Quantum Nanoscience
>
> Faculty of Applied Sciences
>
> Delft University of Technology
>