[Wien] Spin-orbit coupling crash

Tue Oct 1 14:08:57 CEST 2019

Which 2016 ifort version?  Check in a terminal with: ifort -v

Update 3 (16.0.3.210) in particular is known to cause problems [1,2].
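
As a rough sketch of that check, the version string printed by the
compiler could be classified as below.  The function name is
hypothetical (it is not part of WIEN2k or the Intel tools), and it
assumes you pass in the version string by hand, for example copied from
the ifort -v output.

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k or the Intel tools):
# classify an ifort version string such as "16.0.3.210".
classify_ifort_version() {
    case "$1" in
        16.0.3|16.0.3.*) echo "16.0.3 (2016 Update 3): reported as problematic" ;;
        16.*)            echo "2016 release, but not Update 3" ;;
        *)               echo "not a 2016 release" ;;
    esac
}

# Example: classify the Update 3 version string mentioned above.
classify_ifort_version "16.0.3.210"
```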

In the traceback below, I see libmkl_blacs_inte, which likely indicates 
you are using Intel MPI (impi).  You might need Intel 2019 Update 5, 
which has the memory leak fix [3,4].
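
One way to confirm which MPI and BLACS libraries the binary picks up is
to look at the ldd output.  The filter below is only a sketch: the
function name is hypothetical, and the path in the usage comment
assumes the executable sits in $WIENROOT.

```shell
#!/bin/sh
# Hypothetical helper: keep only the MPI/BLACS lines from `ldd` output
# read on stdin. `|| true` keeps the exit status 0 when nothing matches.
filter_mpi_libs() {
    grep -i -E 'libmpi|blacs' || true
}

# Usage (path is an assumption; point it at your own lapwso_mpi):
#   ldd "$WIENROOT/lapwso_mpi" | filter_mpi_libs
```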

The "process interrupted (SIGINT)" error might be the main cause.  That 
can happen if you pressed Ctrl-C [5]. I cannot remember for certain, but 
it might also happen if the job hit the walltime limit [6] or stopped 
after you closed the terminal window [7].
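
To see in which order the forrtl messages appeared in the job output,
something like the following could be used.  This is only a sketch: the
function name and the log file name are hypothetical, and it relies on
nothing beyond grep.

```shell
#!/bin/sh
# Hypothetical helper: list every line starting with "forrtl:" in a log
# file, prefixed with its line number, in the order it was written.
list_forrtl_errors() {
    grep -n '^forrtl:' "$1" || true
}

# Usage (job.log is an assumed name for the saved job output):
#   list_forrtl_errors job.log
```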

[1] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg15459.html
[2] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17284.html
[3] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg19050.html
[4] https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg18798.html
[5] http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-November/011824.html
[6] https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2014-January/064357.html
[7] https://stackoverflow.com/questions/38840656/nohup-command-in-submitting-jobs-to-cluster

On 10/1/2019 2:31 AM, Luigi Maduro - TNW wrote:
>
> Dear WIEN2k users,
>
> I am trying to carry out a calculation on a supercell of MoS2 with 
> spin-orbit coupling in parallel mode using the WIEN2k_19.1 version. 
> The calculation runs fine for lapw0 and lapw1, however when it reaches 
> lapwso the calculation crashes and gives the following error:
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> LAPW0 END
> [1]    Done                          mpirun -np 120 -machinefile .machine0 /home/WIEN2k_19_2/lapw0_mpi lapw0.def >> .time00
> LAPW1 END
> LAPW1 END
> [4]    Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> [6]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> [5]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> [3]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> [2]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
> [1]  + Done                          ( cd $PWD; $t $ttt; rm -f .lock_$lockfile[$p] ) >> .time1_$loop
>
> forrtl: severe (39): error during read, unit 9, file /home/Data/MoS2_SO/MoS2_SO.vector_1
> Image              PC                Routine            Line        Source
> lapwso_mpi         000000000046BC13  Unknown               Unknown  Unknown
> lapwso_mpi         0000000000490934  Unknown               Unknown  Unknown
> lapwso_mpi         0000000000429158  kptin_                     60  kptin.F
> lapwso_mpi         000000000042F7EE  MAIN__                    570  lapwso.F
> lapwso_mpi         0000000000405C5E  Unknown               Unknown  Unknown
> libc.so.6          00002B04C2A12B35  Unknown               Unknown  Unknown
> lapwso_mpi         0000000000405B69  Unknown               Unknown  Unknown
> forrtl: error (69): process interrupted (SIGINT)
> Image              PC                Routine            Line        Source
> lapwso_mpi         0000000000523F95  Unknown               Unknown  Unknown
> lapwso_mpi         0000000000521BB7  Unknown               Unknown  Unknown
> lapwso_mpi         00000000004D8084  Unknown               Unknown  Unknown
> lapwso_mpi         00000000004D7E96  Unknown               Unknown  Unknown
> lapwso_mpi         000000000046C929  Unknown               Unknown  Unknown
> lapwso_mpi         000000000047140E  Unknown               Unknown  Unknown
> libpthread.so.0    00002B2A5349B370  Unknown               Unknown  Unknown
> libmpi.so.12       00002B2A58D16455  Unknown               Unknown  Unknown
> libmpi.so.12       00002B2A58F52D74  Unknown               Unknown  Unknown
> libmkl_blacs_inte  00002B2A547FC015  Unknown               Unknown  Unknown
> libmkl_blacs_inte  00002B2A547FF9A9  Unknown               Unknown  Unknown
> libmkl_blacs_inte  00002B2A547DDF96  Unknown               Unknown  Unknown
> lapwso_mpi         0000000000429FFB  kptin_                    108  kptin.F
> lapwso_mpi         000000000042F7EE  MAIN__                    570  lapwso.F
> lapwso_mpi         0000000000405C5E  Unknown               Unknown  Unknown
> libc.so.6          00002B2A595F5B35  Unknown               Unknown  Unknown
> lapwso_mpi         0000000000405B69  Unknown               Unknown  Unknown
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> I have used the intel_xe_2016 compiler to compile WIEN2k_19.1. I am 
> using a Beowulf style cluster where each individual node is a shared 
> memory machine and runs CentOS 7. A scheduler (Maui) and a resource 
> manager (Torque) are both running on the master node. I have written a 
> script to create a .machines file on the fly, and for this calculation 
> it looks like this:
>
> 1:n05-07:20
> 1:n05-08:20
> 1:n05-09:20
> 1:n05-10:20
> 1:n05-11:20
> 1:n05-12:20
> lapw0:n05-07:20 n05-08:20 n05-09:20 n05-10:20 n05-11:20 n05-12:20
> dstart:n05-07:20 n05-08:20 n05-09:20 n05-10:20 n05-11:20 n05-12:20
> nlvdw:n05-07:20 n05-08:20 n05-09:20 n05-10:20 n05-11:20 n05-12:20
>
> Any suggestions for finding/fixing the cause of the crash are highly 
> appreciated. :)
>
> Kind regards,
>
> Luigi Maduro
>
> PhD candidate
> Kavli Institute of Nanoscience
>
> Department of Quantum Nanoscience
>
> Faculty of Applied Sciences
>
> Delft University of Technology
>