[Wien] Error in Hub-U calculation

Gavin Abo gsabo at crimson.ua.edu
Wed Aug 8 09:51:48 CEST 2018


I don't know for sure, but the /localscratch "Permission denied" errors 
may be what is breaking the parallel calculations.

To fix that, the cluster administrator would probably have to give you 
file permission to use /localscratch, or /localscratch would have to be 
changed to a directory that you have access to.  I believe the 
/localscratch can be changed by running ./siteconfig and selecting "T 
Temp Path" in the menu.  However, it seems that the cluster administrator 
would have to do that, since if you were able to do it yourself, you 
could presumably also install the latest WIEN2k version (18.2), which 
has the +U (orb) serial to parallel fix [ 
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17587.html 
].  While waiting more than two months for a cluster administrator, as 
you say, it may be that you can only do serial calculations on the cluster.
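
As a quick check of the permission problem, something like the 
following could be used to test whether a directory is really writable 
from your account.  This is only a sketch: can_write_dir is a made-up 
helper name, not a WIEN2k command, and it assumes a POSIX shell.

```shell
# Hypothetical helper (not part of WIEN2k): report whether the current
# user can actually create a file in the given directory.
can_write_dir() {
    dir="$1"
    tmpfile="$dir/.write_test.$$"
    # Try to create a real file. The subshell keeps a failed redirection
    # from aborting a strict POSIX shell.
    if ( : > "$tmpfile" ) 2>/dev/null; then
        rm -f "$tmpfile"
        echo "writable"
    else
        echo "not writable"
    fi
}
```

For example, "can_write_dir /localscratch" run on the compute node 
should print "not writable" if the cp errors in your STDOUT come from 
missing file permissions.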

If you're still using the job script (job1) in the post

https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg17586.html

that appears to be trying to use MPI for lapw0 and k-point parallel for 
lapw1/2.  The dayfile in that post has "running lapw0 in single mode", 
which indicates that MPI lapw0 is not available or not working, so it 
reverted to its serial calculation mode.  That same dayfile appears to 
come from perhaps multiple continuation attempts with the -NI switch 
(or from a save_lapw between "runsp_lapw -p" and "runsp_lapw -p -orb"), 
as I see only "lapw0 -p" followed by "orb -up -p".  Thus, it seems to 
have lost the output of the previous scf cycles, which might have been 
helpful and important.
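
To see which mode lapw0 used, the dayfile can be searched for that 
message.  A minimal sketch, assuming the message text is exactly the one 
quoted above (lapw0_mode is a made-up helper name, not a WIEN2k script):

```shell
# Hypothetical helper: report whether a dayfile contains the
# "running lapw0 in single mode" message.
lapw0_mode() {
    dayfile="$1"
    if grep -q "running lapw0 in single mode" "$dayfile"; then
        echo "single"
    else
        echo "no single-mode message found"
    fi
}
```

It would be called as "lapw0_mode case.dayfile" in the case folder.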

For example, take a single mode lapw0 and k-point parallel lapw1/2 run 
using only these two commands [1]:

init_lapw -b -sp (with only case.struct, case.indm, and case.inorb files 
in the case folder)
runsp_lapw -p -orb
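
The -p switch also needs a .machines file in the case folder.  The one 
used for this run is not shown here, so the following is only an assumed 
layout: two k-point parallel jobs on localhost and no lapw0 line, which, 
as far as I know, leaves lapw0 in single mode.  write_kpoint_machines is 
a made-up helper name.

```shell
# Hypothetical helper: write an assumed .machines file with two
# k-point parallel jobs on localhost and no "lapw0:" line.
write_kpoint_machines() {
    cat > "$1" <<'EOF'
1:localhost
1:localhost
granularity:1
extrafine:1
EOF
}
```

It would be called as "write_kpoint_machines .machines" before 
runsp_lapw -p -orb.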

A typical case.dayfile for those two commands shows an entire scf cycle 
(lapw0, lapw1, lapw2, etc.) before it does "orb -up -p":

Calculating Utest in /home/username/wiendata/Utest
on computername with PID 19009
using WIEN2k_17.1 (Release 30/6/2017) in /home/username/WIEN2k


     start     (Tue Aug  7 22:52:41 MDT 2018) with lapw0 (40/99 to go)

     cycle 1     (Tue Aug  7 22:52:41 MDT 2018)     (40/99 to go)

 >   lapw0  -p    (22:52:41) starting parallel lapw0 at Tue Aug  7 22:52:41 MDT 2018
-------- .machine0 : processors
running lapw0 in single mode
40.9u 0.2s 0:41.77 98.7% 0+0k 7728+2760io 25pf+0w
 >   lapw1  -up -p  -orb       (22:53:23) starting parallel lapw1 at Tue Aug  7 22:53:23 MDT 2018
->  starting parallel LAPW1 jobs at Tue Aug  7 22:53:23 MDT 2018
running LAPW1 in parallel mode (using .machines)
2 number_of_parallel_jobs
      localhost(24) 233.0u 3.7s 3:58.10 99.4% 0+0k 9792+82832io 40pf+0w
      localhost(23) 220.7u 3.5s 3:45.06 99.6% 0+0k 0+77400io 0pf+0w
    Summary of lapw1para:
    localhost     k=47     user=453.7     wallclock=463.16
454.2u 7.4s 3:59.43 192.8% 0+0k 10048+160632io 43pf+0w
 >   lapw1  -dn -p  -orb       (22:57:22) starting parallel lapw1 at Tue Aug  7 22:57:22 MDT 2018
->  starting parallel LAPW1 jobs at Tue Aug  7 22:57:22 MDT 2018
running LAPW1 in parallel mode (using .machines.help)
2 number_of_parallel_jobs
      localhost(24) 233.3u 3.4s 3:57.62 99.6% 0+0k 0+83072io 0pf+0w
      localhost(23) 221.3u 3.3s 3:45.34 99.6% 0+0k 0+77528io 0pf+0w
    Summary of lapw1para:
    localhost     k=47     user=454.6     wallclock=462.96
455.0u 6.9s 3:58.84 193.4% 0+0k 0+160976io 0pf+0w
 >   lapw2 -up -p     -orb     (23:01:21) running LAPW2 in parallel mode
       localhost 69.5u 0.8s 1:10.72 99.5% 0+0k 0+1408io 0pf+0w
       localhost 69.1u 0.7s 1:10.08 99.6% 0+0k 1376+1312io 4pf+0w
    Summary of lapw2para:
    localhost     user=138.6     wallclock=140.8
139.5u 1.6s 1:12.99 193.4% 0+0k 5952+6176io 19pf+0w
 >   lapw2 -dn -p     -orb     (23:02:34) running LAPW2 in parallel mode
       localhost 59.4u 0.8s 1:00.56 99.6% 0+0k 0+1408io 0pf+0w
       localhost 59.6u 0.7s 1:00.54 99.7% 0+0k 0+1312io 0pf+0w
    Summary of lapw2para:
    localhost     user=119     wallclock=121.1
120.0u 1.6s 1:02.75 193.8% 0+0k 0+6160io 0pf+0w
 >   lcore -up    (23:03:37) 0.0u 0.0s 0:00.13 61.5% 0+0k 1688+552io 6pf+0w
 >   lcore -dn    (23:03:37) 0.0u 0.0s 0:00.08 100.0% 0+0k 0+552io 0pf+0w
 >   mixer  -orb    (23:03:38) 0.3u 0.0s 0:00.51 70.5% 0+0k 4864+3680io 18pf+0w
:ENERGY convergence:  0 0.0001 0
:CHARGE convergence:  0 0.0000 0

     cycle 2     (Tue Aug  7 23:03:38 MDT 2018)     (39/98 to go)

 >   lapw0  -p    (23:03:38) starting parallel lapw0 at Tue Aug  7 23:03:38 MDT 2018
-------- .machine0 : processors
running lapw0 in single mode
45.2u 0.1s 0:45.54 99.7% 0+0k 8+2760io 0pf+0w
 >   orb -up -p    (23:04:24) 0.0u 0.0s 0:00.05 0.0% 0+0k 1464+32io 5pf+0w
 >   orb -dn -p    (23:04:24) 0.0u 0.0s 0:00.01 0.0% 0+0k 0+32io 0pf+0w
 >   lapw1  -up -p  -orb       (23:04:24) starting parallel lapw1 at Tue Aug  7 23:04:24 MDT 2018
->  starting parallel LAPW1 jobs at Tue Aug  7 23:04:24 MDT 2018
running LAPW1 in parallel mode (using .machines)
2 number_of_parallel_jobs
      localhost(24) 235.1u 3.6s 3:59.61 99.6% 0+0k 0+72664io 0pf+0w
      localhost(23) 222.6u 3.5s 3:46.86 99.6% 0+0k 0+68280io 0pf+0w
    Summary of lapw1para:
    localhost     k=47     user=457.7     wallclock=466.47
458.2u 7.2s 4:00.84 193.2% 0+0k 0+141336io 0pf+0w
 >   lapw1  -dn -p  -orb       (23:08:25) starting parallel lapw1 at Tue Aug  7 23:08:25 MDT 2018
->  starting parallel LAPW1 jobs at Tue Aug  7 23:08:25 MDT 2018
running LAPW1 in parallel mode (using .machines.help)
2 number_of_parallel_jobs
      localhost(24) 234.0u 3.6s 3:58.53 99.6% 0+0k 0+72440io 0pf+0w
      localhost(23) 221.3u 3.6s 3:45.78 99.6% 0+0k 0+68096io 0pf+0w
    Summary of lapw1para:
    localhost     k=47     user=455.3     wallclock=464.31
455.8u 7.4s 3:59.76 193.2% 0+0k 0+140912io 0pf+0w
 >   lapw2 -up -p     -orb     (23:12:25) running LAPW2 in parallel mode
       localhost 68.4u 0.8s 1:09.55 99.5% 0+0k 0+1416io 0pf+0w
       localhost 68.2u 0.7s 1:09.40 99.5% 0+0k 0+1328io 0pf+0w
    Summary of lapw2para:
    localhost     user=136.6     wallclock=138.95
137.5u 1.6s 1:11.58 194.4% 0+0k 0+5976io 0pf+0w
 >   lapw2 -dn -p     -orb     (23:13:37) running LAPW2 in parallel mode
       localhost 58.6u 0.8s 0:59.71 99.5% 0+0k 0+1408io 0pf+0w
       localhost 59.0u 0.8s 1:00.03 99.7% 0+0k 0+1320io 0pf+0w
    Summary of lapw2para:
    localhost     user=117.6     wallclock=119.74
118.4u 1.6s 1:02.16 193.3% 0+0k 0+5952io 0pf+0w
 >   lcore -up    (23:14:39) 0.0u 0.0s 0:00.07 100.0% 0+0k 0+552io 0pf+0w
 >   lcore -dn    (23:14:39) 0.0u 0.0s 0:00.08 100.0% 0+0k 0+552io 0pf+0w
 >   mixer  -orb    (23:14:39) 0.3u 0.0s 0:00.35 97.1% 0+0k 0+5488io 0pf+0w
:ENERGY convergence:  0 0.0001 0
:CHARGE convergence:  0 0.0000 0

 >   stop due to .stop file
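
To check a dayfile for this ordering, the line numbers of the first 
lapw1 step and the first "orb -up" step can be compared.  This is only a 
sketch that assumes the " >   step" line layout shown above; 
orb_after_full_cycle is a made-up helper name, not a WIEN2k script.

```shell
# Hypothetical helper: check that the first "orb -up" step in a dayfile
# comes after a lapw1 step, i.e. after at least part of a normal cycle.
orb_after_full_cycle() {
    dayfile="$1"
    # Line numbers of the first "orb -up" step and the first lapw1 step.
    orb_line=$(grep -n '^ *> *orb  *-up' "$dayfile" | head -n 1 | cut -d: -f1)
    lapw1_line=$(grep -n '^ *> *lapw1 ' "$dayfile" | head -n 1 | cut -d: -f1)
    if [ -z "$orb_line" ]; then
        echo "no orb step"
    elif [ -n "$lapw1_line" ] && [ "$lapw1_line" -lt "$orb_line" ]; then
        echo "ok"
    else
        echo "orb before lapw1"
    fi
}
```

For the dayfile above it should print "ok", while a dayfile that has 
only "lapw0 -p" followed by "orb -up -p" should print "orb before lapw1".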

[1] 
https://www.icts.res.in/sites/default/files/blaha-2014_correlated_school-bangalore-exercises.pdf 
(slide 17)

On 8/7/2018 12:49 PM, shaymlal dayananda wrote:
>
>
> On Tuesday, August 7, 2018 12:41 PM, shaymlal dayananda 
> <kcsdayananda at yahoo.com> wrote:
>
>
> Dear Developers and users
>
> I am in trouble: I have not been able to solve my problem yet.
>
> I tried all of your suggestions. I am summarizing them below.
>
> 1. I finished runsp_lapw  -NI -p -ec 0.0001 correctly and saved the 
> files.
> 2. I added the case.indmc and case.inorb files (I copied them from one 
> of the previous emails).
> 3. I submitted the job  runsp_lapw  -NI -p -dm -orb -ec 0.0001
> 4. This job stopped with the same errors as before. Anyway, I am copying 
> them again along with case.vorbdef
>
> STDOUT
> cp: cannot create regular file '/localscratch//.tmp_lapw1para.30532': Permission denied
> cp: cannot stat '/localscratch//.tmp_lapw1para.30532': No such file or directory
> /localscratch//.tmp_testpara_new.30532_2: Permission denied.
> grep: /localscratch//.tmp_lapw1para.30532: No such file or directory
> cut: /localscratch//.tmp_testpara_new.30532_2: No such file or directory
> forrtl: severe (24): end-of-file during read, unit 7, file /scratch/WIEN2k17/TEST/T-hubU/T-hubU.vorbup
> Image              PC Routine            Line Source
> lapw1c             0000000000461858 Unknown               Unknown  Unknown
> lapw1c             0000000000497DDA Unknown               Unknown  Unknown
> lapw1c             000000000042C543 inilpw_                   276  inilpw.f
> lapw1c             000000000042F302 MAIN__                     42 lapw1_tmp_.F
> lapw1c             0000000000403FEE Unknown               Unknown  Unknown
> libc.so.6          00002AD5081B12E0 Unknown               Unknown  Unknown
> lapw1c             0000000000403EEA Unknown               Unknown  Unknown
>
>
> 5. next I tried:  energyup/dn copied to energyup_1/dn_1, all dmat* 
> removed and then tried runsp_lapw -NI -p -dm -orb -ec 0.0001
>   But it gave me the same error. I has changed the energyup_1/dn_1 
> (empty now)
>
> 6. I tried running x lapw1 after deleting dmat* as William suggested, 
> but this also leads to the same error as I have given in 5. above.
>
>
> I would very much appreciate it if anyone could give the correct way 
> to do a Hubbard U included PARALLEL calculation with WIEN2k 17.1.
>
> I have asked our supercomputer staff to install the latest version, 
> but due to an internal issue this will take more than two more months. 
> I cannot wait that long.
>
> Thank you
>
> Daya