[Wien] k point parallel calculations
Peter Blaha
pblaha at theochem.tuwien.ac.at
Wed Feb 25 07:40:29 CET 2015
Looks like a problem with the network / fileserver / NFS.
It looks as if some files are not written properly.
Look into mixer at line 168. It is reading something incorrectly ...
(which could again be an NFS problem).
Are you sure that there is nothing wrong with case.scf1_*?
Do ls -alsrt *scf1_*
Are all files written completely (to the end; check their sizes),
and are the dates/timestamps correct?
Try parallelization with fewer nodes.
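
The file check suggested above can also be scripted, which is handy with 24 parallel jobs. The following is only a rough sketch, not part of WIEN2k: the function name check_scf1_files, the 600-second timestamp tolerance, and the NUL-byte and missing-newline heuristics for spotting incompletely written files are all assumptions chosen for illustration.

```python
import glob
import os


def check_scf1_files(pattern="*scf1_*", max_age_spread=600.0):
    """Return (path, problem) tuples for files that look incompletely written.

    The heuristics are assumptions for illustration, not WIEN2k rules:
    empty files, NUL bytes, a missing final newline, and a modification
    time far older than the newest matching file.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        return [(pattern, "no files matched")]

    stats = {path: os.stat(path) for path in paths}
    newest = max(st.st_mtime for st in stats.values())

    problems = []
    for path, st in stats.items():
        if st.st_size == 0:
            problems.append((path, "empty file"))
            continue
        # The scf1 files are small text files, so reading them whole is fine.
        with open(path, "rb") as fh:
            data = fh.read()
        if b"\x00" in data:
            problems.append((path, "contains NUL bytes"))
        if not data.endswith(b"\n"):
            problems.append((path, "last line not terminated"))
        if newest - st.st_mtime > max_age_spread:
            problems.append((path, "timestamp much older than newest file"))
    return problems


if __name__ == "__main__":
    # Run in the case directory, like the ls command above.
    for path, problem in check_scf1_files():
        print(f"{path}: {problem}")
```

An empty result does not prove the files are fine; it only means none of these simple checks triggered.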
On 24.02.2015 at 19:19, Priyanka Seth wrote:
> Hello all,
>
> I have been trying to run k-point parallel calculations for some large structures and have been having problems with versions 12, 13 and 14, all compiled with ifort.
> In all cases, I am running on the same number of cores as k vectors. Note that calculations started from the same input and run on a single core finish without any problems.
>
> v12/v13
> =====
>
> This is the output for versions 12 and 13 (I've removed the node-dependent lines):
>
> LAPW0 END
> LAPW1 END
> LAPW2 - FERMI; weighs written
> LAPW2 END
> SUMPARA END
> CORE END
> forrtl: severe (59): list-directed I/O syntax error, unit -5, file Internal List-Directed Read
> Image PC Routine Line Source
> mixer 000000000051693D Unknown Unknown Unknown
> mixer 0000000000515445 Unknown Unknown Unknown
> mixer 00000000004BC9E0 Unknown Unknown Unknown
> mixer 000000000046F4BA Unknown Unknown Unknown
> mixer 000000000046ECB0 Unknown Unknown Unknown
> mixer 0000000000492B76 Unknown Unknown Unknown
> mixer 000000000049043B Unknown Unknown Unknown
> mixer 0000000000407E7E MAIN__ 168 mixer.F
> mixer 000000000040414C Unknown Unknown Unknown
> libc.so.6 00000037C241D994 Unknown Unknown Unknown
> mixer 0000000000403FC9 Unknown Unknown Unknown
>
> > stop error
>
> Looking at the error files, I have "Error in MIXER" in both versions.
>
> The dayfile ends as follows:
> 1.884u 0.844s 0:09.73 27.9% 0+0k 0+0io 8pf+0w
> > lcore (09:33:51) 0.046u 0.007s 0:00.14 28.5% 0+0k 0+0io 7pf+0w
> > mixer (09:33:51) 0.000u 0.005s 0:00.04 0.0% 0+0k 0+0io 8pf+0w
> error: command /home/pseth/SOURCES/WIEN2K_v13/mixer mixer.def failed
>
> > stop error
>
>
> v14
> ===
>
> I get to the second cycle, but then the calculation crashes with "Error in LAPW1" in lapw1_*.error:
>
> LAPW2 END
> SUMPARA END
> CORE END
> MIXER END
> ec cc and fc_conv 0 0 1
> in cycle 2 ETEST: 0 CTEST: 0
> LAPW0 END
>
> There is nothing obviously wrong looking at the case.scf1_* files or at the dayfile which ends like this:
>
> > lapw1 -p (09:37:40) starting parallel lapw1 at Tue Feb 10 09:37:40 CET 2015
> -> starting parallel LAPW1 jobs at Tue Feb 10 09:37:40 CET 2015
> running LAPW1 in parallel mode (using .machines)
> 24 number_of_parallel_jobs
> [1] 30405
> [2] 30437
> [3] 30471
> [4] 30507
> [5] 30559
> [6] 30606
> [7] 30653
> [8] 30717
> [9] 30809
> [10] 30916
> [11] 31000
> [12] 31070
> [13] 31192
> [14] 31329
> [15] 31428
> [16] 31504
> [17] 31664
> [18] 31788
> [19] 31871
> [20] 31900
> [21] 31928
> [22] 31956
> [23] 31982
> [24] 32010
> [5] Done ( ( $remote $machine[$p] ...
>
>
> I understand that this is not much information to go on, but I don't really know where else to look! Has anyone had similar issues? What else would help in diagnosing the problem?
>
> Many thanks,
> Priyanka
--
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------