[Wien] LAPW1 doesn't show error in parallel calculation

Laurence Marks laurence.marks at gmail.com
Wed Sep 9 14:57:15 CEST 2020


Unfortunately, the convention that *.error files are zero length when a task
runs correctly is easily broken when remote execution (ssh/mpi) does not
work: if the task never starts, nothing is written to its error file. I think
in the cases you sent there is sufficient information to debug; I suspect an
issue with directory names and/or mounts.

Suggestion to Peter: perhaps add an "echo Startup Error > lapw1[0-2].error"
in lapw1[0-2]para to catch this?
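
A minimal sketch of that idea, written in plain sh rather than the csh of the
actual WIEN2k scripts. The function name run_guarded and its arguments are
hypothetical, and clearing the file on a zero exit status is my assumption
about how the marker would be removed; it is not what lapw1para does today.

```sh
#!/bin/sh
# Sketch only: run_guarded is a hypothetical helper, not part of WIEN2k.
#
# run_guarded ERRFILE COMMAND [ARGS...]
# Pre-fills ERRFILE before launching COMMAND and truncates it to zero length
# only if COMMAND exits with status 0.
run_guarded() {
    errfile=$1
    shift
    # Mark the task as failed up front, so a launch that never happens
    # (broken ssh, wrong directory, missing mount) still leaves a trace.
    echo "Startup Error" > "$errfile"
    if "$@"; then
        # The command started and finished cleanly: restore the usual
        # "zero length means success" convention.
        : > "$errfile"
    else
        status=$?
        echo "command exited with status $status" >> "$errfile"
        return "$status"
    fi
}
```

Called as, say, "run_guarded lapw1_1.error <launcher and its arguments>", a
launch that fails leaves a nonzero lapw1_1.error for the existing testerror
check to find. This only helps if the launcher passes a nonzero exit status
back; if it reports success for a task that never ran, the file is still
cleared.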

_____
Professor Laurence Marks
"Research is to see what everybody else has seen, and to think what nobody
else has thought", Albert Szent-Györgyi
www.numis.northwestern.edu

On Wed, Sep 9, 2020, 06:48 Lyudmila Dobysheva <lyuka17 at mail.ru> wrote:

> 09.09.2020 00:01, Peter Blaha wrote:
> > alias   testerror       'if (! -z \!:1.error) goto error'
> > you can catch a problem.
>
> > Am 08.09.2020 um 20:38 schrieb Yundi Quan:
> >> The simplest way that I can think of is to check whether the
> >> lapw1.error file is empty or not after executing x lapw1.
>
> >> On Tue, Sep 8, 2020 at 2:23 PM Rubel, Oleg <rubelo at mcmaster.ca
> >> <mailto:rubelo at mcmaster.ca>> wrote:
> >>     I wonder if there is a _simple_ alternative way for sensing an
> >>     error? Also message is not always "XXXXX - Error". It can be
>
> I am currently trying to run a calculation on a supercomputer with a
> random structure for testing. I have already got past some problems, but
> I still sometimes hit errors, and none of the *.error files is nonzero.
> I am attaching three files:
> 1. slurm*out, where the errors are shown. The first one, before lapw0,
> had no effect (I do not know why): lapw0 ran, and all its output files
> are good. lapw1 did not run.
>
> 2. *.dayfile: I can tell that lapw1 did not run only from the
> implausibly short times:
> tesla46(6) 0.006u 0.010s 0.75 2.11%      0+0k 0+0io 0pf+0w
> (the next lines are my additional output inserted into lapw1para:
> 1 t taskset0 exe def_loop.def time srun 0 lapw1 lapw1_1.def)
>
> 3. ls-l.output shows that all the *.error files are zero length, and the
> files that lapw1 should have produced are absent.
>
> It does not matter why the task did not run, but why are the
> lapw1*.error files zero length?
> For testing I submitted run -e lapw1; otherwise it would have gone on to
> lapw2 without stopping.
>
> Best regards
> Lyudmila Dobysheva
> ------------------
>
> http://ftiudm.ru/content/view/25/103/lang,english/
> Physics-Techn.Institute,
> Udmurt Federal Research Center, Ural Br. of Rus.Ac.Sci.
> 426000 Izhevsk Kirov str. 132
> Russia
> ---
> Tel. +7 (34I2)43-24-59 (office), +7 (9I2)OI9-795O (home)
> Skype: lyuka18 (office), lyuka17 (home)
> E-mail: lyuka17 at mail.ru (office), lyuka17 at gmail.com (home)