[Wien] LAPW1 doesn't show error in parallel calculation

Lyudmila Dobysheva lyuka17 at mail.ru
Wed Sep 9 13:43:15 CEST 2020


09.09.2020 00:01, Peter Blaha wrote:
> alias   testerror       'if (! -z \!:1.error) goto error'
> you can catch a problem.
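
For reference, a minimal tcsh sketch of how this alias catches a problem (the error: label and the message are illustrative, not part of WIEN2k):

   # tcsh: "-z file" is true when the file has zero size, so
   # "! -z" fires when the .error file is non-empty
   alias testerror 'if (! -z \!:1.error) goto error'

   x lapw1
   testerror lapw1    # expands to: if (! -z lapw1.error) goto error
   exit 0

   error:
   echo "lapw1 reported an error"
   exit 1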

> On 08.09.2020 at 20:38, Yundi Quan wrote:
>> The simplest way that I can think of is to check whether the 
>> lapw1.error file is empty or not after executing x lapw1.
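
Written out directly, that check would be (a minimal tcsh sketch):

   x lapw1
   # a non-empty lapw1.error means lapw1 wrote an error message
   if (! -z lapw1.error) then
       echo "lapw1 failed"
       exit 1
   endif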

>> On Tue, Sep 8, 2020 at 2:23 PM Rubel, Oleg <rubelo at mcmaster.ca 
>> <mailto:rubelo at mcmaster.ca>> wrote:
>>     I wonder if there is a _simple_ alternative way for sensing an
>>     error? Also, the message is not always "XXXXX - Error". It can be

Just now I tried to run a calculation on a supercomputer with a random 
structure for testing. I have already gotten past some problems, but I 
still sometimes hit errors, and yet there are no nonzero *.error files. 
I am attaching three files:
1. slurm*out, where the errors are shown. The first error appeared 
before lapw0 but did no harm (I do not know why); lapw0 was calculated 
and all its output files are good. lapw1 was not calculated.

2. *.dayfile, where I can see that lapw1 was not calculated only from 
the suspiciously small times:
tesla46(6) 0.006u 0.010s 0.75 2.11%      0+0k 0+0io 0pf+0w
(the next lines are my additional output inserted into lapw1para:
1 t taskset0 exe def_loop.def time srun 0 lapw1 lapw1_1.def)

3. ls-l.output, which shows that all the *.error files have zero size, 
and the files that lapw1 should have produced are absent.

It does not matter why the task was not calculated; the question is why 
the lapw1_*.error files are zero.
For this test I submitted "run -e lapw1"; otherwise the calculation 
would have proceeded to lapw2 without stopping.
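
A check that would catch my case (my own sketch, in the spirit of the testerror alias above; Gold_23l is my test case and "error" is the handler label from that sketch): an empty lapw1_*.error only proves that lapw1 wrote no error message, not that it ran at all, so one should also require the output files to exist:

   # tcsh sketch: the .error test alone misses the case where lapw1
   # never started; also require that lapw1 produced its output
   foreach i (1 2 3 4)
       if (! -z lapw1_$i.error) goto error    # lapw1 reported an error
       if (! -e Gold_23l.scf1_$i) goto error  # no output: lapw1 never ran
   end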

Best regards
Lyudmila Dobysheva
------------------
http://ftiudm.ru/content/view/25/103/lang,english/
Physics-Techn.Institute,
Udmurt Federal Research Center, Ural Br. of Rus.Ac.Sci.
426000 Izhevsk Kirov str. 132
Russia
---
Tel. +7 (34I2)43-24-59 (office), +7 (9I2)OI9-795O (home)
Skype: lyuka18 (office), lyuka17 (home)
E-mail: lyuka17 at mail.ru (office), lyuka17 at gmail.com (home)

-------------- next part --------------
DIRECTORY = /misc/home4/u3104/work/orgFeZn/Gold_23l
WIENROOT = /misc/home4/u3104/BIN/WIEN2k-19
SCRATCH = ./
Got 16 cores
nodelist tesla46
tasks_per_node 16
slurmstepd: error: _is_a_lwp: open() /proc/408167/status failed: No such file or directory
jobs_per_node 4 because OMP_NUM_THREADS = 4
4 nodes for this job: tesla46 tesla46 tesla46 tesla46
 LAPW0 END
[1]    Done                          srun -K -N1 -n4 -r0 /misc/home4/u3104/BIN/WIEN2k-19/lapw0_mpi lapw0.def >> .time00
slurmstepd: error: execve(): 0: No such file or directory
srun: error: apollo17: task 0: Exited with exit code 2
slurmstepd: error: execve(): 2: No such file or directory
srun: error: apollo17: task 0: Exited with exit code 2
slurmstepd: error: execve(): 1: No such file or directory
srun: error: apollo17: task 0: Exited with exit code 2
slurmstepd: error: execve(): 3: No such file or directory
srun: error: apollo17: task 0: Exited with exit code 2
[4]  - Done                          ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[3]  + Done                          ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[2]  + Done                          ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
[1]  + Done                          ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr <STDIN>" )
Gold_23l.scf1_1: No such file or directory.

>   stop
-------------- next part --------------

Calculating Gold_23l in /misc/home4/u3104/work/orgFeZn/Gold_23l
on tesla46 with PID 408380
using WIEN2k_19.1 (Release 25/6/2019) in /misc/home4/u3104/BIN/WIEN2k-19


    start 	(Tue Sep  8 18:57:18 +05 2020) with lapw0 (2/99 to go)

    cycle 1 	(Tue Sep  8 18:57:18 +05 2020) 	(2/99 to go)

>   lapw0   -p	(18:57:18) starting parallel lapw0 at Tue Sep  8 18:57:18 +05 2020
-------- .machine0 : 4 processors
0.056u 0.082s 0:04.65 2.7%	0+0k 16+112io 0pf+0w
>   lapw1  -p    	(18:57:23) starting parallel lapw1 at Tue Sep  8 18:57:23 +05 2020
->  starting parallel LAPW1 jobs at Tue Sep  8 18:57:23 +05 2020
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
1 t taskset0 exe def_loop.def time srun 0 lapw1 lapw1_1.def
1 t taskset0 exe def_loop.def time srun 1 lapw1 lapw1_2.def
1 t taskset0 exe def_loop.def time srun 2 lapw1 lapw1_3.def
1 t taskset0 exe def_loop.def time srun 3 lapw1 lapw1_4.def
     tesla46(6) 0.006u 0.010s 0.75 2.11%      0+0k 0+0io 0pf+0w
     tesla46(5) 0.007u 0.009s 0.75 2.11%      0+0k 0+0io 0pf+0w
     tesla46(5) 0.011u 0.005s 0.75 2.12%      0+0k 0+0io 0pf+0w
     tesla46(5) 0.008u 0.007s 0.68 2.21%      0+0k 0+0io 0pf+0w
   Summary of lapw1para:
   tesla46	 k=21	 user=0.032	 wallclock=184.35
0.268u 0.569s 0:03.29 24.9%	0+0k 6408+1120io 4pf+0w

>   stop
-------------- next part --------------
total 3088
-rw-r--r-- 1 u3104 users       0 Sep  9 16:24 aaa
-rw-r--r-- 1 u3104 users    1312 Sep  8 18:53 Gold_23l.dayfile
-rw-r--r-- 1 u3104 users     380 Sep  8 18:53 Gold_23l.klist_1
-rw-r--r-- 1 u3104 users     324 Sep  8 18:53 Gold_23l.klist_2
-rw-r--r-- 1 u3104 users     324 Sep  8 18:53 Gold_23l.klist_3
-rw-r--r-- 1 u3104 users     324 Sep  8 18:53 Gold_23l.klist_4
-rw-r--r-- 1 u3104 users    1220 Sep  8 18:53 Gold_23l.klist.tmp.u3104.408228
-rw-r--r-- 1 u3104 users     140 Sep  8 18:53 Gold_23l.mbjmix
-rw-r--r-- 1 u3104 users   76952 Sep  8 18:53 Gold_23l.output0000
-rw-r--r-- 1 u3104 users   49181 Sep  8 18:53 Gold_23l.output0001
-rw-r--r-- 1 u3104 users   49181 Sep  8 18:53 Gold_23l.output0002
-rw-r--r-- 1 u3104 users   46632 Sep  8 18:53 Gold_23l.output0003
-rw-r--r-- 1 u3104 users     280 Sep  8 18:53 Gold_23l.scf
-rw-r--r-- 1 u3104 users   17089 Sep  8 18:53 Gold_23l.scf0
-rw-r--r-- 1 u3104 users 2505132 Sep  8 18:53 Gold_23l.vns
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 Gold_23l.vnsdn
-rw-r--r-- 1 u3104 users  188433 Sep  8 18:53 Gold_23l.vsp
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 Gold_23l.vspdn
-rw-r--r-- 1 u3104 users      41 Sep  8 18:53 head.diff.u3104.408228
-rw-r--r-- 1 u3104 users    1313 Sep  8 18:53 lapw0.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw0.error
-rw-r--r-- 1 u3104 users     613 Sep  8 18:53 lapw1_1.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1_1.error
-rw-r--r-- 1 u3104 users     565 Sep  8 18:53 lapw1_2.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1_2.error
-rw-r--r-- 1 u3104 users     565 Sep  8 18:53 lapw1_3.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1_3.error
-rw-r--r-- 1 u3104 users     565 Sep  8 18:53 lapw1_4.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1_4.error
-rw-r--r-- 1 u3104 users     601 Sep  8 18:53 lapw1.def
-rw-r--r-- 1 u3104 users       0 Sep  8 18:53 lapw1.error
-rw-r--r-- 1 u3104 users     199 Sep  8 18:53 :log
-rw-r--r-- 1 u3104 users     652 Sep  8 18:53 :parallel
-rw-r--r-- 1 u3104 users     162 Sep  8 18:53 :parallel_lapw0
-rw-r--r-- 1 u3104 users    2494 Sep  8 18:53 slurm-10784772.out
-rw-r--r-- 1 u3104 users     128 Sep  8 18:53 slurm.hosts
-rwxrwxr-x 1 u3104 users    3365 Sep  8 18:53 slurm.job

