[Wien] running jobs without $SCRATCH

jadhikari@clarku.edu jadhikari at clarku.edu
Fri Apr 6 23:36:46 CEST 2007


Hi,
Thank you very much for the reply.

This time the distribution of k-points was even. I did use $SCRATCH with
granularity:1, but got the lapw1 error shown in the dayfile below. I am
still not fully confident that the error will not reappear in later
calculations, but for now it is acceptable.
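
For reference, the .machines file for this run has the form sketched below
(abbreviated here; the full file contains 24 such lines, simply alternating
node4 and node5, matching the machine list printed in the dayfile):

    1:node4
    1:node5
    ...
    1:node4
    1:node5
    granularity:1
    extrafine:1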

Before this, the same compound ran fine with RKmax 7.00, without any
errors. In those earlier calculations (60 IBZ k-points, RKmax 7) I used a
granularity larger than 1 and the case converged well while using the
$SCRATCH directory. Now, after scaling RKmax up to 9.00, the run always
crashes. In the dayfile below only lapw1_1 and lapw1_2 report errors and no
information is given for the other k-points; the failure is in lapw1 alone.

This error seems to come from a lack of proper communication between the
nodes and their processors. I tried various combinations of the delay,
sleepy and wait settings but could never find one optimal set of values.
The new version 7.2 works well for small-scale calculations, but for larger
systems the synchronization of the various processors and nodes becomes a
problem, with the run halting at the wait statement. I commented it out,
deleted it, doubled it, halved it... every attempt ended in failure. There
must be something I am missing here.
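
To be concrete, by delay, sleepy and wait I mean the settings inside the
lapw1para csh script. Schematically the relevant part looks something like
this (the values shown are only placeholders, not my actual settings, and
the exact layout of the script differs between installations):

    # schematic excerpt from lapw1para (csh); placeholder values only
    set delay  = 1   # pause, in seconds, between launching successive lapw1_$i jobs
    set sleepy = 1   # interval, in seconds, for polling whether the launched jobs are done
    ...
    # once all jobs are launched, the backgrounded lapw1_$i processes are
    # collected with the csh builtin
    wait

It is this combination of launch delays and the final wait that I have been
varying without success.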

I would be very grateful for your suggestions; any idea about how to set
the sleepy, delay and wait parameters would be welcome.

Regards,
Subin


____________________DAYFILE______________________________________________
Calculating tio2 in /scratch/14777.master/tio2
on node4 with PID 5938

    start 	(Fri Apr  6 12:42:03 EDT 2007) with lapw0 (60/20 to go)

    cycle 1 	(Fri Apr  6 12:42:03 EDT 2007) 	(60/20 to go)

>   lapw0 -p	(12:42:03) starting parallel lapw0 at Fri Apr  6 12:42:03 EDT 2007
--------
running lapw0 in single mode
2.064u 0.076s 0:02.14 99.5%	0+0k 0+0io 0pf+0w
>   lapw1  -p 	(12:42:05) starting parallel lapw1 at Fri Apr  6 12:42:05 EDT 2007
->  starting parallel LAPW1 jobs at Fri Apr  6 12:42:05 EDT 2007
Fri Apr 6 12:42:05 EDT 2007 -> Setting up case tio1 for parallel execution
Fri Apr 6 12:42:05 EDT 2007 -> of LAPW1
Fri Apr 6 12:42:05 EDT 2007 ->
Fri Apr 6 12:42:05 EDT 2007 -> non sp
running LAPW1 in parallel mode (using .machines)
Granularity set to 1
Extrafine set
Fri Apr 6 12:42:05 EDT 2007 -> klist:       24
Fri Apr 6 12:42:05 EDT 2007 -> machines:    node4 node5 node4 node5 node4 node5 node4 node5 node4 node5 node4 node5 node4 node5 node4 node5 node4 node5 node4 node5 node4 node5 node4 node5
Fri Apr 6 12:42:05 EDT 2007 -> procs:       24
Fri Apr 6 12:42:05 EDT 2007 -> weigh(old):  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Fri Apr 6 12:42:05 EDT 2007 -> sumw:        24
Fri Apr 6 12:42:05 EDT 2007 -> granularity: 1
Fri Apr 6 12:42:05 EDT 2007 -> weigh(new):  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Fri Apr 6 12:42:05 EDT 2007 -> Splitting tio1.klist.tmp into junks
::::::::::::::
.machinetmp222
::::::::::::::
node4
node5
node4
node5
node4
node5
node4
node5
node4
node5
node4
node5
node4
node5
node4
node5
node4
node5
node4
node5
node4
node5
node4
node5
.machinetmp222
24 number_of_parallel_jobs
prepare 1 on node4
Fri Apr 6 12:42:05 EDT 2007 -> Creating klist 1
1 : 1k (node4, 1)
Fri Apr 6 12:42:05 EDT 2007 ->
Fri Apr 6 12:42:05 EDT 2007 -> creating lapw1_1.def:
prepare 2 on node5
Fri Apr 6 12:42:09 EDT 2007 -> Creating klist 2
2 : 1k (node5, 1)
Fri Apr 6 12:42:09 EDT 2007 ->
Fri Apr 6 12:42:09 EDT 2007 -> creating lapw1_2.def:
prepare 3 on node4
Fri Apr 6 12:42:13 EDT 2007 -> Creating klist 3
3 : 1k (node4, 1)
Fri Apr 6 12:42:13 EDT 2007 ->
Fri Apr 6 12:42:13 EDT 2007 -> creating lapw1_3.def:
prepare 4 on node5
Fri Apr 6 12:42:17 EDT 2007 -> Creating klist 4
4 : 1k (node5, 1)
Fri Apr 6 12:42:17 EDT 2007 ->
Fri Apr 6 12:42:17 EDT 2007 -> creating lapw1_4.def:
prepare 5 on node4
Fri Apr 6 12:42:21 EDT 2007 -> Creating klist 5
5 : 1k (node4, 1)
Fri Apr 6 12:42:21 EDT 2007 ->
Fri Apr 6 12:42:21 EDT 2007 -> creating lapw1_5.def:
prepare 6 on node5
Fri Apr 6 12:42:25 EDT 2007 -> Creating klist 6
6 : 1k (node5, 1)
Fri Apr 6 12:42:25 EDT 2007 ->
Fri Apr 6 12:42:25 EDT 2007 -> creating lapw1_6.def:
prepare 7 on node4
Fri Apr 6 12:42:29 EDT 2007 -> Creating klist 7
7 : 1k (node4, 1)
Fri Apr 6 12:42:29 EDT 2007 ->
Fri Apr 6 12:42:29 EDT 2007 -> creating lapw1_7.def:
prepare 8 on node5
Fri Apr 6 12:42:33 EDT 2007 -> Creating klist 8
8 : 1k (node5, 1)
Fri Apr 6 12:42:33 EDT 2007 ->
Fri Apr 6 12:42:33 EDT 2007 -> creating lapw1_8.def:
prepare 9 on node4
Fri Apr 6 12:42:37 EDT 2007 -> Creating klist 9
9 : 1k (node4, 1)
Fri Apr 6 12:42:37 EDT 2007 ->
Fri Apr 6 12:42:37 EDT 2007 -> creating lapw1_9.def:
prepare 10 on node5
Fri Apr 6 12:42:41 EDT 2007 -> Creating klist 10
10 : 1k (node5, 1)
Fri Apr 6 12:42:41 EDT 2007 ->
Fri Apr 6 12:42:41 EDT 2007 -> creating lapw1_10.def:
prepare 11 on node4
Fri Apr 6 12:42:45 EDT 2007 -> Creating klist 11
11 : 1k (node4, 1)
Fri Apr 6 12:42:45 EDT 2007 ->
Fri Apr 6 12:42:45 EDT 2007 -> creating lapw1_11.def:
prepare 12 on node5
Fri Apr 6 12:42:49 EDT 2007 -> Creating klist 12
12 : 1k (node5, 1)
Fri Apr 6 12:42:49 EDT 2007 ->
Fri Apr 6 12:42:49 EDT 2007 -> creating lapw1_12.def:
prepare 13 on node4
Fri Apr 6 12:42:53 EDT 2007 -> Creating klist 13
13 : 1k (node4, 1)
Fri Apr 6 12:42:54 EDT 2007 ->
Fri Apr 6 12:42:54 EDT 2007 -> creating lapw1_13.def:
prepare 14 on node5
Fri Apr 6 12:42:58 EDT 2007 -> Creating klist 14
14 : 1k (node5, 1)
Fri Apr 6 12:42:58 EDT 2007 ->
Fri Apr 6 12:42:58 EDT 2007 -> creating lapw1_14.def:
prepare 15 on node4
Fri Apr 6 12:43:02 EDT 2007 -> Creating klist 15
15 : 1k (node4, 1)
Fri Apr 6 12:43:02 EDT 2007 ->
Fri Apr 6 12:43:02 EDT 2007 -> creating lapw1_15.def:
prepare 16 on node5
Fri Apr 6 12:43:06 EDT 2007 -> Creating klist 16
16 : 1k (node5, 1)
Fri Apr 6 12:43:06 EDT 2007 ->
Fri Apr 6 12:43:06 EDT 2007 -> creating lapw1_16.def:
prepare 17 on node4
Fri Apr 6 12:43:10 EDT 2007 -> Creating klist 17
17 : 1k (node4, 1)
Fri Apr 6 12:43:10 EDT 2007 ->
Fri Apr 6 12:43:10 EDT 2007 -> creating lapw1_17.def:
prepare 18 on node5
Fri Apr 6 12:43:14 EDT 2007 -> Creating klist 18
18 : 1k (node5, 1)
Fri Apr 6 12:43:14 EDT 2007 ->
Fri Apr 6 12:43:14 EDT 2007 -> creating lapw1_18.def:
prepare 19 on node4
Fri Apr 6 12:43:18 EDT 2007 -> Creating klist 19
19 : 1k (node4, 1)
Fri Apr 6 12:43:18 EDT 2007 ->
Fri Apr 6 12:43:18 EDT 2007 -> creating lapw1_19.def:
prepare 20 on node5
Fri Apr 6 12:43:22 EDT 2007 -> Creating klist 20
20 : 1k (node5, 1)
Fri Apr 6 12:43:22 EDT 2007 ->
Fri Apr 6 12:43:22 EDT 2007 -> creating lapw1_20.def:
prepare 21 on node4
Fri Apr 6 12:43:26 EDT 2007 -> Creating klist 21
21 : 1k (node4, 1)
Fri Apr 6 12:43:26 EDT 2007 ->
Fri Apr 6 12:43:26 EDT 2007 -> creating lapw1_21.def:
prepare 22 on node5
Fri Apr 6 12:43:30 EDT 2007 -> Creating klist 22
22 : 1k (node5, 1)
Fri Apr 6 12:43:30 EDT 2007 ->
Fri Apr 6 12:43:30 EDT 2007 -> creating lapw1_22.def:
prepare 23 on node4
Fri Apr 6 12:43:34 EDT 2007 -> Creating klist 23
23 : 1k (node4, 1)
Fri Apr 6 12:43:34 EDT 2007 ->
Fri Apr 6 12:43:34 EDT 2007 -> creating lapw1_23.def:
prepare 24 on node5
Fri Apr 6 12:43:38 EDT 2007 -> Creating klist 24
24 : 1k (node5, 1)
Fri Apr 6 12:43:38 EDT 2007 ->
Fri Apr 6 12:43:38 EDT 2007 -> creating lapw1_24.def:
waiting for all processes to complete
Fri Apr 6 12:43:50 EDT 2007 -> all processes done.
testerror lapw1_1
testerror lapw1_2
**  LAPW1 crashed!
0.396u 0.747s 1:53.09 0.9%	0+0k 0+0io 0pf+0w
error: command   /usr/opt/WIEN2k_7/lapw1para lapw1.def   failed

>   stop error

_________________________________________________________________________

>
> Probably you misunderstood the answer. With your granularity:4 you
> cannot use $SCRATCH (that is why you ran into trouble). By putting
> granularity:1 you can continue to use $SCRATCH; only the load
> balancing on your system might not be optimal.
>
> Stefaan
>
>
>
> Quoting jadhikari at clarku.edu:
>
>> Prof. P Blaha,
>>
>> Thank you very much for the answer.
>>
>> We have a distributed-memory cluster with a master node and about 48
>> child nodes with 2 processors each. We use the $SCRATCH space of the
>> nodes so that no jobs run on the master node (running on the master
>> node alone is not allowed).
>>
>> I have no idea how to run jobs on the nodes without using the $SCRATCH
>> space. I would be very grateful for any suggestion regarding this.
>>
>> Waiting for the reply.
>> Subin
>>
>>
>>
>>> Put    granularity:1
>>> This will distribute all the k-points evenly at once (and not one after
>>> the other, which is useful for load balancing, but then you cannot use
>>> $SCRATCH).
>>>
>>>
>>>
>>> jadhikari at clarku.edu wrote:
>>>> Dear Wien users,
>>>>
>>>> I have a question concerning the k-point distribution.
>>>> 24 IBZ k-points are evenly distributed over 3 nodes and 6 processors,
>>>> with each processor getting 4 IBZ k-points as shown below (each node
>>>> has 2 processors).
>>>>
>>>> But this is not always the case. Sometimes one processor gets more
>>>> k-points than the others, and the system always crashes when this
>>>> happens.
>>>>
>>>> Is there a way to control this inhomogeneity? All the processors are
>>>> of equal speed. The .machines file is shown at the end.
>>>>
>>>> Thank you.
>>>>
>>>> Subin
>>>>
>>>> _________________________________________________________________________
>>>>      node2(1) 9.176u 0.092s 0:09.28 99.7%       0+0k 0+0io 0pf+0w
>>>>      node5(1) 9.715u 0.118s 0:10.50 93.5%       0+0k 0+0io 0pf+0w
>>>>      node9(1) 9.754u 0.130s 0:11.75 84.0%       0+0k 0+0io 0pf+0w
>>>>      node2(1) 10.918u 0.112s 0:17.80 61.9%      0+0k 0+0io 0pf+0w
>>>>      node5(1) 9.453u 0.114s 0:11.28 84.7%       0+0k 0+0io 0pf+0w
>>>>      node9(1) 9.995u 0.117s 0:13.79 73.2%       0+0k 0+0io 0pf+0w
>>>>      node2(1) 9.286u 0.095s 0:09.40 99.6%       0+0k 0+0io 0pf+0w
>>>>      node5(1) 11.702u 0.115s 0:12.99 90.9%      0+0k 0+0io 0pf+0w
>>>>      node9(1) 9.336u 0.110s 0:16.29 57.9%       0+0k 0+0io 0pf+0w
>>>>      node2(1) 9.403u 0.111s 0:15.62 60.8%       0+0k 0+0io 0pf+0w
>>>>      node5(1) 11.607u 0.116s 0:15.94 73.4%      0+0k 0+0io 0pf+0w
>>>>      node9(1) 9.595u 0.119s 0:13.52 71.7%       0+0k 0+0io 0pf+0w
>>>>      node2(1) 9.207u 0.112s 0:10.64 87.5%       0+0k 0+0io 0pf+0w
>>>>      node5(1) 11.135u 0.124s 0:14.81 75.9%      0+0k 0+0io 0pf+0w
>>>>      node9(1) 9.985u 0.114s 0:16.91 59.6%       0+0k 0+0io 0pf+0w
>>>>      node2(1) 10.602u 0.118s 0:18.33 58.4%      0+0k 0+0io 0pf+0w
>>>>      node5(1) 11.476u 0.106s 0:16.98 68.1%      0+0k 0+0io 0pf+0w
>>>>      node9(1) 9.325u 0.100s 0:13.75 68.5%       0+0k 0+0io 0pf+0w
>>>>      node2(1) 9.447u 0.109s 0:10.03 95.1%       0+0k 0+0io 0pf+0w
>>>>      node5(1) 9.997u 0.115s 0:11.08 91.1%       0+0k 0+0io 0pf+0w
>>>>      node9(1) 10.821u 0.119s 0:19.06 57.3%      0+0k 0+0io 0pf+0w
>>>>      node2(1) 9.400u 0.097s 0:13.84 68.5%       0+0k 0+0io 0pf+0w
>>>>      node5(1) 11.749u 0.130s 0:17.38 68.2%      0+0k 0+0io 0pf+0w
>>>>      node9(1) 9.436u 0.112s 0:12.45 76.6%       0+0k 0+0io 0pf+0w
>>>>    Summary of lapw1para:
>>>>    node2         k=8     user=77.439     wallclock=104.94
>>>>    node5         k=8     user=86.834     wallclock=110.96
>>>>    node9         k=8     user=78.247     wallclock=117.52
>>>>    node2         k=8     user=77.439     wallclock=104.94
>>>>    node5         k=8     user=86.834     wallclock=110.96
>>>>    node9         k=8     user=78.247     wallclock=117.52
>>>> _________________________________________________________
>>>> .machines file
>>>>
>>>> 1:node2
>>>> 1:node5
>>>> 1:node9
>>>> 1:node2
>>>> 1:node5
>>>> 1:node9
>>>> granularity:4
>>>> extrafine:1
>>>>
>>>
>>>
>>
>>
>>
>
>
>
> --
> Stefaan Cottenier
> Instituut voor Kern- en Stralingsfysica
> K.U.Leuven
> Celestijnenlaan 200 D
> B-3001 Leuven (Belgium)
>
> tel: + 32 16 32 71 45
> fax: + 32 16 32 79 85
> e-mail: stefaan.cottenier at fys.kuleuven.be
>
>
>
>
>


