[Wien] running jobs without $SCRATCH

Laurence Marks L-marks at northwestern.edu
Fri Apr 6 23:40:54 CEST 2007


99.9999% certain this is NOT an issue with Wien2k but with how your
system is set up. You need to talk to your sysadmin.

On 4/6/07, jadhikari at clarku.edu <jadhikari at clarku.edu> wrote:
> Hi,
> Thank you very much for the reply.
>
> This time the distribution of k points was even. I did use $SCRATCH with
> granularity:1 but got the lapw1 error shown below in the dayfile. I am
> still not fully convinced that the error will not reappear in later
> calculations, but for now it is ok.
>
> Before this, the same compound with RKmax 7.00 ran fine without any
> errors. In those earlier calculations (60 IBZ k points, RKmax 7) I used a
> granularity larger than 1 and the run converged well with the $SCRATCH
> directory. Now, after increasing RKmax to 9.00, it always crashes. In the
> dayfile we can see that only lapw1_1 and lapw1_2 report errors and no
> information is given for the other k points; the failure is only in lapw1.
>
> This error seems to come from a lack of proper communication between the
> nodes and their processors. I tried various combinations of the delay,
> sleepy and wait settings but could never find one optimal set of values.
> The new version 7.2 works very well for small-scale calculations, but for
> larger systems the synchronization of the various processors and nodes
> becomes the problem, with the system halting at the wait statement. I
> commented it out, deleted it, doubled it, halved it... every attempt
> ended in failure. There must be something I am missing here.
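>
> (For reference, these settings live as plain shell variables in the
> lapw1para csh script; the sketch below is from memory of our own copy,
> so the exact defaults and the surrounding lines may well differ in other
> installations.)
>
>     # near the top of lapw1para
>     set delay  = 1    # pause (s) between launching successive lapw1_* jobs
>     set sleepy = 1    # pause (s) in the loop that polls for finished jobs
>     ...
>     # further down, after the jobs have been launched in the background
>     wait              # blocks until all background lapw1 jobs have returned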
>
> I would be very grateful for your suggestions; any idea about how to set
> the sleepy, delay and wait parameters would be especially welcome.
>
> Regards,
> Subin
>
>
> ____________________DAYFILE______________________________________________
> Calculating tio2 in /scratch/14777.master/tio2
> on node4 with PID 5938
>
>     start       (Fri Apr  6 12:42:03 EDT 2007) with lapw0 (60/20 to go)
>
>     cycle 1     (Fri Apr  6 12:42:03 EDT 2007)  (60/20 to go)
>
> >   lapw0 -p    (12:42:03) starting parallel lapw0 at Fri Apr  6 12:42:03 EDT
> 2007
> --------
> running lapw0 in single mode
> 2.064u 0.076s 0:02.14 99.5%     0+0k 0+0io 0pf+0w
> >   lapw1  -p   (12:42:05) starting parallel lapw1 at Fri Apr  6 12:42:05
> EDT 2007
> ->  starting parallel LAPW1 jobs at Fri Apr  6 12:42:05 EDT 2007
> Fri Apr 6 12:42:05 EDT 2007 -> Setting up case tio1 for parallel execution
> Fri Apr 6 12:42:05 EDT 2007 -> of LAPW1
> Fri Apr 6 12:42:05 EDT 2007 ->
> Fri Apr 6 12:42:05 EDT 2007 -> non sp
> running LAPW1 in parallel mode (using .machines)
> Granularity set to 1
> Extrafine set
> Fri Apr 6 12:42:05 EDT 2007 -> klist:       24
> Fri Apr 6 12:42:05 EDT 2007 -> machines:    node4 node5 node4 node5 node4
> node5 node4 node5 node4 node5 node4 node5 node4 node5 node4 node5 node4
> node5 node4 node5 node4 node5 node4 node5
> Fri Apr 6 12:42:05 EDT 2007 -> procs:       24
> Fri Apr 6 12:42:05 EDT 2007 -> weigh(old):  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1
> Fri Apr 6 12:42:05 EDT 2007 -> sumw:        24
> Fri Apr 6 12:42:05 EDT 2007 -> granularity: 1
> Fri Apr 6 12:42:05 EDT 2007 -> weigh(new):  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1 1
> Fri Apr 6 12:42:05 EDT 2007 -> Splitting tio1.klist.tmp into junks
> ::::::::::::::
> .machinetmp222
> ::::::::::::::
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> node4
> node5
> .machinetmp222
> 24 number_of_parallel_jobs
> prepare 1 on node4
> Fri Apr 6 12:42:05 EDT 2007 -> Creating klist 1
> 1 : 1k (node4, 1)
> Fri Apr 6 12:42:05 EDT 2007 ->
> Fri Apr 6 12:42:05 EDT 2007 -> creating lapw1_1.def:
> prepare 2 on node5
> Fri Apr 6 12:42:09 EDT 2007 -> Creating klist 2
> 2 : 1k (node5, 1)
> Fri Apr 6 12:42:09 EDT 2007 ->
> Fri Apr 6 12:42:09 EDT 2007 -> creating lapw1_2.def:
> prepare 3 on node4
> Fri Apr 6 12:42:13 EDT 2007 -> Creating klist 3
> 3 : 1k (node4, 1)
> Fri Apr 6 12:42:13 EDT 2007 ->
> Fri Apr 6 12:42:13 EDT 2007 -> creating lapw1_3.def:
> prepare 4 on node5
> Fri Apr 6 12:42:17 EDT 2007 -> Creating klist 4
> 4 : 1k (node5, 1)
> Fri Apr 6 12:42:17 EDT 2007 ->
> Fri Apr 6 12:42:17 EDT 2007 -> creating lapw1_4.def:
> prepare 5 on node4
> Fri Apr 6 12:42:21 EDT 2007 -> Creating klist 5
> 5 : 1k (node4, 1)
> Fri Apr 6 12:42:21 EDT 2007 ->
> Fri Apr 6 12:42:21 EDT 2007 -> creating lapw1_5.def:
> prepare 6 on node5
> Fri Apr 6 12:42:25 EDT 2007 -> Creating klist 6
> 6 : 1k (node5, 1)
> Fri Apr 6 12:42:25 EDT 2007 ->
> Fri Apr 6 12:42:25 EDT 2007 -> creating lapw1_6.def:
> prepare 7 on node4
> Fri Apr 6 12:42:29 EDT 2007 -> Creating klist 7
> 7 : 1k (node4, 1)
> Fri Apr 6 12:42:29 EDT 2007 ->
> Fri Apr 6 12:42:29 EDT 2007 -> creating lapw1_7.def:
> prepare 8 on node5
> Fri Apr 6 12:42:33 EDT 2007 -> Creating klist 8
> 8 : 1k (node5, 1)
> Fri Apr 6 12:42:33 EDT 2007 ->
> Fri Apr 6 12:42:33 EDT 2007 -> creating lapw1_8.def:
> prepare 9 on node4
> Fri Apr 6 12:42:37 EDT 2007 -> Creating klist 9
> 9 : 1k (node4, 1)
> Fri Apr 6 12:42:37 EDT 2007 ->
> Fri Apr 6 12:42:37 EDT 2007 -> creating lapw1_9.def:
> prepare 10 on node5
> Fri Apr 6 12:42:41 EDT 2007 -> Creating klist 10
> 10 : 1k (node5, 1)
> Fri Apr 6 12:42:41 EDT 2007 ->
> Fri Apr 6 12:42:41 EDT 2007 -> creating lapw1_10.def:
> prepare 11 on node4
> Fri Apr 6 12:42:45 EDT 2007 -> Creating klist 11
> 11 : 1k (node4, 1)
> Fri Apr 6 12:42:45 EDT 2007 ->
> Fri Apr 6 12:42:45 EDT 2007 -> creating lapw1_11.def:
> prepare 12 on node5
> Fri Apr 6 12:42:49 EDT 2007 -> Creating klist 12
> 12 : 1k (node5, 1)
> Fri Apr 6 12:42:49 EDT 2007 ->
> Fri Apr 6 12:42:49 EDT 2007 -> creating lapw1_12.def:
> prepare 13 on node4
> Fri Apr 6 12:42:53 EDT 2007 -> Creating klist 13
> 13 : 1k (node4, 1)
> Fri Apr 6 12:42:54 EDT 2007 ->
> Fri Apr 6 12:42:54 EDT 2007 -> creating lapw1_13.def:
> prepare 14 on node5
> Fri Apr 6 12:42:58 EDT 2007 -> Creating klist 14
> 14 : 1k (node5, 1)
> Fri Apr 6 12:42:58 EDT 2007 ->
> Fri Apr 6 12:42:58 EDT 2007 -> creating lapw1_14.def:
> prepare 15 on node4
> Fri Apr 6 12:43:02 EDT 2007 -> Creating klist 15
> 15 : 1k (node4, 1)
> Fri Apr 6 12:43:02 EDT 2007 ->
> Fri Apr 6 12:43:02 EDT 2007 -> creating lapw1_15.def:
> prepare 16 on node5
> Fri Apr 6 12:43:06 EDT 2007 -> Creating klist 16
> 16 : 1k (node5, 1)
> Fri Apr 6 12:43:06 EDT 2007 ->
> Fri Apr 6 12:43:06 EDT 2007 -> creating lapw1_16.def:
> prepare 17 on node4
> Fri Apr 6 12:43:10 EDT 2007 -> Creating klist 17
> 17 : 1k (node4, 1)
> Fri Apr 6 12:43:10 EDT 2007 ->
> Fri Apr 6 12:43:10 EDT 2007 -> creating lapw1_17.def:
> prepare 18 on node5
> Fri Apr 6 12:43:14 EDT 2007 -> Creating klist 18
> 18 : 1k (node5, 1)
> Fri Apr 6 12:43:14 EDT 2007 ->
> Fri Apr 6 12:43:14 EDT 2007 -> creating lapw1_18.def:
> prepare 19 on node4
> Fri Apr 6 12:43:18 EDT 2007 -> Creating klist 19
> 19 : 1k (node4, 1)
> Fri Apr 6 12:43:18 EDT 2007 ->
> Fri Apr 6 12:43:18 EDT 2007 -> creating lapw1_19.def:
> prepare 20 on node5
> Fri Apr 6 12:43:22 EDT 2007 -> Creating klist 20
> 20 : 1k (node5, 1)
> Fri Apr 6 12:43:22 EDT 2007 ->
> Fri Apr 6 12:43:22 EDT 2007 -> creating lapw1_20.def:
> prepare 21 on node4
> Fri Apr 6 12:43:26 EDT 2007 -> Creating klist 21
> 21 : 1k (node4, 1)
> Fri Apr 6 12:43:26 EDT 2007 ->
> Fri Apr 6 12:43:26 EDT 2007 -> creating lapw1_21.def:
> prepare 22 on node5
> Fri Apr 6 12:43:30 EDT 2007 -> Creating klist 22
> 22 : 1k (node5, 1)
> Fri Apr 6 12:43:30 EDT 2007 ->
> Fri Apr 6 12:43:30 EDT 2007 -> creating lapw1_22.def:
> prepare 23 on node4
> Fri Apr 6 12:43:34 EDT 2007 -> Creating klist 23
> 23 : 1k (node4, 1)
> Fri Apr 6 12:43:34 EDT 2007 ->
> Fri Apr 6 12:43:34 EDT 2007 -> creating lapw1_23.def:
> prepare 24 on node5
> Fri Apr 6 12:43:38 EDT 2007 -> Creating klist 24
> 24 : 1k (node5, 1)
> Fri Apr 6 12:43:38 EDT 2007 ->
> Fri Apr 6 12:43:38 EDT 2007 -> creating lapw1_24.def:
> waiting for all processes to complete
> Fri Apr 6 12:43:50 EDT 2007 -> all processes done.
> testerror lapw1_1
> testerror lapw1_2
> **  LAPW1 crashed!
> 0.396u 0.747s 1:53.09 0.9%      0+0k 0+0io 0pf+0w
> error: command   /usr/opt/WIEN2k_7/lapw1para lapw1.def   failed
>
> >   stop error
>
> _________________________________________________________________________
>
> >
> > Probably you misunderstood the answer. With your granularity:4 you
> > cannot use $SCRATCH (that's why you ran into trouble). By putting
> > granularity:1 you can continue to use $SCRATCH; only the load
> > balancing of your system might not be optimal.
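> >
> > (Concretely, in the .machines file from your first mail this just means
> > changing the granularity line and keeping everything else as it is,
> > something like
> >
> >     1:node2
> >     1:node5
> >     1:node9
> >     1:node2
> >     1:node5
> >     1:node9
> >     granularity:1
> >     extrafine:1
> >
> > with the node names of course taken from your own setup.)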
> >
> > Stefaan
> >
> >
> >
> > Quoting jadhikari at clarku.edu:
> >
> >> Prof. P Blaha,
> >>
> >> Thank you very much for the answer.
> >>
> >> We have a distributed-memory cluster with a master node and about 48
> >> child nodes with 2 processors each. We use the $SCRATCH space of the
> >> nodes so that no jobs run on the master node (using the master node
> >> alone is not allowed).
> >>
> >> I have no idea how to run jobs on the nodes without using the $SCRATCH
> >> space. I would be very grateful for any suggestion regarding this.
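> >>
> >> (I suppose running without $SCRATCH would just mean pointing it at the
> >> working directory instead of the node-local disks, e.g. something like
> >>
> >>     setenv SCRATCH ./
> >>
> >> in the environment the job runs in, but I am not sure whether that is
> >> the recommended way, so please correct me if this is wrong.)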
> >>
> >> Waiting for the reply.
> >> Subin
> >>
> >>
> >>
> >>> Put    granularity:1
> >>> This will evenly distribute the k-points at once (and not one after the
> >>> other, which is useful for load balancing, but then you cannot use
> >>> $SCRATCH).
> >>>
> >>>
> >>>
> >>> jadhikari at clarku.edu schrieb:
> >>>> Dear Wien users,
> >>>>
> >>>> I have a question concerning k-point distribution.
> >>>> 24 IBZ points are evenly distributed to 3 nodes and 6 processors, with
> >>>> each processor getting 4 IBZ points as shown below. (1 node has 2
> >>>> processors.)
> >>>>
> >>>> But this is not always the situation. Sometimes the number of k points
> >>>> that one processor gets is larger than that of the others, and the
> >>>> system always crashes when this happens.
> >>>>
> >>>> Is there a way to control this inhomogeneity? All the processors are
> >>>> of equal speed. The .machines file is shown at the end.
> >>>>
> >>>> Thank you.
> >>>>
> >>>> Subin
> >>>>
> >>>> _________________________________________________________________________
> >>>>      node2(1) 9.176u 0.092s 0:09.28 99.7%       0+0k 0+0io 0pf+0w
> >>>>      node5(1) 9.715u 0.118s 0:10.50 93.5%       0+0k 0+0io 0pf+0w
> >>>>      node9(1) 9.754u 0.130s 0:11.75 84.0%       0+0k 0+0io 0pf+0w
> >>>>      node2(1) 10.918u 0.112s 0:17.80 61.9%      0+0k 0+0io 0pf+0w
> >>>>      node5(1) 9.453u 0.114s 0:11.28 84.7%       0+0k 0+0io 0pf+0w
> >>>>      node9(1) 9.995u 0.117s 0:13.79 73.2%       0+0k 0+0io 0pf+0w
> >>>>      node2(1) 9.286u 0.095s 0:09.40 99.6%       0+0k 0+0io 0pf+0w
> >>>>      node5(1) 11.702u 0.115s 0:12.99 90.9%      0+0k 0+0io 0pf+0w
> >>>>      node9(1) 9.336u 0.110s 0:16.29 57.9%       0+0k 0+0io 0pf+0w
> >>>>      node2(1) 9.403u 0.111s 0:15.62 60.8%       0+0k 0+0io 0pf+0w
> >>>>      node5(1) 11.607u 0.116s 0:15.94 73.4%      0+0k 0+0io 0pf+0w
> >>>>      node9(1) 9.595u 0.119s 0:13.52 71.7%       0+0k 0+0io 0pf+0w
> >>>>      node2(1) 9.207u 0.112s 0:10.64 87.5%       0+0k 0+0io 0pf+0w
> >>>>      node5(1) 11.135u 0.124s 0:14.81 75.9%      0+0k 0+0io 0pf+0w
> >>>>      node9(1) 9.985u 0.114s 0:16.91 59.6%       0+0k 0+0io 0pf+0w
> >>>>      node2(1) 10.602u 0.118s 0:18.33 58.4%      0+0k 0+0io 0pf+0w
> >>>>      node5(1) 11.476u 0.106s 0:16.98 68.1%      0+0k 0+0io 0pf+0w
> >>>>      node9(1) 9.325u 0.100s 0:13.75 68.5%       0+0k 0+0io 0pf+0w
> >>>>      node2(1) 9.447u 0.109s 0:10.03 95.1%       0+0k 0+0io 0pf+0w
> >>>>      node5(1) 9.997u 0.115s 0:11.08 91.1%       0+0k 0+0io 0pf+0w
> >>>>      node9(1) 10.821u 0.119s 0:19.06 57.3%      0+0k 0+0io 0pf+0w
> >>>>      node2(1) 9.400u 0.097s 0:13.84 68.5%       0+0k 0+0io 0pf+0w
> >>>>      node5(1) 11.749u 0.130s 0:17.38 68.2%      0+0k 0+0io 0pf+0w
> >>>>      node9(1) 9.436u 0.112s 0:12.45 76.6%       0+0k 0+0io 0pf+0w
> >>>>    Summary of lapw1para:
> >>>>    node2         k=8     user=77.439     wallclock=104.94
> >>>>    node5         k=8     user=86.834     wallclock=110.96
> >>>>    node9         k=8     user=78.247     wallclock=117.52
> >>>>    node2         k=8     user=77.439     wallclock=104.94
> >>>>    node5         k=8     user=86.834     wallclock=110.96
> >>>>    node9         k=8     user=78.247     wallclock=117.52
> >>>> _________________________________________________________
> >>>> .machines file
> >>>>
> >>>> 1:node2
> >>>> 1:node5
> >>>> 1:node9
> >>>> 1:node2
> >>>> 1:node5
> >>>> 1:node9
> >>>> granularity:4
> >>>> extrafine:1
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> >
> > --
> > Stefaan Cottenier
> > Instituut voor Kern- en Stralingsfysica
> > K.U.Leuven
> > Celestijnenlaan 200 D
> > B-3001 Leuven (Belgium)
> >
> > tel: + 32 16 32 71 45
> > fax: + 32 16 32 79 85
> > e-mail: stefaan.cottenier at fys.kuleuven.be
> >
> >
> > Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm
> >
> >
> >
>
>


-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
EMM2007 http://ns.crys.ras.ru/EMMM07/
Commission on Electron Diffraction of IUCR
www.numis.northwestern.edu/IUCR_CED

