[Wien] lapwso

Mon Aug 25 08:44:35 CEST 2003

The calculation itself seems to have worked, but then either the script
lapwsopara has problems (perhaps some shell variable limit is exceeded
with 40 parallel jobs ??) or there are NFS problems.

>       longhorn1 132.0u 1.1s 1:11:07 3% 245+161283k 0+0io 22791pf+0w

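This line tells you that the actual CPU time needed was only about 130
seconds (132.0 s user + 1.1 s system); however, it took 71 minutes of
wall-clock time (1:11:07) to finish this step. That is the 3% CPU
utilization shown in the line.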
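
As a rough sketch (my own illustration in Python, not part of WIEN2k), the
numbers in such a line can be checked as follows. It assumes the csh "time"
layout seen above (user seconds, system seconds, elapsed time, percentage);
the names TIME_LINE, elapsed_to_seconds and cpu_utilization are made up for
this example.

```python
import re

# Assumed layout of a csh/tcsh "time" summary as it appears in the dayfile:
#   <user>u <system>s <elapsed> <percent>% ...
TIME_LINE = re.compile(
    r"(\d+(?:\.\d+)?)u\s+(\d+(?:\.\d+)?)s\s+"
    r"((?:\d+:)?\d+:\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)%"
)


def elapsed_to_seconds(text):
    """Convert 'h:mm:ss' or 'm:ss' (seconds may be fractional) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_utilization(line):
    """Return (cpu_seconds, wall_seconds, cpu/wall) or None if no match."""
    match = TIME_LINE.search(line)
    if match is None:
        return None
    cpu = float(match.group(1)) + float(match.group(2))
    wall = elapsed_to_seconds(match.group(3))
    ratio = cpu / wall if wall > 0 else 0.0
    return cpu, wall, ratio


if __name__ == "__main__":
    line = "longhorn1 132.0u 1.1s 1:11:07 3% 245+161283k 0+0io 22791pf+0w"
    cpu, wall, ratio = cpu_utilization(line)
    print(f"cpu = {cpu:.1f} s, wall = {wall:.0f} s, utilization = {ratio:.1%}")
```

A utilization far below 100% means the process spent most of its time
waiting (for I/O, for the network, or for a free CPU), not computing.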

When you see something like this you should start worrying! On ONE !!!
single PC the whole calculation would take about the same time (40
k-points at roughly 130 CPU seconds each is about 1.5 hours), so you do
NOT need a supercomputer and 40 processors.

Possible reasons:
- The IBM is completely overcrowded ?? No queueing system ??
- The NFS to the fileserver is completely overloaded !! (My guess)

Is this the general behaviour in all job steps? (Check the timings in the
dayfile.) If yes: contact your system administrator for help.
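
To check all job steps at once, something like the following sketch could
be used. It is only an illustration in Python under the same assumption
about the csh "time" layout as in the dayfile excerpt; the function name
low_utilization_lines and the 50% threshold are arbitrary choices of mine.

```python
import re

# Assumed csh/tcsh "time" layout: <user>u <system>s <elapsed> <percent>% ...
TIME_LINE = re.compile(
    r"(\d+(?:\.\d+)?)u\s+(\d+(?:\.\d+)?)s\s+"
    r"((?:\d+:)?\d+:\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)%"
)


def low_utilization_lines(lines, threshold=0.5):
    """Return (line_number, cpu/wall ratio, text) for slow timing lines."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        match = TIME_LINE.search(line)
        if match is None:
            continue  # not a timing line (step header, error message, ...)
        cpu = float(match.group(1)) + float(match.group(2))
        wall = 0.0
        for part in match.group(3).split(":"):
            wall = wall * 60 + float(part)
        if wall <= 0:
            continue
        ratio = cpu / wall
        if ratio < threshold:
            flagged.append((number, ratio, line.strip()))
    return flagged


if __name__ == "__main__":
    import sys

    # Usage (file name is an example): python check_dayfile.py case.dayfile
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as dayfile:
            for number, ratio, text in low_utilization_lines(dayfile):
                print(f"line {number}: {ratio:.1%} CPU utilization: {text}")
```

If nearly every timing line is flagged, the problem is general (machine or
file system); if only one program is affected, look at that step first.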

If the problem is NFS, the job may sometimes run and sometimes crash.

If the problem is in the script: put "-xf" in the first line of lapwsopara
(the "#!" interpreter line), so that the shell echoes each command as it
runs and you can see where it crashes.

In either case I'd expect better behaviour when you use fewer processors.
Usually it is recommended to have somewhat longer job steps than those 130
seconds/node, since this reduces the I/O relative to the CPU usage.
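
A back-of-the-envelope estimate of the job step length for fewer
processors, as a small Python sketch. It assumes (my simplification) that
the k-points are spread as evenly as possible and that each k-point costs
about the same 130 CPU seconds as in the dayfile line above; the function
name job_step_estimate is made up.

```python
import math


def job_step_estimate(n_kpoints, n_processors, cpu_seconds_per_kpoint):
    """Return (k-points on the busiest processor, CPU seconds of that step).

    Assumes an even distribution of k-points and equal cost per k-point.
    """
    kpoints_per_processor = math.ceil(n_kpoints / n_processors)
    return kpoints_per_processor, kpoints_per_processor * cpu_seconds_per_kpoint


if __name__ == "__main__":
    for n_processors in (40, 20, 10, 5):
        kpoints, seconds = job_step_estimate(40, n_processors, 130.0)
        print(f"{n_processors:2d} processors: {kpoints} k-point(s) each, "
              f"about {seconds / 60:.0f} min of CPU per job step")
```

With fewer processors each job step computes longer between the file
accesses at its start and end, so a slow file system matters less.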

> The calculation uses 40 k-points and 40 processors (one k-point on each
> processor). Each processor has 2 GB of memory.
>
> Already in the first iteration, lapwso -up  -p fails. I get the following
> error in the terminal output (I have shown only a few steps here, not all
> 40 of them):
>
> STOP LAPWSO END
> STOP LAPWSO END
> STOP LAPWSO END
> STOP LAPWSO END
> STOP LAPWSO END
> STOP LAPWSO END
> STOP LAPWSO END
> Subscript out of range
>
> and in the case.dayfile (here also I have shown only a few of the steps):
>
>
> lapwso -up  -p      (07:51:57) running LAPWSO in parallel mode
>       longhorn1 134.1u 1.3s 1:13:18 3% 244+158485k 0+0io 23902pf+0w
>       longhorn1 147.7u 1.3s 1:14:40 3% 241+144000k 0+0io 24494pf+0w
>       longhorn1 128.5u 1.1s 1:14:33 2% 240+165636k 0+0io 24765pf+0w
>       longhorn1 170.0u 1.5s 1:18:43 3% 232+125144k 0+0io 26377pf+0w
>       longhorn1 179.6u 1.5s 1:21:00 3% 230+118488k 0+0io 26583pf+0w
>       longhorn1 135.9u 1.4s 1:15:53 3% 237+156214k 0+0io 25765pf+0w
>       longhorn1 158.8u 1.0s 1:20:04 3% 233+134259k 0+0io 26680pf+0w
>       longhorn1 194.4u 1.3s 1:26:01 3% 228+109671k 0+0io 27885pf+0w
>       longhorn1 167.8u 1.3s 1:25:31 3% 230+126934k 0+0io 28433pf+0w
>        stop error
>
>
>
> lapwso does write case.vectorso_* (40 files for 40 processors) and
> case.outputso_*, but it stops with the above error. I checked the
> lapwso.error file:
>
> **  Error in Parallel LAPWSO
>
>
> I checked carefully for any possible mistake in the input files but to
> the best of my knowledge I found none.
>
> There are no messages other than the above that could point me to
> possible mistakes.
>
> Please let me know if I am doing something wrong. Any suggestions?

                                      P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------