[Wien] Cannot run kpoint parallel jobs - only serial. _An offer to developers_

Thu May 30 11:25:52 CEST 2013

On 30.05.2013 10:00, Lyudmila Dobysheva wrote:
> 29.05.2013 23:58, Robert Nichol wrote:
>> If I submit the script for k-point parallelization, lapw2 crashes.
>> contents of  a case.dayfile
>   n0523(1) 0.094u 0.014s 0.12 84.38%      0+0k 0+0io 0pf+0w
>     n0523     k=11     user=0.88     wallclock=622.038
> 0.974u 3.769s 0:07.71 61.3%    0+0k 424+11064io 4pf+0w
>
> Dear Robert,
>
> It looks like lapw1 does not work at all, due to a wrong setting in the file parallel_options
>
> There should be:
> setenv USE_REMOTE 0

No, don't do this. USE_REMOTE 0 is only for a single shared-memory machine (with this setting you cannot get from node1 to node2).

lapw1 was working (at least most of the time), but such a heavy parallelization is not supported by your system.

You wrote:
not only is case.energy_11 missing, but so is case.energy_12 and every 12th case.energy file after those (11/12/23/24/35/36/47 are all missing).

With your machines file (47 lines, always 12 lines for the same node) this indicates that your system (either the PBS or the Linux setup)
allows 10 (ssh nodeXX lapw1 lapw1_YY.def) jobs per node, but not the 12 that would be necessary for a complete run.
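
The pattern of missing files can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and it assumes that job N corresponds to line N of the machines file and that the lines for one node are consecutive. The limit of 10 jobs per node is the hypothesis stated above, not a measured value.

```python
def blocked_jobs(n_jobs, lines_per_node, max_per_node):
    """Return the 1-based job numbers whose slot on their node exceeds the limit."""
    blocked = []
    for job in range(1, n_jobs + 1):
        # Position of this job within its node's block of lines (1..lines_per_node)
        slot = (job - 1) % lines_per_node + 1
        if slot > max_per_node:
            blocked.append(job)
    return blocked


# 47 jobs, 12 lines per node, assumed limit of 10 jobs per node
print(blocked_jobs(47, 12, 10))  # [11, 12, 23, 24, 35, 36, 47]
```

The result matches the list of missing case.energy files quoted above, which is what points to a per-node limit rather than to a problem in lapw1 itself.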

As I said before, you must tailor your parallelization to your case.
Such a small system with 47 k-points can probably run efficiently only on a few cores (as suggested in the machines file
by L. Marks).

In case you want to use all 12 cores of one node:
Set the environment variable   OMP_NUM_THREADS=2    (in your .bashrc file or in the PBS script)
and use 6 lines in .machines:

8:node1
8:node1
8:node1
8:node1
8:node1
7:node1

A higher parallelization is definitely meaningless for this case.
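
The six lines above can be derived mechanically. The following is a minimal sketch with a hypothetical helper name. It assumes that the leading number on each line acts as a relative weight for distributing the k-points, which is consistent with the weights in the example summing to 47.

```python
def machines_lines(n_kpoints, n_jobs, host):
    """Split n_kpoints as evenly as possible over n_jobs lines for one host."""
    base, extra = divmod(n_kpoints, n_jobs)
    # The first 'extra' jobs get one k-point more than the rest
    weights = [base + 1] * extra + [base] * (n_jobs - extra)
    return [f"{w}:{host}" for w in weights]


omp_threads = 2   # value of OMP_NUM_THREADS suggested above
n_jobs = 6        # number of k-parallel lines for the node

for line in machines_lines(47, n_jobs, "node1"):
    print(line)
print("cores used:", n_jobs * omp_threads)  # cores used: 12
```

With 6 k-parallel jobs of 2 threads each, all 12 cores of the node are busy while only 6 remote jobs are started, which stays below the apparent limit of 10.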

>
> Two months ago, we already had a message about this problem here on the mailing list ("error in lapw2 - parallel", Mar 22 2013).
> I'd like to suggest that the developers look into why the error files are empty when lapw1 has not actually run.
> Maybe creation of the nonzero error file should be moved to an earlier place in lapw1para. Now, when lapw1para fails due to this wrong setenv option, the nonzero error
> files are still not created, and old zero-size error files remain in the directory. totalexec checks testerror, concludes that everything is ok, and goes on to lapw2.

This is not so easy. The error files are created by lapw1. When lapw1 has never been started, it
cannot create error files.
Of course, most Linux commands also set a "status" variable, which could be used to find out whether the
"ssh nodeXX lapw1 lapw1_1.def" command was successful. But this is dangerous, because even unimportant
small problems (like the format expected when analyzing the output of the time command) could indicate
a failure (although the actual lapw1 calculation was fine)...
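
As a manual check that does not rely on the exit status, one can look for the energy files before going on to lapw2. The sketch below is not part of WIEN2k. The function name is hypothetical, and the file naming case.energy_N is taken from the file names quoted in this thread.

```python
import os
import tempfile


def missing_energy_files(directory, case, n_jobs):
    """Return job numbers whose <case>.energy_<N> file is absent or empty."""
    missing = []
    for job in range(1, n_jobs + 1):
        path = os.path.join(directory, f"{case}.energy_{job}")
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(job)
    return missing


# Small demonstration in a temporary directory:
# job 1 wrote data, job 2 never ran, job 3 left an empty file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "case.energy_1"), "w") as fh:
        fh.write("dummy\n")
    open(os.path.join(tmp, "case.energy_3"), "w").close()
    print(missing_energy_files(tmp, "case", 3))  # [2, 3]
```

A non-empty result means that some lapw1 jobs did not produce output, whatever the error files say.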

> Best regards
>    Lyudmila Dobysheva
> ------------------------------------------------------------------
> Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
> 426001 Izhevsk, ul.Kirova 132
> RUSSIA
> ------------------------------------------------------------------
> Tel.:7(3412) 442118 (home), 218988(office), 722529(Fax)
> E-mail: lyu at ftiudm.ru
>          lyuka17 at mail.ru (office) lyuka17 at gmail.com (home)
> Skype:  lyuka17 (home), lyuka18 (office)
> http://fti.udm.ru/content/view/25/103/lang,english/
> ------------------------------------------------------------------

-- 
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------