[Wien] lapw1para error while running k-point parallel calculation

ROBERTO LUIS IGLESIAS PASTRANA roberto at uniovi.es
Thu Nov 27 13:57:35 CET 2008


Hello all!

I'm trying to get k-point parallelism running on my computer, which has an Intel(R) Core(TM)2 Quad Q9300 @ 2.50GHz CPU and runs Ubuntu 8.10, with ifort 11.0.069, MKL libraries 10.1.0.015, and Wien2k_08.3. I tried it first with the test_case from the benchmark section of the Wien2k web page. I wanted to run a benchmark like the one in the thread starting at:

http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011238.html

I wrote the following .machines file for my 4 processors:

granularity:1
1:localhost
1:localhost
1:localhost
1:localhost
extrafine:1
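
For reference, a file with this layout can be generated for any number of local cores. The following is only a minimal sketch in POSIX shell; the function name make_machines is hypothetical and not part of WIEN2k, and it simply reproduces the layout listed above (one "1:localhost" line per core between the granularity and extrafine lines).

```shell
# Hypothetical helper (not a WIEN2k tool): print a .machines file
# with one k-point parallel job per local core.
make_machines() {
    ncores=$1
    echo "granularity:1"
    i=0
    while [ "$i" -lt "$ncores" ]; do
        echo "1:localhost"
        i=$((i + 1))
    done
    echo "extrafine:1"
}

# Print the layout for a quad-core machine.
make_machines 4
```

Redirecting the output, e.g. make_machines 4 > .machines, would recreate the file shown above.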

When running x lapw1 -p I get the following error:

titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/test_case$ x lapw1 -p
starting parallel lapw1 at jue nov 27 13:33:33 CET 2008
->  starting parallel LAPW1 jobs at jue nov 27 13:33:33 CET 2008
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 12778
bash: lapw1c: command not found
bash: fixerror_lapw: command not found
[1]    Done                          ( ( $remote $machine[$p]  ...
     localhost(1) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
**  LAPW1 crashed!
cat: No match.
0.100u 0.160s 0:02.97 8.7%	0+0k 0+248io 0pf+0w
error: command   /home/titin/Programas/WIEN2k/lapw1cpara -c lapw1.def   failed

Searching the Wien2k mailing list archives, I did not find a problem exactly like mine. Some posts discussed the correct linking in the WIEN2k root directory, so I checked:

titin@titin-desktop:~/Programas/WIEN2k$ ls -alsp lapw1*
11596 -rwxr-xr-x 1 titin titin 11857076 2008-11-20 19:18 lapw1
11492 -rwxr-xr-x 1 titin titin 11747349 2008-11-20 19:18 lapw1c
    0 lrwxrwxrwx 1 titin titin        9 2008-11-18 19:24 lapw1cpara -> lapw1para
    0 lrwxrwxrwx 1 titin titin       14 2008-11-18 19:24 lapw1para -> lapw1para_lapw
   20 -rwxr-xr-x 1 titin titin    16661 2008-11-18 19:24 lapw1para_lapw

I think this means the links to the parallel versions are OK, doesn't it?

I also thought the problem might be that test_case has only one k-point in its *.klist file, as Peter suggested in the above-mentioned thread:

http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011266.html

Then I decided to try a bccFe unit cell. In this case the same error was repeated several times:

titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw0 -p
starting parallel lapw0 at jue nov 27 13:11:34 CET 2008
-------- .machine0 : processors

running lapw0 in single mode
 LAPW0 END
1.448u 0.108s 0:01.55 99.3%	0+0k 16+448io 0pf+0w
titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw1 -p
starting parallel lapw1 at jue nov 27 13:11:52 CET 2008
->  starting parallel LAPW1 jobs at jue nov 27 13:11:52 CET 2008
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 12297
[2] 12317
[3] 12337
bash: lapw1: command not found
bash: fixerror_lapw: command not found
bash: lapw1: command not found
bash: fixerror_lapw: command not found
[2]  - Done                          ( ( $remote $machine[$p]  ...
[1]  - Done                          ( ( $remote $machine[$p]  ...
[4] 12401
bash: lapw1: command not found
bash: fixerror_lapw: command not found
bash: lapw1: command not found
bash: fixerror_lapw: command not found
[4]  - Done                          ( ( $remote $machine[$p]  ...
[3]  + Done                          ( ( $remote $machine[$p]  ...
[1] 12466
[2] 12486
bash: lapw1: command not found
bash: fixerror_lapw: command not found
[1]  - Done                          ( ( $remote $machine[$p]  ...
bash: lapw1: command not found
bash: fixerror_lapw: command not found
[2]    Done                          ( ( $remote $machine[$p]  ...
     localhost(62) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
     localhost(62) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
     localhost(62) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
     localhost(62) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
     localhost(1) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
     localhost(1) 0.004u 0.000s 0.00 400.00%      0+0k 0+0io 0pf+0w
**  LAPW1 crashed!
cat: No match.
0.276u 0.228s 0:10.02 4.8%	0+0k 128+992io 1pf+0w
error: command   /home/titin/Programas/WIEN2k/lapw1para lapw1.def   failed

Could this have something to do with communication between the four CPUs? I first thought it could be due to a failure of passwordless ssh login, but issuing:

titin@titin-desktop:~$ ssh titin-desktop
Linux titin-desktop 2.6.27-10-generic #1 SMP Fri Nov 21 19:19:18 UTC 2008 x86_64

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Thu Nov 27 13:07:11 2008 from localhost

seems to get through correctly.
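
An interactive login like the one above does not necessarily show the environment that a command started through ssh receives. As a rough sketch (POSIX shell; missing_in_path is a hypothetical helper, not a WIEN2k tool), one could list which of the programs named in the error messages are absent from a given PATH string:

```shell
# Hypothetical helper (not part of WIEN2k): print every command name
# that cannot be found as an executable file in the given PATH string.
missing_in_path() {
    search_path=$1
    shift
    for cmd in "$@"; do
        found=no
        old_ifs=$IFS
        IFS=:                       # split the PATH string on colons
        for dir in $search_path; do
            if [ -x "$dir/$cmd" ] && [ ! -d "$dir/$cmd" ]; then
                found=yes
                break
            fi
        done
        IFS=$old_ifs
        [ "$found" = yes ] || echo "$cmd"
    done
}

# Check the PATH of the current (interactive) shell.
missing_in_path "$PATH" lapw1 lapw1c fixerror_lapw
```

Feeding it the output of ssh titin-desktop 'echo $PATH' instead of the local $PATH would show what a remotely started job sees. That the parallel scripts start their jobs this way is an assumption based on the "$remote $machine[$p]" lines in the output above.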

Maybe I'm asking something rather trivial, but I can't find a solution. Does anybody have any idea? I would be very glad to receive suggestions. Please don't hesitate to let me know if you need any other information.

Have a nice day!

Roberto

