[Wien] lapw1para error while running k-point parallel calculation
ROBERTO LUIS IGLESIAS PASTRANA
roberto at uniovi.es
Thu Nov 27 13:57:35 CET 2008
Hello all!
I'm trying to get k-point parallelism up and running on my computer, which has an Intel(R) Core(TM)2 Quad Q9300 @ 2.50GHz CPU and runs Ubuntu 8.10, with ifort 11.0.069, MKL 10.1.0.015 and WIEN2k_08.3. I first tried it with the test_case from the benchmark page of the WIEN2k web site. I wanted to run a benchmark like the one in the thread starting at:
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011238.html
I wrote the following .machines file for my 4 processors:
granularity:1
1:localhost
1:localhost
1:localhost
1:localhost
extrafine:1
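A file like this can also be generated instead of typed by hand. The following is only a sketch: `make_machines` is a made-up helper name (not part of WIEN2k), and it assumes one k-point job per core, all on `localhost`, with the same `granularity` and `extrafine` lines as in the file above.

```shell
# make_machines N: print a k-point-parallel .machines file with N jobs
# on localhost. Hypothetical helper; the layout is copied from the file above.
make_machines() {
    n="$1"
    i=0
    echo "granularity:1"
    while [ "$i" -lt "$n" ]; do
        echo "1:localhost"
        i=$((i + 1))
    done
    echo "extrafine:1"
}
```

Running `make_machines 4 > .machines` in the case directory should reproduce the file shown above.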
When running x lapw1 -p I get the following error:
titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/test_case$ x lapw1 -p
starting parallel lapw1 at jue nov 27 13:33:33 CET 2008
-> starting parallel LAPW1 jobs at jue nov 27 13:33:33 CET 2008
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 12778
bash: lapw1c: command not found
bash: fixerror_lapw: command not found
[1] Done ( ( $remote $machine[$p] ...
localhost(1) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
** LAPW1 crashed!
cat: No match.
0.100u 0.160s 0:02.97 8.7% 0+0k 0+248io 0pf+0w
error: command /home/titin/Programas/WIEN2k/lapw1cpara -c lapw1.def failed
Searching the WIEN2k mailing list archives, I did not find a problem exactly like mine. There were some posts about the links in the WIEN2k root directory being set up correctly, so I checked:
titin@titin-desktop:~/Programas/WIEN2k$ ls -alsp lapw1*
11596 -rwxr-xr-x 1 titin titin 11857076 2008-11-20 19:18 lapw1
11492 -rwxr-xr-x 1 titin titin 11747349 2008-11-20 19:18 lapw1c
0 lrwxrwxrwx 1 titin titin 9 2008-11-18 19:24 lapw1cpara -> lapw1para
0 lrwxrwxrwx 1 titin titin 14 2008-11-18 19:24 lapw1para -> lapw1para_lapw
20 -rwxr-xr-x 1 titin titin 16661 2008-11-18 19:24 lapw1para_lapw
I think this means the links to the parallel versions are OK, doesn't it?
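The same check can be done without reading the `ls` output by eye. This is a minimal sketch, assuming GNU `readlink -f` is available (as on Ubuntu); `check_links` is a made-up helper name. It follows each wrapper link to its final target and reports whether that target is an executable file.

```shell
# check_links DIR: resolve the parallel wrapper links in DIR and report
# whether each one ends at an executable file. Hypothetical helper.
check_links() {
    dir="$1"
    for name in lapw1para lapw1cpara; do
        # readlink -f follows the whole chain of symbolic links (GNU coreutils)
        target="$(readlink -f "$dir/$name")"
        if [ -n "$target" ] && [ -x "$target" ]; then
            echo "ok: $name -> $target"
        else
            echo "BROKEN: $name"
        fi
    done
}
```

For the listing above, `check_links ~/Programas/WIEN2k` would be the call. Note that this only verifies the links themselves; it says nothing about whether that directory is on the PATH of the shell that actually launches the jobs.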
I also thought the problem might be that test_case has only one k-point in its *.klist file, as suggested by Peter in the above-mentioned thread:
http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011266.html
So I decided to try a bccFe unit cell instead. In this case the same error was simply repeated several times:
titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw0 -p
starting parallel lapw0 at jue nov 27 13:11:34 CET 2008
-------- .machine0 : processors
running lapw0 in single mode
LAPW0 END
1.448u 0.108s 0:01.55 99.3% 0+0k 16+448io 0pf+0w
titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw1 -p
starting parallel lapw1 at jue nov 27 13:11:52 CET 2008
-> starting parallel LAPW1 jobs at jue nov 27 13:11:52 CET 2008
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 12297
[2] 12317
[3] 12337
bash: lapw1: command not found
bash: fixerror_lapw: command not found
bash: lapw1: command not found
bash: fixerror_lapw: command not found
[2] - Done ( ( $remote $machine[$p] ...
[1] - Done ( ( $remote $machine[$p] ...
[4] 12401
bash: lapw1: command not found
bash: fixerror_lapw: command not found
bash: lapw1: command not found
bash: fixerror_lapw: command not found
[4] - Done ( ( $remote $machine[$p] ...
[3] + Done ( ( $remote $machine[$p] ...
[1] 12466
[2] 12486
bash: lapw1: command not found
bash: fixerror_lapw: command not found
[1] - Done ( ( $remote $machine[$p] ...
bash: lapw1: command not found
bash: fixerror_lapw: command not found
[2] Done ( ( $remote $machine[$p] ...
localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
localhost(1) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
localhost(1) 0.004u 0.000s 0.00 400.00% 0+0k 0+0io 0pf+0w
** LAPW1 crashed!
cat: No match.
0.276u 0.228s 0:10.02 4.8% 0+0k 128+992io 1pf+0w
error: command /home/titin/Programas/WIEN2k/lapw1para lapw1.def failed
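With this much output it is easy to lose track of which programs the shell actually failed to find. A small filter can summarize it. This is only a sketch: `missing_commands` is a made-up name, and it assumes the messages have the `bash: NAME: command not found` form seen above.

```shell
# missing_commands: read shell output on stdin and print every command that
# was reported as "command not found", with the number of reports for each.
# Hypothetical helper; tolerant of a missing space after the second colon.
missing_commands() {
    sed -n 's/^bash: *\([^: ]*\) *: *command not found.*$/\1/p' | sort | uniq -c
}
```

Something like `x lapw1 -p 2>&1 | missing_commands` would then list only the missing program names and their counts.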
Could this have something to do with communication between the four CPUs? I first thought it could be due to a failing passwordless ssh login, but issuing:
titin@titin-desktop:~$ ssh titin-desktop
Linux titin-desktop 2.6.27-10-generic #1 SMP Fri Nov 21 19:19:18 UTC 2008 x86_64
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Thu Nov 27 13:07:11 2008 from localhost
seems to get through correctly.
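An interactive login like this one is not quite the same as running a single command through ssh: a command passed directly to ssh runs in a non-interactive shell, which may not execute all of the startup files that set up PATH. The sketch below probes that case. It rests on two assumptions: that the parallel script starts its jobs with something like `ssh localhost <command>` (suggested by the `$remote $machine[$p]` text in the output, but not verified), and that the login shell understands sh syntax (the messages above come from bash). `check_wien_cmds` is a made-up helper name.

```shell
# check_wien_cmds LAUNCHER...: run a small probe through LAUNCHER and report
# which of the programs from the error messages the launched shell can find.
# LAUNCHER is any command that accepts a script string as its last argument,
# for example "ssh localhost" or "sh -c". Hypothetical helper.
check_wien_cmds() {
    "$@" 'for c in lapw1 lapw1c fixerror_lapw; do
        if command -v "$c" >/dev/null 2>&1; then
            echo "found: $c"
        else
            echo "MISSING: $c"
        fi
    done
    echo "PATH=$PATH"'
}
```

Calling `check_wien_cmds ssh localhost` and comparing the printed PATH with `echo $PATH` in the interactive terminal would show whether the WIEN2k directory is visible to the launched shell.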
Maybe I'm asking something rather trivial, but I can't find a solution. Does anybody have an idea? Any suggestions would be very welcome. Please don't hesitate to let me know if you need any other information.
Have a nice day!
Roberto