[Wien] lapw1para error while running k-point parallel calculation
ROBERTO LUIS IGLESIAS PASTRANA
roberto at uniovi.es
Fri Nov 28 10:19:53 CET 2008
Hello again!
I made some more attempts on my own. Again
titin at titin-desktop:~$ ssh titin-desktop;cd ~/Programas/WIEN2k/titin/benchmark/test_case;lapw1 lapw1.def
logs in. After a while I type
$exit
logout
Connection to titin-desktop closed
I then left it alone (at this point I previously thought it was just hanging) and eventually got:
LAPW1 END!!!!!!!!!!!!!!!!!!!!!!
Inside test_case the same files appear as in the serial test, except that :log is missing and test_case.vector is now present, which was not there before. My SCRATCH is supposed to be /tmp, but it was not used here (I don't know why), and the vector file was kept in the working directory. I don't even know whether this was again a serial calculation.
Then I typed:
$grep HORB *output1*
test_case.output1: TIME HAMILT (CPU) = 13.3, HNS = 7.6, HORB = 0.0, DIAG = 28.2
In the normal serial case this was:
test_case.output1: TIME HAMILT (CPU) = 17.6, HNS = 12.7, HORB = 0.0, DIAG = 95.8
I compared the results with the files *output1_mkl8_* present in the test_case folder as well:
$tail -5 test_case.output1_mkl8_1proc
NUMBER OF K-POINTS: 1
===> TOTAL CPU TIME: 153.6 (INIT = 1.5 + K-POINTS = 152.1)
> SUM OF WALL CLOCK TIMES: 161.3 (INIT = 1.9 + K-POINTS = 159.4)
Maximum WALL clock time: 162.053758144379
Maximum CPU time: 153.720000000000
$ tail -5 test_case.output1_mkl8_2proc
NUMBER OF K-POINTS: 1
===> TOTAL CPU TIME: 118.5 (INIT = 1.5 + K-POINTS = 116.9)
> SUM OF WALL CLOCK TIMES: 124.3 (INIT = 1.8 + K-POINTS = 122.5)
Maximum WALL clock time: 124.882931947708
Maximum CPU time: 118.560000000000
$ tail -5 test_case.output1
NUMBER OF K-POINTS: 1
===> TOTAL CPU TIME: 50.0 (INIT = 1.0 + K-POINTS = 49.0)
> SUM OF WALL CLOCK TIMES: 50.4 (INIT = 1.1 + K-POINTS = 49.3)
Maximum WALL clock time: 50.5337390899658
Maximum CPU time: 50.0800000000000
Now it takes less than 1 minute and less than half the CPU time.
I therefore decided to go back to bccFe in order to check the timings when more than 1 k-point is used, and issued:
$ssh titin-desktop
Normal log in
$cd ~/Programas/WIEN2k/titin/benchmark/bccFe
$runsp_lapw -i 200 -ec 0.00001 -cc 0.0001 -p
LAPW0 END
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
cat: No match.
> stop error
where "orden no encontrada" means "command not found" in Spanish.
Therefore I typed:
~$ ssh titin-desktop;cd ~/Programas/WIEN2k/titin/benchmark/bccFe_parallel/1_Thread/bccFe;runsp_lapw -i 200 -ec 0.00001 -cc 0.0001 -p
It logged in; I left it for about 5 minutes (the serial calculation took a little more than 3) and then:
$exit
logout
Connection to titin-desktop closed.
LAPW0 END
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
cat: No match.
> stop error
AGAIN!
This seems to imply that the actual calculation takes place (wrongly) only after I log out.
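A note on the one-line form used above: in a POSIX shell, "ssh host; cd dir; cmd" is three separate commands. The shell runs ssh as an interactive login, waits until that session ends, and only then runs cd and cmd on the local side. That would be consistent with the work starting only after logout. For everything to run on the remote side, the commands have to reach ssh as a single quoted argument, for example: ssh titin-desktop "cd ~/Programas/WIEN2k/titin/benchmark/test_case; lapw1 lapw1.def". The sketch below illustrates the difference without a network. It is only an illustration under stated assumptions: fake_ssh is a hypothetical stand-in for ssh, and the temporary directory stands in for the case directory.

```shell
#!/bin/sh
# fake_ssh is a hypothetical stand-in for "ssh host": given a command string it
# runs it in a child shell; given nothing it just returns (a real ssh would
# open an interactive session here and return when you type exit).
fake_ssh() {
    if [ "$#" -gt 0 ]; then
        /bin/sh -c "$*"
    fi
}

workdir=$(mktemp -d)   # stand-in for the case directory
start=$(pwd)

# Form 1: three separate commands. cd and pwd run in THIS shell,
# after fake_ssh has already returned.
fake_ssh; cd "$workdir"; after_semicolons=$(pwd)
cd "$start"

# Form 2: one quoted string. cd and pwd run inside the child ("remote") shell,
# and the calling shell never changes directory.
inside_child=$(fake_ssh "cd '$workdir'; pwd")
after_quoted=$(pwd)

echo "semicolons: local shell ended up in $after_semicolons"
echo "quoted:     child ran in $inside_child, local shell stayed in $after_quoted"
rmdir "$workdir"
```

With the semicolon form the directory change happens in the calling shell; with the quoted form it happens only inside the child.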
Sorry if I am asking stupid questions, but I never had access to a multiple CPU machine up to now.
Furthermore, for the test_case again:
~$ ssh titin-desktop;cd ~/Programas/WIEN2k/titin/benchmark/test_case;x lapw1 -p
logs in
$ exit
logout
Connection to titin-desktop closed.
starting parallel lapw1 at vie nov 28 10:15:57 CET 2008
-> starting parallel LAPW1 jobs at vie nov 28 10:15:57 CET 2008
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 26958
bash: lapw1c: orden no encontrada
bash: fixerror_lapw: orden no encontrada
[1] Done ( ( $remote $machine[$p] ...
localhost(1) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
** LAPW1 crashed!
cat: No match.
0.116u 0.128s 0:03.21 7.1% 0+0k 0+248io 0pf+0w
error: command /home/titin/Programas/WIEN2k/lapw1cpara -c lapw1.def failed
The same error again!! It seems something is wrong with parallelization or the -p switch.
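One thing that may be worth checking here: "command not found" for lapw1c and fixerror_lapw means that the shell which tried to start them did not have the WIEN2k directory in its PATH, and the environment of a non-interactive ssh command can differ from that of an interactive login. When testing this, note that in "ssh host echo $WIENROOT" the variable is expanded by the local shell before ssh even starts, so the test always prints the local value. Single quotes pass the text through so that the remote shell expands it. The sketch below shows the effect without a network. fake_remote is a hypothetical stand-in for "ssh host", and both paths are made-up illustration values.

```shell
#!/bin/sh
# fake_remote is a hypothetical stand-in for "ssh host cmd": it runs cmd in a
# child shell whose environment differs from the caller's.
# /remote/wien2k and /local/wien2k are made-up values for illustration.
fake_remote() {
    env -i WIENROOT=/remote/wien2k /bin/sh -c "$*"
}

WIENROOT=/local/wien2k
export WIENROOT

# Unquoted: the calling shell substitutes $WIENROOT before the child starts,
# so the child only ever sees the already-expanded local path.
unquoted=$(fake_remote echo $WIENROOT)

# Single-quoted: the literal text $WIENROOT reaches the child, which expands
# it using its own environment.
quoted=$(fake_remote 'echo $WIENROOT')

echo "unquoted -> $unquoted"
echo "quoted   -> $quoted"
```

With a real connection the corresponding check would be something like ssh titin-desktop 'echo $WIENROOT; echo $PATH; which lapw1', which reports what the non-interactive remote shell actually sees.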
Thanks for your patience!!
Kind regards
Roberto
----- Original message -----
From: ROBERTO LUIS IGLESIAS PASTRANA <roberto at uniovi.es>
Date: Thursday, November 27, 2008 5:45 pm
Subject: Re: [Wien] lapw1para error while running k-point parallel calculation
> > It seems that you do not have a proper environment when doing the
>
> > ssh hostname ....
> >
> > a) Are you sure the names "localhost" work properly ? Usually you
> > should put there the correct hostname so that you can do
> > ssh hostname echo $WIENROOT
>
> > b) do you get the proper directory from the above command ? Your
> > basic error seems to be:
> > 12778bash: lapw1c: command not found
>
> Both
> ssh localhost echo $WIENROOT
> ssh titin-desktop echo $WIENROOT   (titin-desktop is the hostname)
>
> issue:
> /home/titin/Programas/WIEN2k
>
> which is the proper WIENROOT.
>
>
> > c) can you do:
> > ssh hostname
> > cd case_dir (where your files are)
> > x lapw1
>
> In both cases YES
>
> titin at titin-desktop:~/Programas/WIEN2k/titin/benchmark/test_case$ x lapw1
> LAPW1 END
> 127.187u 0.584s 2:08.12 99.7% 0+0k 10912+33256io 55pf+0w
>
> >
> > d) the parallel lapw1 works like the above, but does it at once:
> > ssh hostname;cd $PWD;lapw1 lapw1.def
>
> Do you mean just issuing all those commands exactly as you wrote? I
> log in:
>
> titin at titin-desktop:~$ ssh titin-desktop;cd ~/Programas/WIEN2k/titin/benchmark/test_case;lapw1 lapw1.def
> Linux titin-desktop 2.6.27-10-generic #1 SMP Fri Nov 21 19:19:18 UTC 2008 x86_64
>
> The programs included with the Ubuntu system are free software;
> the exact distribution terms for each program are described in the
> individual files in /usr/share/doc/*/copyright.
>
> Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
> applicable law.
>
> To access official Ubuntu documentation, please visit:
> http://help.ubuntu.com/
> Last login: Thu Nov 27 14:28:16 2008 from titin-desktop
>
> I left it there for about 3 hours and then logged out, but ssh hung.
> After Control+C I got:
>
> forrtl: error (69): process interrupted (SIGINT)
> Image    PC                  Routine    Line       Source
> lapw1    00000000004A5EB2    Unknown    Unknown    Unknown
> Stack trace terminated abnormally.
>
> Thus it seems something was running, after all, but was stuck
> somehow, since it was taking forever to run this simple lapw1
> process for the test_case.
>
> Thanks a lot for your input!
>
> Roberto
>
> >
> >
> >
> > ROBERTO LUIS IGLESIAS PASTRANA wrote:
> > > Hello all!
> > > I'm trying to get k-point parallelism up and running on my
> > > computer, which has an Intel(R) Core(TM)2 Quad Q9300 @2.50GHz CPU
> > > and runs Ubuntu 8.10, using ifort 11.0.069, mkl libraries
> > > 10.1.0.015 and the Wien2k_08.3 version. I tried it first with
> > > test-case from the benchmarking Wien2k web page. I wanted to do a
> > > benchmark such as the one in the thread starting from:
> > > http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011238.html
> > > I wrote the following .machines file for my 4 processors:
> > >
> > > granularity:1
> > > 1:localhost
> > > 1:localhost
> > > 1:localhost
> > > 1:localhost
> > > extrafine:1
> > >
> > > When running x lapw1 -p I get the following error:
> > > titin at titin-desktop:~/Programas/WIEN2k/titin/benchmark/test_case$ x lapw1 -p
> > > starting parallel lapw1 at jue nov 27 13:33:33 CET 2008
> > > -> starting parallel LAPW1 jobs at jue nov 27 13:33:33 CET 2008
> > > running LAPW1 in parallel mode (using .machines)
> > > 4 number_of_parallel_jobs
> > > [1] 12778
> > > bash: lapw1c: command not found
> > > bash: fixerror_lapw: command not found
> > > [1] Done ( ( $remote $machine[$p] ...
> > > localhost(1) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > > ** LAPW1 crashed!
> > > cat: No match.
> > > 0.100u 0.160s 0:02.97 8.7% 0+0k 0+248io 0pf+0w
> > > error: command /home/titin/Programas/WIEN2k/lapw1cpara -c lapw1.def failed
> > > Digging through the Wien2k mailing list archives, I did not find
> > > any problem exactly like mine. There were some posts regarding the
> > > correct linking in the WIEN2k ROOT directory, therefore I checked:
> > > titin at titin-desktop:~/Programas/WIEN2k$ ls -alsp lapw1*
> > > 11596 -rwxr-xr-x 1 titin titin 11857076 2008-11-20 19:18 lapw1
> > > 11492 -rwxr-xr-x 1 titin titin 11747349 2008-11-20 19:18 lapw1c
> > >     0 lrwxrwxrwx 1 titin titin        9 2008-11-18 19:24 lapw1cpara -> lapw1para
> > >     0 lrwxrwxrwx 1 titin titin       14 2008-11-18 19:24 lapw1para -> lapw1para_lapw
> > >    20 -rwxr-xr-x 1 titin titin    16661 2008-11-18 19:24 lapw1para_lapw
> > > I think this means the links to the parallel versions are OK,
> > > doesn't it?
> > > I also thought the problem may be due to the fact that test_case
> > > had only one k-point in its *.klist file, as suggested by Peter in
> > > the above mentioned thread
> > > http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011266.html
> > > Then I decided to try for a bccFe unit cell.
> > > The error was multiplied by 4 in this case:
> > > titin at titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw0 -p
> > > starting parallel lapw0 at jue nov 27 13:11:34 CET 2008
> > > -------- .machine0 : processors
> > > running lapw0 in single mode
> > > LAPW0 END
> > > 1.448u 0.108s 0:01.55 99.3% 0+0k 16+448io 0pf+0w
> > > titin at titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw1 -p
> > > starting parallel lapw1 at jue nov 27 13:11:52 CET 2008
> > > -> starting parallel LAPW1 jobs at jue nov 27 13:11:52 CET 2008
> > > running LAPW1 in parallel mode (using .machines)
> > > 4 number_of_parallel_jobs
> > > [1] 12297
> > > [2] 12317
> > > [3] 12337
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > [2] - Done ( ( $remote $machine[$p] ...
> > > [1] - Done ( ( $remote $machine[$p] ...
> > > [4] 12401
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > [4] - Done ( ( $remote $machine[$p] ...
> > > [3] + Done ( ( $remote $machine[$p] ...
> > > [1] 12466
> > > [2] 12486
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > [1] - Done ( ( $remote $machine[$p] ...
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > [2] Done ( ( $remote $machine[$p] ...
> > > localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > > localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > > localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > > localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > > localhost(1) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > > localhost(1) 0.004u 0.000s 0.00 400.00% 0+0k 0+0io 0pf+0w
> > > ** LAPW1 crashed!
> > > cat: No match.
> > > 0.276u 0.228s 0:10.02 4.8% 0+0k 128+992io 1pf+0w
> > > error: command /home/titin/Programas/WIEN2k/lapw1para lapw1.def failed
> > > Could this have something to do with communication between the
> > four CPUs? I first thought it could be due to passwordless ssh
> > login failure, but issuing:
> > > titin at titin-desktop:~$ ssh titin-desktop
> > > Linux titin-desktop 2.6.27-10-generic #1 SMP Fri Nov 21 19:19:18 UTC 2008 x86_64
> > >
> > > The programs included with the Ubuntu system are free software;
> > > the exact distribution terms for each program are described in the
> > > individual files in /usr/share/doc/*/copyright.
> > >
> > > Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
> > > applicable law.
> > >
> > > To access official Ubuntu documentation, please visit:
> > > http://help.ubuntu.com/
> > > Last login: Thu Nov 27 13:07:11 2008 from localhost
> > > seems to get through correctly.
> > > Maybe I'm asking something rather trivial, but I can't find a
> > > solution. Does somebody have any idea? I would be very glad to
> > > receive suggestions. Please don't hesitate to let me know if you
> > > need any further information.
> > > Have a nice day!
> > > Roberto
> > > _______________________________________________
> > > Wien mailing list
> > > Wien at zeus.theochem.tuwien.ac.at
> > > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > --
> >
> > P.Blaha
> > --------------------------------------------------------------------------
> > Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> > Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
> > Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
> > --------------------------------------------------------------------------
> >
> > _______________________________________________
> > Wien mailing list
> > Wien at zeus.theochem.tuwien.ac.at
> > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> >