[Wien] lapw1para error while running k-point parallel calculation
ROBERTO LUIS IGLESIAS PASTRANA
roberto at uniovi.es
Thu Nov 27 17:45:58 CET 2008
>
> It seems that you do not have a proper environment when doing the
> ssh hostname ....
>
> a) Are you sure the names "localhost" work properly ? Usually you
> should put there the correct hostname so that you can do
> ssh hostname echo $WIENROOT
> b) do you get the proper directory from the above command ? Your
> basic error seems to be:
> 12778bash: lapw1c: command not found
Both
ssh localhost echo $WIENROOT
ssh titin-desktop echo $WIENROOT
(titin-desktop is the hostname) print:
/home/titin/Programas/WIEN2k
which is the proper WIENROOT.
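One caveat about this check: written without quotes, $WIENROOT is expanded by the local shell before ssh is even started, so the output above only shows the local value. A minimal sketch of the difference follows. It assumes a clean child shell (env -i /bin/sh) as a stand-in for the non-interactive shell that ssh starts, so it can be run without any ssh setup; the variable names unquoted and quoted are only illustrative.

```shell
# Value as set in the interactive login shell (path taken from the post).
WIENROOT=/home/titin/Programas/WIEN2k
export WIENROOT

# Double quotes (same effect as no quotes): the *calling* shell substitutes
# $WIENROOT before the child starts, so the path is printed no matter what
# the child's environment looks like.
unquoted=$(env -i /bin/sh -c "echo $WIENROOT")

# Single quotes: the literal string $WIENROOT reaches the child, which
# expands it itself. Its environment is empty here, so the result is empty.
quoted=$(env -i /bin/sh -c 'echo $WIENROOT')

echo "expanded by caller: [$unquoted]"
echo "expanded by child:  [$quoted]"
```

With a real host the corresponding check would be ssh titin-desktop 'echo $WIENROOT; which lapw1c'. An empty result there would be consistent with the "lapw1c: command not found" message quoted above, although other causes are possible.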
> c) can you do:
> ssh hostname
> cd case_dir (where your files are)
> x lapw1
In both cases YES
titin at titin-desktop:~/Programas/WIEN2k/titin/benchmark/test_case$ x lapw1
LAPW1 END
127.187u 0.584s 2:08.12 99.7% 0+0k 10912+33256io 55pf+0w
>
> d) the parallel lapw1 works like the above, but does it at once:
> ssh hostname;cd $PWD;lapw1 lapw1.def
Do you mean just issuing all those commands exactly as you wrote? I log in:
titin at titin-desktop:~$ ssh titin-desktop;cd ~/Programas/WIEN2k/titin/benchmark/test_case;lapw1 lapw1.def
Linux titin-desktop 2.6.27-10-generic #1 SMP Fri Nov 21 19:19:18 UTC 2008 x86_64
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Thu Nov 27 14:28:16 2008 from titin-desktop
I left it there for about 3 hours and then logged out, but ssh hung. After Control+C I got:
forrtl: error (69): process interrupted (SIGINT)
Image PC Routine Line Source
lapw1 00000000004A5EB2 Unknown Unknown Unknown
Stack trace terminated abnormally.
So something was running after all, but it seemed to be stuck: this simple lapw1 run for test_case was taking forever.
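A note on how the shell reads that command line: typed at a prompt with unquoted semicolons, it is three separate local commands. ssh gets no command and opens an interactive login; cd and lapw1 run on the local side only after that session ends. The sketch below illustrates the splitting. fake_ssh is a hypothetical stand-in that only reports the command string it was handed, and case_dir is a placeholder, so nothing here contacts a host or runs WIEN2k.

```shell
# Hypothetical stand-in for ssh: prints the command string it receives
# instead of connecting anywhere.
fake_ssh() {
    host=$1
    shift
    echo "remote($host) receives: [$*]"
}

# Unquoted semicolons: the local shell splits the line into three commands.
# fake_ssh is handed nothing (a real ssh would start an interactive login),
# and the two echo lines mark what would then run locally.
unquoted=$(fake_ssh titin-desktop; echo "local: cd case_dir"; echo "local: lapw1 lapw1.def")

# Quoted: the whole sequence travels as one argument to the remote side.
quoted=$(fake_ssh titin-desktop 'cd case_dir; lapw1 lapw1.def')

printf '%s\n' "$unquoted" "$quoted"
```

To run the one-shot form against a real host, the remote part has to be quoted, for example ssh titin-desktop 'cd ~/Programas/WIEN2k/titin/benchmark/test_case; lapw1 lapw1.def'.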
Thanks a lot for your input!
Roberto
>
>
>
> ROBERTO LUIS IGLESIAS PASTRANA schrieb:
> > Hello all!
> > I'm trying to get k-point parallelism up and running on my
> > computer, which has an Intel(R) Core(TM)2 Quad Q9300 @ 2.50GHz CPU
> > and runs Ubuntu 8.10, using ifort 11.0.069, MKL libraries
> > 10.1.0.015 and Wien2k_08.3. I tried it first with test_case
> > from the benchmarking WIEN2k web page. I wanted to do a
> > benchmark such as the one in the thread starting from:
> > http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011238.html
> > I wrote the following .machines file for my 4 processors:
> > granularity:1
> > 1:localhost
> > 1:localhost
> > 1:localhost
> > 1:localhost
> > extrafine:1
> > When running x lapw1 -p I get the following error:
> > titin at titin-desktop:~/Programas/WIEN2k/titin/benchmark/test_case$ x lapw1 -p
> > starting parallel lapw1 at jue nov 27 13:33:33 CET 2008
> > -> starting parallel LAPW1 jobs at jue nov 27 13:33:33 CET 2008
> > running LAPW1 in parallel mode (using .machines)
> > 4 number_of_parallel_jobs
> > [1] 12778
> > bash: lapw1c: command not found
> > bash: fixerror_lapw: command not found
> > [1] Done ( ( $remote $machine[$p] ...
> > localhost(1) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > ** LAPW1 crashed!
> > cat: No match.
> > 0.100u 0.160s 0:02.97 8.7% 0+0k 0+248io 0pf+0w
> > error: command /home/titin/Programas/WIEN2k/lapw1cpara -c lapw1.def failed
> > Digging through the WIEN2k mailing list archives, I did not find
> > any problem exactly like mine. There were some posts regarding the
> > correct linking in the WIEN2k root directory, therefore I checked:
> > titin at titin-desktop:~/Programas/WIEN2k$ ls -alsp lapw1*
> > 11596 -rwxr-xr-x 1 titin titin 11857076 2008-11-20 19:18 lapw1
> > 11492 -rwxr-xr-x 1 titin titin 11747349 2008-11-20 19:18 lapw1c
> >     0 lrwxrwxrwx 1 titin titin        9 2008-11-18 19:24 lapw1cpara -> lapw1para
> >     0 lrwxrwxrwx 1 titin titin       14 2008-11-18 19:24 lapw1para -> lapw1para_lapw
> >    20 -rwxr-xr-x 1 titin titin    16661 2008-11-18 19:24 lapw1para_lapw
> > I think this means the links to the parallel versions are OK,
> doesn't it?
> > I also thought the problem may be due to the fact that test_case
> > had only one k-point in its *.klist file, as suggested by Peter in
> > the above mentioned thread
> > http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011266.html
> > Then I decided to try for a bccFe unit cell.
> > The error was multiplied by 4 in this case:
> > titin at titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw0 -p
> > starting parallel lapw0 at jue nov 27 13:11:34 CET 2008
> > -------- .machine0 : processors
> > running lapw0 in single mode
> > LAPW0 END
> > 1.448u 0.108s 0:01.55 99.3% 0+0k 16+448io 0pf+0w
> > titin at titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw1 -p
> > starting parallel lapw1 at jue nov 27 13:11:52 CET 2008
> > -> starting parallel LAPW1 jobs at jue nov 27 13:11:52 CET 2008
> > running LAPW1 in parallel mode (using .machines)
> > 4 number_of_parallel_jobs
> > [1] 12297
> > [2] 12317
> > [3] 12337
> > bash: lapw1: command not found
> > bash: fixerror_lapw: command not found
> > bash: lapw1: command not found
> > bash: fixerror_lapw: command not found
> > [2] - Done ( ( $remote $machine[$p] ...
> > [1] - Done ( ( $remote $machine[$p] ...
> > [4] 12401
> > bash: lapw1: command not found
> > bash: fixerror_lapw: command not found
> > bash: lapw1: command not found
> > bash: fixerror_lapw: command not found
> > [4] - Done ( ( $remote $machine[$p] ...
> > [3] + Done ( ( $remote $machine[$p] ...
> > [1] 12466
> > [2] 12486
> > bash: lapw1: command not found
> > bash: fixerror_lapw: command not found
> > [1] - Done ( ( $remote $machine[$p] ...
> > bash: lapw1: command not found
> > bash: fixerror_lapw: command not found
> > [2] Done ( ( $remote $machine[$p] ...
> > localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > localhost(62) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > localhost(1) 0.000u 0.000s 0.00 0.00% 0+0k 0+0io 0pf+0w
> > localhost(1) 0.004u 0.000s 0.00 400.00% 0+0k 0+0io 0pf+0w
> > ** LAPW1 crashed!
> > cat: No match.
> > 0.276u 0.228s 0:10.02 4.8% 0+0k 128+992io 1pf+0w
> > error: command /home/titin/Programas/WIEN2k/lapw1para lapw1.def failed
> > Could this have something to do with communication between the
> four CPUs? I first thought it could be due to passwordless ssh
> login failure, but issuing:
> > titin at titin-desktop:~$ ssh titin-desktop
> > Linux titin-desktop 2.6.27-10-generic #1 SMP Fri Nov 21 19:19:18 UTC 2008 x86_64
> > The programs included with the Ubuntu system are free software;
> > the exact distribution terms for each program are described in the
> > individual files in /usr/share/doc/*/copyright.
> > Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
> > applicable law.
> > To access official Ubuntu documentation, please visit:
> > http://help.ubuntu.com/
> > Last login: Thu Nov 27 13:07:11 2008 from localhost
> > seems to get through correctly.
> > Maybe I'm asking something rather trivial, but I can't find a
> > solution. Does somebody have any idea? I would be very glad to
> > welcome suggestions. Please don't hesitate to let me know if you
> > need any other information.
> > Have a nice day!
> > Roberto
> > _______________________________________________
> > Wien mailing list
> > Wien at zeus.theochem.tuwien.ac.at
> > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
> --
>
> P.Blaha
> --------------------------------------------------------------------------
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
> Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
> --------------------------------------------------------------------------
>
>