[Wien] lapw1para error while running k-point parallel calculation

ROBERTO LUIS IGLESIAS PASTRANA roberto at uniovi.es
Fri Nov 28 10:19:53 CET 2008


Hello again!

I made some more attempts on my own. Again, the command
 titin@titin-desktop:~$ ssh titin-desktop;cd  ~/Programas/WIEN2k/titin/benchmark/test_case;lapw1 lapw1.def

logs me in. After a while I type
$exit
logout
Connection to titin-desktop closed
.......... and I leave it there, at the point where I previously thought it was just hanging, until it prints
LAPW1 END!!!!!!!!!!!!!!!!!!!!!!

Inside test_case the same files appear as in the serial test, except that :log is missing and test_case.vector, which was not there before, is now present. My SCRATCH is supposed to be /tmp, but it was not used here (I don't know why) and the vector file was kept in the working directory. I don't even know whether this was again a serial calculation.
Then I typed:

$grep HORB *output1*
test_case.output1:       TIME HAMILT (CPU)  =    13.3, HNS =     7.6, HORB =     0.0, DIAG =    28.2

In the normal serial case this was:
test_case.output1:       TIME HAMILT (CPU)  =    17.6, HNS =    12.7, HORB =     0.0, DIAG =    95.8

I compared the results with the files *output1_mkl8_* present in the test_case folder as well:

$tail -5 test_case.output1_mkl8_1proc
       NUMBER OF K-POINTS:           1
   ===> TOTAL CPU       TIME:    153.6 (INIT =      1.5 + K-POINTS =    152.1)
   > SUM OF WALL CLOCK TIMES:    161.3 (INIT =      1.9 + K-POINTS =    159.4)
      Maximum WALL clock time:    162.053758144379     
      Maximum CPU time:           153.720000000000 

$ tail -5 test_case.output1_mkl8_2proc 
       NUMBER OF K-POINTS:           1
   ===> TOTAL CPU       TIME:    118.5 (INIT =      1.5 + K-POINTS =    116.9)
   > SUM OF WALL CLOCK TIMES:    124.3 (INIT =      1.8 + K-POINTS =    122.5)
      Maximum WALL clock time:    124.882931947708     
      Maximum CPU time:           118.560000000000   

$ tail -5 test_case.output1
       NUMBER OF K-POINTS:           1
   ===> TOTAL CPU       TIME:     50.0 (INIT =      1.0 + K-POINTS =     49.0)
   > SUM OF WALL CLOCK TIMES:     50.4 (INIT =      1.1 + K-POINTS =     49.3)
      Maximum WALL clock time:    50.5337390899658     
      Maximum CPU time:           50.0800000000000 

Now it takes less than 1 minute and less than half the CPU time. 
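
To put numbers on that comparison, the total CPU times printed above can simply be divided. The snippet below is only a sketch: the three values are copied from the tail outputs above, and the variable names are made up for illustration. It gives ratios of about 3.07 and 2.37, i.e. this run used less than half the CPU time of either reference file.

```shell
#!/bin/sh
# Total CPU times (seconds) copied from the "TOTAL CPU TIME" lines above.
ref_1proc=153.6   # test_case.output1_mkl8_1proc
ref_2proc=118.5   # test_case.output1_mkl8_2proc
this_run=50.0     # test_case.output1 (this run)

# awk does the floating-point division; a ratio above 2 means
# "less than half the CPU time". LC_ALL=C keeps "." as decimal point.
ratio_1proc=$(LC_ALL=C awk -v a="$ref_1proc" -v b="$this_run" 'BEGIN { printf "%.2f", a / b }')
ratio_2proc=$(LC_ALL=C awk -v a="$ref_2proc" -v b="$this_run" 'BEGIN { printf "%.2f", a / b }')

echo "reference 1proc / this run = $ratio_1proc"
echo "reference 2proc / this run = $ratio_2proc"
```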

Therefore I decided to go back to bccFe in order to check the timings when more than one k-point is used.

I issued

$ssh titin-desktop

Normal log in

$cd ~/Programas/WIEN2k/titin/benchmark/bccFe
$runsp_lapw -i 200 -ec 0.00001 -cc 0.0001 -p

 LAPW0 END
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
cat: No match.

>   stop error

Here "orden no encontrada" is Spanish for "command not found".
Next I typed:
~$ ssh titin-desktop;cd ~/Programas/WIEN2k/titin/benchmark/bccFe_parallel/1_Thread/bccFe;runsp_lapw -i 200 -ec 0.00001 -cc 0.0001 -p
It logged in; I left it for about 5 minutes (the serial calculation took a little more than 3) and then typed
$exit
logout
Connection to titin-desktop closed.
 LAPW0 END
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
bash: lapw1: orden no encontrada
bash: fixerror_lapw: orden no encontrada
cat: No match.

>   stop error

AGAIN!

This seems to imply that the actual calculation takes place (wrongly) only after I log out.
Sorry if I am asking stupid questions, but I have never had access to a multi-CPU machine until now.
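
One possible reading of this, offered as a sketch rather than a diagnosis: in a line of the form "ssh host;cd dir;command", the semicolons are interpreted by the local shell, so cd and command only start once the ssh session has ended, and they run on the local side. Passing the commands to ssh as one quoted string makes them run on the remote side instead. The script below imitates this without any network access; "remote" is a hypothetical stand-in for "ssh host", and the WHERE variable exists only to label which shell ran the echo.

```shell
#!/bin/sh
# Hypothetical stand-in for "ssh host": with no arguments it behaves like an
# interactive login that the user leaves again with "exit"; with arguments it
# runs them in a separate shell, the way ssh runs a command string remotely.
remote() {
    if [ $# -eq 0 ]; then
        return 0
    fi
    WHERE=remote sh -c "$*"
}

WHERE=local
export WHERE

# Form typed in the session above: two separate commands for the LOCAL shell.
# The echo runs only after "remote" has returned, and it runs locally.
unquoted=$(remote; echo "ran on: $WHERE")

# Quoted form: the whole command string is handed to the other side.
quoted=$(remote 'echo "ran on: $WHERE"')

echo "unquoted -> $unquoted"
echo "quoted   -> $quoted"
```

With real ssh, the quoted form would look like: ssh titin-desktop 'cd ~/Programas/WIEN2k/titin/benchmark/test_case; lapw1 lapw1.def'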

Furthermore, for the test_case again:

~$ ssh titin-desktop;cd ~/Programas/WIEN2k/titin/benchmark/test_case;x lapw1 -p
logs in
$ exit
logout
Connection to titin-desktop closed.
starting parallel lapw1 at vie nov 28 10:15:57 CET 2008
->  starting parallel LAPW1 jobs at vie nov 28 10:15:57 CET 2008
running LAPW1 in parallel mode (using .machines)
4 number_of_parallel_jobs
[1] 26958
bash: lapw1c: orden no encontrada
bash: fixerror_lapw: orden no encontrada
[1]    Done                          ( ( $remote $machine[$p]  ...
     localhost(1) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
**  LAPW1 crashed!
cat: No match.
0.116u 0.128s 0:03.21 7.1%	0+0k 0+248io 0pf+0w
error: command   /home/titin/Programas/WIEN2k/lapw1cpara -c lapw1.def   failed

The same error again!! It seems something is wrong with parallelization or the -p switch.
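
A check that might help narrow this down, under the assumption that the shells started by the parallel scripts do not see the same environment as an interactive login: in "ssh host echo $WIENROOT" the variable is expanded by the local shell before ssh is even started, so the command prints the local value whatever the remote environment looks like. Single quotes defer the expansion to the other side. In the sketch below, "remote" is a hypothetical stand-in for "ssh host" that starts a fresh shell with a nearly empty environment; on the real machine one would use ssh itself.

```shell
#!/bin/sh
# Hypothetical stand-in for "ssh host <command>": a fresh non-interactive
# shell with a minimal environment, so the sketch needs no network access.
remote() {
    env -i PATH=/usr/bin:/bin sh -c "$*"
}

WIENROOT=/home/titin/Programas/WIEN2k
export WIENROOT

# Double quotes (or no quotes): $WIENROOT is expanded locally before the
# command is sent, so this always shows the local value.
local_expansion=$(remote "echo WIENROOT=$WIENROOT")

# Single quotes: the variable is expanded by the other shell, which here
# has no WIENROOT set at all.
remote_expansion=$(remote 'echo WIENROOT=$WIENROOT')

# Is lapw1c on the PATH that the non-interactive shell sees?
lookup=$(remote 'command -v lapw1c || echo "lapw1c not on PATH"')

echo "local expansion : $local_expansion"
echo "remote expansion: $remote_expansion"
echo "lookup          : $lookup"
```

On the actual machine the corresponding checks would be: ssh titin-desktop 'echo $WIENROOT' and ssh titin-desktop 'which lapw1c'.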

Thanks for your patience!!

Kind regards

Roberto


----- Original message -----
From: ROBERTO LUIS IGLESIAS PASTRANA <roberto at uniovi.es>
Date: Thursday, November 27, 2008 5:45 pm
Subject: Re: [Wien] lapw1para error while running k-point parallel calculation


> > It seems that you do not have a proper environment when doing the
> > ssh hostname ....
> > 
> > a) Are you sure the names "localhost" work properly ? Usually you 
> > should put there the correct hostname so that you can do
> >    ssh hostname echo $WIENROOT
> 
> > b) do you get the proper directory from the above command ? Your 
> > basic error seems to be:
> >    12778bash: lapw1c: command not found
> 
> Both
> ssh localhost echo $WIENROOT
> ssh titin-desktop echo $WIENROOT     (titin-desktop is the hostname)
> 
> print:
> /home/titin/Programas/WIEN2k
> 
> which is the proper WIENROOT.
> 
> 
> > c) can you do:
> >    ssh hostname
> >    cd case_dir     (where your files are)
> >    x lapw1
> 
> In both cases YES
> 
> titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/test_case$ x lapw1
>  LAPW1 END
> 127.187u 0.584s 2:08.12 99.7%	0+0k 10912+33256io 55pf+0w
> 
> > 
> > d) the parallel lapw1 works like the above, but does it at once:
> >    ssh hostname;cd $PWD;lapw1 lapw1.def
> 
> Do you mean just issuing all those commands exactly as you wrote? I 
> log in:
> 
> titin@titin-desktop:~$ ssh titin-desktop;cd ~/Programas/WIEN2k/titin/benchmark/test_case;lapw1 lapw1.def
> Linux titin-desktop 2.6.27-10-generic #1 SMP Fri Nov 21 19:19:18 UTC 2008 x86_64
> 
> The programs included with the Ubuntu system are free software;
> the exact distribution terms for each program are described in the
> individual files in /usr/share/doc/*/copyright.
> 
> Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
> applicable law.
> 
> To access official Ubuntu documentation, please visit:
> http://help.ubuntu.com/
> Last login: Thu Nov 27 14:28:16 2008 from titin-desktop
> 
> I left it there for about 3 hours and then logged out, but ssh hung.
> After Control+C I got:
> 
> forrtl: error (69): process interrupted (SIGINT)
> Image              PC                Routine            Line        Source
> lapw1              00000000004A5EB2  Unknown               Unknown  Unknown
> Stack trace terminated abnormally.
> 
> Thus it seems something was running, after all, but was stuck 
> somehow, since it was taking forever to run this simple lapw1 
> process for the test_case.
> 
> Thanks a lot for your input!
> 
> Roberto
> 
> > 
> > 
> > 
> > ROBERTO LUIS IGLESIAS PASTRANA wrote:
> > > Hello all!
> > > I'm trying to get k-point parallelism up and running on my
> > > computer, which has an Intel(R) Core(TM)2 Quad Q9300 @2.50GHz CPU
> > > and runs Ubuntu 8.10, using ifort 11.0.069, the mkl libraries
> > > 10.1.0.015 and the Wien2k_08.3 version. I tried it first with
> > > test-case from the benchmarking Wien2k web page. I wanted to do a
> > > benchmark such as the one in the thread starting from:
> > > http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011238.html
> > > I wrote the following .machines file for my 4 processors:
> > >
> > > granularity:1
> > > 1:localhost
> > > 1:localhost
> > > 1:localhost
> > > 1:localhost
> > > extrafine:1
> > >
> > > When running x lapw1 -p I get the following error:
> > > titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/test_case$ x lapw1 -p
> > > starting parallel lapw1 at jue nov 27 13:33:33 CET 2008
> > > ->  starting parallel LAPW1 jobs at jue nov 27 13:33:33 CET 2008
> > > running LAPW1 in parallel mode (using .machines)
> > > 4 number_of_parallel_jobs
> > > [1] 12778
> > > bash: lapw1c: command not found
> > > bash: fixerror_lapw: command not found
> > > [1]    Done                          ( ( $remote $machine[$p]  ...
> > >      localhost(1) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
> > > **  LAPW1 crashed!
> > > cat: No match.
> > > 0.100u 0.160s 0:02.97 8.7%	0+0k 0+248io 0pf+0w
> > > error: command   /home/titin/Programas/WIEN2k/lapw1cpara -c lapw1.def   failed
> > > Digging in the Wien2k ML files, I did not find any problem exactly
> > > like mine. There were some posts regarding the correct linking in
> > > the WIEN2k ROOT directory, therefore I checked:
> > > titin@titin-desktop:~/Programas/WIEN2k$ ls -alsp lapw1*
> > > 11596 -rwxr-xr-x 1 titin titin 11857076 2008-11-20 19:18 lapw1
> > > 11492 -rwxr-xr-x 1 titin titin 11747349 2008-11-20 19:18 lapw1c
> > >     0 lrwxrwxrwx 1 titin titin        9 2008-11-18 19:24 lapw1cpara -> lapw1para
> > >     0 lrwxrwxrwx 1 titin titin       14 2008-11-18 19:24 lapw1para -> lapw1para_lapw
> > >    20 -rwxr-xr-x 1 titin titin    16661 2008-11-18 19:24 lapw1para_lapw
> > > I think this means the links to the parallel versions are OK,
> > > doesn't it?
> > > I also thought the problem might be due to the fact that test_case
> > > has only one k-point in its *.klist file, as suggested by Peter in
> > > the above-mentioned thread
> > > http://zeus.theochem.tuwien.ac.at/pipermail/wien/2008-August/011266.html
> > > Then I decided to try a bccFe unit cell.
> > > The error was multiplied by 4 in this case:
> > > titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw0 -p
> > > starting parallel lapw0 at jue nov 27 13:11:34 CET 2008
> > > -------- .machine0 : processors
> > > running lapw0 in single mode
> > >  LAPW0 END
> > > 1.448u 0.108s 0:01.55 99.3%	0+0k 16+448io 0pf+0w
> > > titin@titin-desktop:~/Programas/WIEN2k/titin/benchmark/bccFe$ x lapw1 -p
> > > starting parallel lapw1 at jue nov 27 13:11:52 CET 2008
> > > ->  starting parallel LAPW1 jobs at jue nov 27 13:11:52 CET 2008
> > > running LAPW1 in parallel mode (using .machines)
> > > 4 number_of_parallel_jobs
> > > [1] 12297
> > > [2] 12317
> > > [3] 12337
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > [2]  - Done                          ( ( $remote $machine[$p]  ...
> > > [1]  - Done                          ( ( $remote $machine[$p]  ...
> > > [4] 12401
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > [4]  - Done                          ( ( $remote $machine[$p]  ...
> > > [3]  + Done                          ( ( $remote $machine[$p]  ...
> > > [1] 12466
> > > [2] 12486
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > [1]  - Done                          ( ( $remote $machine[$p]  ...
> > > bash: lapw1: command not found
> > > bash: fixerror_lapw: command not found
> > > [2]    Done                          ( ( $remote $machine[$p]  ...
> > >      localhost(62) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
> > >      localhost(62) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
> > >      localhost(62) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
> > >      localhost(62) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
> > >      localhost(1) 0.000u 0.000s 0.00 0.00%      0+0k 0+0io 0pf+0w
> > >      localhost(1) 0.004u 0.000s 0.00 400.00%      0+0k 0+0io 0pf+0w
> > > **  LAPW1 crashed!
> > > cat: No match.
> > > 0.276u 0.228s 0:10.02 4.8%	0+0k 128+992io 1pf+0w
> > > error: command   /home/titin/Programas/WIEN2k/lapw1para lapw1.def   failed
> > > Could this have something to do with communication between the
> > > four CPUs? I first thought it could be due to a passwordless ssh
> > > login failure, but issuing:
> > > titin@titin-desktop:~$ ssh titin-desktop
> > > Linux titin-desktop 2.6.27-10-generic #1 SMP Fri Nov 21 19:19:18 UTC 2008 x86_64
> > >
> > > The programs included with the Ubuntu system are free software;
> > > the exact distribution terms for each program are described in the
> > > individual files in /usr/share/doc/*/copyright.
> > >
> > > Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
> > > applicable law.
> > >
> > > To access official Ubuntu documentation, please visit:
> > > http://help.ubuntu.com/
> > > Last login: Thu Nov 27 13:07:11 2008 from localhost
> > > seems to get through correctly.
> > > Maybe I'm asking something rather trivial, but I can't find a
> > > solution. Does somebody have any idea? I would be very glad to
> > > receive suggestions. Please don't hesitate to let me know if you
> > > need any other information.
> > > Have a nice day!
> > > Roberto
> > > _______________________________________________
> > > Wien mailing list
> > > Wien at zeus.theochem.tuwien.ac.at
> > > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > -- 
> > 
> >                                       P.Blaha
> > --------------------------------------------------------------------------
> > Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> > Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
> > Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
> > --------------------------------------------------------------------------
> > 
