[Wien] lapw2 parallel crashed

Yongsheng Zhang zhang at fhi-berlin.mpg.de
Fri Jan 11 10:03:22 CET 2008


Additional information:

The parallel lapw2 only crashes when data transfer between nodes, or the
connection between nodes, is involved.
There is no problem when I run the job on the local node, in parallel on its 2 CPUs.
But if I use the local node as the main node (the one that runs lapw0)
and use one of the other nodes for the parallel part, the "lapw2 parallel crashed"
error occurs. I am not sure whether I have described my problem clearly, so here is an
example: there are two nodes, called n01 and n02 (each node has 2
CPUs). When I log in to n01 and directly use its 2 CPUs for the
parallel run, with the .machines file
granularity:1
1: n01
1: n01
everything is fine.

But if I stay on n01 and use n02 for the parallel run, so that
my .machines file changes to

granularity:1
1: n02
1: n02

Then the parallel lapw2 crashes after lapw1.
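
For reference, the two .machines files above differ only in the host name.
A minimal Python sketch that reproduces this layout is given below. The
helper name machines_file and its parameters are hypothetical and not part
of WIEN2k; it only assumes the format shown above (a granularity line
followed by one "weight: host" line per CPU).

```python
def machines_file(hosts, granularity=1, weight=1):
    """Return .machines text with one k-point-parallel line per entry in hosts.

    Hypothetical helper for illustration only: it reproduces the layout
    quoted in this message and knows nothing else about the file format.
    """
    lines = [f"granularity:{granularity}"]
    lines += [f"{weight}: {host}" for host in hosts]
    return "\n".join(lines) + "\n"


# Both CPUs on the local node n01 (the case that runs fine).
local = machines_file(["n01", "n01"])

# Both CPUs on the remote node n02 (the case where lapw2 crashes).
remote = machines_file(["n02", "n02"])

print(remote, end="")
```

Writing the returned string to .machines in the case directory would give
the same two configurations that are compared in this message.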

Thank you very much
Zhang

zhang at fhi-berlin.mpg.de wrote:
> Dear all,
>
> The latest WIEN2k version (8.1) compiled successfully on our IBM Linux
> cluster, which uses Intel f95i version 9.0 as the Fortran compiler, cc as
> the C compiler, and the MKL 9.0 libraries.
>
> Small jobs such as bulk systems run without problems on a single
> CPU or k-point parallel on several nodes (2 CPUs on each node). Large
> systems, however, only run without problems on a single CPU or in
> parallel on the 2 CPUs of one node. If the k-point parallel run spans
> more than 1 node, lapw1 finishes successfully in parallel, but lapw2 then
> crashes with the following output (example of a k-point parallel run on 2
> nodes, 4 CPUs):
>
>  LAPW0 END
>  LAPW1 END
>  LAPW1 END
>  LAPW1 END
>  LAPW1 END
> LAPW2 - FERMI; weighs written
> Segmentation fault
> Segmentation fault
>  LAPW2 END
>  LAPW2 END
> cp: cannot stat `.in.tmp': No such file or directory
> rm: cannot remove `.in.tmp': No such file or directory
> rm: cannot remove `.in.tmp1': No such file or directory
>
>
>>   stop error
>
> Of the lapw2 output files, case.scf2_1 and case.scf2_2 are complete,
> but case.scf2_3 and case.scf2_4 contain only one line. The lapw2.error
> file says,
> **  testerror: Error in Parallel LAPW2
>
> And in the dayfile, it says,
> **  LAPW2 crashed!
> 0.473u 0.412s 0:15.32 5.7%      0+0k 0+0io 0pf+0w
> error: command   /batch/mfh/yzhang/wien-08-t/lapw2para lapw2.def   failed
>
> On the same machine, my old WIEN2k version runs without any problem, so I
> am wondering whether something is wrong in the new version's
> lapw2para_lapw.
>
> BTW: I am sure my .machines file is correct and uses the "real" machine names.
>
> Thanks,
> Zhang


-- 
---------------------------------------------------------------------
Address:  Fritz-Haber-Institut, Abt. Theorie 
          Faradayweg 4-6 D-14195 Berlin (Germany)           
Phone:    +49 30 8413 4818
Fax:      +49 30 8413 4701
Email:    zhang at fhi-berlin.mpg.de 
---------------------------------------------------------------------