[Wien] lapw2 parallel crashed
Yongsheng Zhang
zhang at fhi-berlin.mpg.de
Fri Jan 11 10:03:22 CET 2008
Additional information:
The lapw2 parallel crash occurs only when data has to be transferred
between nodes, i.e. when the parallel run involves a connection between nodes.
There is no problem when I run the job on the local node, in parallel on its 2 CPUs.
But if I use the local node as the main node, which runs lapw0,
and use one of the other nodes for the parallel part, the "lapw2 parallel crashed"
error occurs. I am not sure I have described my problem clearly, so here is an
example: There are two nodes, called n01 and n02 (each node has 2
CPUs). When I log in to the n01 node and use its 2 CPUs directly to run in
parallel, with the .machines file:
granularity:1
1: n01
1: n01
everything is fine.
But if I stay on the n01 node and use n02 to run in parallel,
my .machines file changes correspondingly to:
granularity:1
1: n02
1: n02
Then lapw2 parallel crashes after lapw1.
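
To make the difference between the two cases explicit, here is a minimal Python sketch that lists which hosts in a .machines file are not the local node. It is not part of WIEN2k. It assumes only the line format shown above ("weight: host"), treats "#" as a comment marker, and strips an optional ":N" suffix after a host name; the function names are made up for this illustration.

```python
import re


def parse_kpoint_hosts(text):
    """Return host names from k-point lines such as '1: n02'."""
    hosts = []
    for raw in text.splitlines():
        # Assumption: anything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        match = re.match(r"^(\d+)\s*:\s*(.+)$", line)
        if not match:
            continue  # skips 'granularity:1' and other keyword lines
        for entry in match.group(2).split():
            # Assumption: drop an optional ':N' suffix after the host name.
            hosts.append(entry.split(":", 1)[0])
    return hosts


def remote_hosts(text, local_host):
    """Hosts named in the file that differ from the node we are logged in to."""
    return sorted(set(parse_kpoint_hosts(text)) - {local_host})


working = "granularity:1\n1: n01\n1: n01\n"
failing = "granularity:1\n1: n02\n1: n02\n"

print(remote_hosts(working, "n01"))  # []
print(remote_hosts(failing, "n01"))  # ['n02']
```

In the working case no remote host is involved; in the failing case every k-point job runs on n02 while the run is started from n01.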
Thank you very much
Zhang
zhang at fhi-berlin.mpg.de wrote:
> Dear all,
>
> The latest WIEN2k version (8.1) compiled successfully on our IBM Linux
> cluster, which uses Intel f95i version 9.0 as the Fortran compiler, cc as
> the C compiler, and the MKL 9.0 libraries.
>
> For small jobs such as bulk systems, there is no problem running on a single
> CPU or k-point parallel on several nodes (2 CPUs on each node). For
> large systems, it only works if I run on a single CPU or in parallel on the
> 2 CPUs of one node. But if the k-point parallel run includes more
> than 1 node, then after lapw1 parallel finishes successfully, lapw2
> crashes with the following output (example of a k-point parallel run on 2
> nodes, 4 CPUs):
>
> LAPW0 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW1 END
> LAPW2 - FERMI; weighs written
> Segmentation fault
> Segmentation fault
> LAPW2 END
> LAPW2 END
> cp: cannot stat `.in.tmp': No such file or directory
> rm: cannot remove `.in.tmp': No such file or directory
> rm: cannot remove `.in.tmp1': No such file or directory
>
>
>> stop error
>>
> Of the lapw2 output files, case.scf2_1 and case.scf2_2 contain the complete
> output, but case.scf2_3 and case.scf2_4 each have only one line. The
> lapw2.error file says:
> ** testerror: Error in Parallel LAPW2
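
A quick way to see which parallel lapw2 jobs stopped early is to count the lines in each case.scf2_N file. The sketch below is only an illustration: the function names, the threshold of two lines, and the dummy file contents in the demo are assumptions, not WIEN2k behaviour.

```python
import glob
import os
import tempfile


def short_scf2_files(directory, case, min_lines=2):
    """Return (file name, line count) for scf2_N files with fewer than min_lines lines."""
    pattern = os.path.join(directory, case + ".scf2_*")
    short = []
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            count = sum(1 for _ in handle)
        if count < min_lines:
            short.append((os.path.basename(path), count))
    return short


def demo():
    """Build dummy files that mimic the situation described above and scan them."""
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder contents only: two "complete" files, two one-line files.
        for index, lines in ((1, 5), (2, 5), (3, 1), (4, 1)):
            name = os.path.join(tmp, "case.scf2_%d" % index)
            with open(name, "w") as handle:
                handle.write("placeholder line\n" * lines)
        return short_scf2_files(tmp, "case")


print(demo())  # [('case.scf2_3', 1), ('case.scf2_4', 1)]
```

Run in the real case directory (with the actual case name), this would point to the k-point jobs whose output was cut off.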
>
> And the dayfile says:
> ** LAPW2 crashed!
> 0.473u 0.412s 0:15.32 5.7% 0+0k 0+0io 0pf+0w
> error: command /batch/mfh/yzhang/wien-08-t/lapw2para lapw2.def failed
>
> On the same machine, my old WIEN2k version runs without any problem. So I
> am wondering whether there is something wrong in the new version's
> lapw2para_lapw?
>
> BTW: I am sure my .machines file is correct and uses the "real" machine names.
>
> Thanks,
> Zhang
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
--
---------------------------------------------------------------------
Address: Fritz-Haber-Institut, Abt. Theorie
Faradayweg 4-6 D-14195 Berlin (Germany)
Phone: +49 30 8413 4818
Fax: +49 30 8413 4701
Email: zhang at fhi-berlin.mpg.de
---------------------------------------------------------------------