[Wien] lapw2 parallel crashed
Peter Blaha
pblaha at theochem.tuwien.ac.at
Fri Jan 11 10:52:02 CET 2008
You are on the right track to analyze your problem.
Did you define a $SCRATCH variable pointing to some local directory ?
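For example, on a tcsh setup something along these lines could go into ~/.cshrc
on every node (a sketch only; the local path /scratch/$USER is an assumption, use
whatever local disk actually exists on your nodes):

  setenv SCRATCH /scratch/$USER                # must be a directory local to each node
  if ( ! -d $SCRATCH ) mkdir -p $SCRATCH       # create it if it does not exist yet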
Otherwise try to verify and isolate the problem by running the steps
"by hand".
x lapw2 -p on either n01 or n02
or even:
ssh n02 "cd $PWD;time lapw2 lapw2_1.def"
Yongsheng Zhang wrote:
> Additional information:
>
> lapw2 parallel only crashes when data transfer or a connection between
> nodes is involved, i.e.:
> there is no problem when I run the job on the local node as a 2-CPU parallel run.
> But if I use the local node as the main node, which runs lapw0,
> and use one of the other nodes for the parallel run, the "lapw2 parallel crashed"
> error occurs. I don't know if I have described my problem clearly, so here is an
> example: there are two nodes, called n01 and n02 (each node has 2
> CPUs). When I log in to n01 and directly use its 2 CPUs to run
> in parallel, with the .machines file:
> granularity:1
> 1: n01
> 1: n01
> everything is fine.
>
> But if I stay on the n01 node and use n02 for the parallel run,
> my .machines file correspondingly changes to:
>
> granularity:1
> 1: n02
> 1: n02
>
> Then lapw2 parallel crashes after lapw1.
>
> Thank you very much
> Zhang
>
> zhang at fhi-berlin.mpg.de wrote:
>> Dear all,
>>
>> The latest WIEN2k version (8.1) compiles successfully on our IBM Linux
>> cluster, using the Intel Fortran compiler (f95i, version 9.0), cc as the C
>> compiler, and the MKL 9.0 libraries.
>>
>> For small jobs such as bulk systems, there is no problem running on a
>> single CPU or k-point parallel over several nodes (2 CPUs per node). For a
>> large system, however, it only works if I run on a single CPU or on the
>> 2 CPUs of a single node in parallel. If the k-point parallel run includes
>> more than one node, then after lapw1 parallel finishes successfully, lapw2
>> crashes with the following output (example of a k-parallel run on 2
>> nodes, 4 CPUs):
>>
>> LAPW0 END
>> LAPW1 END
>> LAPW1 END
>> LAPW1 END
>> LAPW1 END
>> LAPW2 - FERMI; weighs written
>> Segmentation fault
>> Segmentation fault
>> LAPW2 END
>> LAPW2 END
>> cp: cannot stat `.in.tmp': No such file or directory
>> rm: cannot remove `.in.tmp': No such file or directory
>> rm: cannot remove `.in.tmp1': No such file or directory
>>
>>
>>> stop error
>>>
>> Among the lapw2 output files, case.scf2_1(2) contain the complete
>> output, but case.scf2_3(4) have only one line each. The lapw2.error
>> file says:
>> ** testerror: Error in Parallel LAPW2
>>
>> And the dayfile says:
>> ** LAPW2 crashed!
>> 0.473u 0.412s 0:15.32 5.7% 0+0k 0+0io 0pf+0w
>> error: command /batch/mfh/yzhang/wien-08-t/lapw2para lapw2.def failed
>>
>> On the same machine, my old WIEN2k version runs without any problem, so I
>> am wondering whether something is wrong with the new version's
>> lapw2para_lapw.
>>
>> BTW: I am sure my .machines file is correct and uses the "real" machine names.
>>
>> Thanks,
>> Zhang
>
>
--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671 FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------