[Wien] commlib error

Peter Blaha pblaha at theochem.tuwien.ac.at
Thu Jul 9 09:42:48 CEST 2015


The "commlib" error is certainly a system error: the communication
between the nodes is broken somehow.

From WIEN2k you got the error that the sumpara step (after lapw2)
could not find the file Pr-Af.scf2up_31

So the first question to ask yourself is: do I have this file, and
is it OK?

ls -alsrp *scf2up_*

You should find many of these files (as many as k-parallel jobs are 
submitted) and ALL of them should have a reasonable length (at least 
non-zero).
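
If there are too many files to check by eye, the check above can be
scripted. This is only a rough sketch in plain sh (not part of WIEN2k);
the function name check_scf2 is made up for illustration, and it assumes
a find that supports -maxdepth (GNU and BSD find do).

```shell
#!/bin/sh
# Count the per-job scf2 files in a directory and flag the empty ones.
# Usage: check_scf2 [DIR] [PATTERN]   (defaults: current dir, '*scf2up_*')
check_scf2() {
    dir=${1:-.}
    pattern=${2:-'*scf2up_*'}
    # Count all matching regular files directly inside DIR.
    total=$(find "$dir" -maxdepth 1 -type f -name "$pattern" | wc -l)
    # Print the zero-length ones; these point to a failed or unsynced write.
    find "$dir" -maxdepth 1 -type f -name "$pattern" -size 0 | sort |
        while read -r f; do echo "EMPTY: $f"; done
    # $((...)) strips any padding that wc -l adds on some systems.
    echo "total: $((total))"
}
```

Run "check_scf2 ." in the case directory right after a crash. The total
should match the number of k-parallel jobs, and no EMPTY line should
appear.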

My suspicion is that the network filesystem on your system is a bit
slow in updating the files on different nodes, and therefore the errors
occur randomly after a few iterations.

You did not say how you parallelize or what the CPU time is, but here are a few tips:

- reduce the number of k-point parallel jobs (I hope you did NOT
distribute the 200 k-points onto 200 cores!). Depending on the matrix
size, you may try some (higher) MPI parallelism instead.

- make sure you are using a local "SCRATCH" directory to reduce network 
load (AND a compatible k-parallelism, i.e. (num-kpt / n-core) must be an 
integer)

- increase the "sleep" times in $WIENROOT/lapw2para (and maybe
lapw1para) from the defaults to larger values, for example:
setenv DELAY   0.5              # delay launching of processes by n seconds
setenv SLEEPY  4                # additional sleep before checking
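
To pick a "compatible" k-parallelism as mentioned in the second tip, you
can list the job counts that divide the number of k-points without a
remainder. The helper below is just an illustration in plain sh; the
name even_splits is hypothetical, and NKPT should be the number of
k-points actually present in your k-list (200 is used below only as an
example).

```shell
#!/bin/sh
# Print every job count from 1 to MAXJOBS that divides NKPT evenly.
# Usage: even_splits NKPT MAXJOBS
even_splits() {
    nkpt=$1
    maxjobs=$2
    n=1
    while [ "$n" -le "$maxjobs" ] && [ "$n" -le "$nkpt" ]; do
        # Keep n only if (nkpt / n) is an integer.
        if [ $((nkpt % n)) -eq 0 ]; then
            echo "$n jobs -> $((nkpt / n)) k-points each"
        fi
        n=$((n + 1))
    done
}
```

For example, "even_splits 200 10" lists 1, 2, 4, 5, 8 and 10 jobs, so
with about 8 cores available one would use 8 jobs with 25 k-points each
rather than 6 or 7.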



On 07/09/2015 07:51 AM, Imran Khan wrote:
> Dear wien2k experts and users,
> I am using wien2k version 14.2 on a queuing system (SGE), with intel
> compiler 11.1, MPI libraries mpi/openmpi-1.6.3 and math libraries
> fftw-3.3.4. With these options I install Wien2K without any compile time
> error.
> The purpose of my calculation is to find the stable site for different
> substituents in NdFeB intermetallics.
> I am running the case.struct given in the attachment, using 200 (6 6 4)
> k-points. My RKmax value is 7 and Gmax is 12, and I am using LDA+U method.
> I am using the following command  runsp_lapw -p -orb -i 80 -ec 0.0001
> -cc 0.001
> Every time I submit my job after few scf cycles the job is terminated
> with the following error in the error tag file.
>
> error: commlib error: got select error (Connection reset by peer)
> error: executing task of job 2424636 failed: failed sending task to
> execd at tachyon1478: can't find connection
>      .
>      .
>      .
>   LAPW2 END
>   LAPW2 END
>   LAPW2 END
>   LAPW2 END
> real    0m53.638s
> forrtl: No such file or directory
> forrtl: severe (29): file not found, unit 21, file
> /home01/x1030imr/khan/Wien2K/Neomagnet/Pr-doped/f-site/AFM/Pr-Af/Pr-Af.scf2up_31
> Image              PC                Routine            Line        Source
> sumpara            00000000004A671D  Unknown               Unknown  Unknown
> sumpara            00000000004A5225  Unknown               Unknown  Unknown
> sumpara            0000000000456259  Unknown               Unknown  Unknown
> sumpara            0000000000416A5A  Unknown               Unknown  Unknown
> sumpara            0000000000416250  Unknown               Unknown  Unknown
> sumpara            0000000000421E3D  Unknown               Unknown  Unknown
> sumpara            0000000000410771  scfsum_                   126  scfsum.f
> sumpara            000000000040EE82  MAIN__                    219  sumpara.f
> sumpara            00000000004033DC  Unknown               Unknown  Unknown
> libc.so.6          00000035AA81D974  Unknown               Unknown  Unknown
> sumpara            00000000004032E9  Unknown               Unknown  Unknown
> cp: cannot stat `.in.tmp': No such file or directory
>
> I have discussed this error with the engineers of that queuing system
> (tachyon), and I have searched the mailing list as well but could not
> find any solutions.
> your guidance to solve this issue will be greatly appreciated.
> Imran

-- 

                                       P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------

