[Wien] Network problem caused by lapw1?

Oleg Rubel rubelo at tbh.net
Thu Dec 24 00:43:06 CET 2009


Thank you very much for the hint.

I found 256 *storeHinv* files of 221MB each in my $SCRATCH directory. My mistake was to use an NFS-mounted directory on the head node as $SCRATCH. I have now changed it to a local directory.
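
For anyone checking their own setup: 256 files of 221MB each is roughly 56GB in total, all of it going over NFS in this case. A minimal Python sketch for counting and summing such files is shown below. The function name storehinv_usage is hypothetical, and the glob pattern *storeHinv* is simply the one used above, not an official WIEN2k naming rule.

```python
import glob
import os


def storehinv_usage(scratch_dir):
    """Return (file count, total bytes) of *storeHinv* files in scratch_dir."""
    # Pattern taken from the message above; the exact file suffix is not assumed.
    paths = glob.glob(os.path.join(scratch_dir, "*storeHinv*"))
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return len(paths), total_bytes


if __name__ == "__main__":
    # Assumption: SCRATCH is set in the environment; fall back to the current directory.
    scratch = os.environ.get("SCRATCH", ".")
    count, total = storehinv_usage(scratch)
    print(f"{count} storeHinv files, {total / 1e6:.0f} MB in {scratch}")
```

Running this on a compute node shows how much data lapw1 has placed in the directory that $SCRATCH points to.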

Thank you once again,

Oleg

>>> Peter Blaha <pblaha at theochem.tuwien.ac.at> 12/23/09 2:17 AM >>>
The new iterative diagonalization creates files called case.storeHinv.., where the inverse of H is
stored (one triangle of the matrix in single precision).

These files can be quite large (e.g. for matrix size 30000 the total size of all Hinv files, one per processor,
is 3600MB or 7200MB for real/complex cases), but on a balanced cluster they should be written/read in
100-200 seconds. They are created only once (in the second scf cycle), but read in all subsequent
iterative scf cycles.

Please note that the method is usually so efficient that one can even run a minimization with -it0:
min -j "run_lapw -it0"; i.e. one does not need to create the files again!

As with the vector files, you can use the SCRATCH variable to direct these files to a local
scratch directory (e.g. with 100 processors, each processor reads/writes only 36MB!)
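
The figures above can be reproduced with a short Python sketch. It assumes the total size is n*n elements at 4 bytes (real) or 8 bytes (complex), split evenly over the processors. That formula is only a fit to the numbers quoted in this message (3600MB, 7200MB, 36MB), not a description of the actual file layout; a strictly triangular single-precision layout would give about half of it. The function name hinv_size_mb is hypothetical.

```python
def hinv_size_mb(matrix_size, complex_case=False, n_procs=1):
    """Estimate (total MB, MB per processor) of the Hinv files.

    Assumption: n*n elements at 4 bytes (real) or 8 bytes (complex),
    inferred from the sizes quoted in the message, 1 MB = 10**6 bytes.
    """
    bytes_per_element = 8 if complex_case else 4
    total_mb = matrix_size * matrix_size * bytes_per_element / 1e6
    return total_mb, total_mb / n_procs


if __name__ == "__main__":
    print(hinv_size_mb(30000, n_procs=100))                     # (3600.0, 36.0)
    print(hinv_size_mb(30000, complex_case=True, n_procs=100))  # (7200.0, 72.0)
```

With a local scratch directory each node only has to handle its own share, which is why the per-processor number, rather than the total, is the relevant one.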


> I observe the cluster network dying for about 10 minutes when performing a calculation for a relatively large case that involves 256 cores and InfiniBand. I use WIEN2k_09.2 (Release 29/9/2009) + ifort 11.0.074 + Intel MKL 10.1.0.015 + MVAPICH2 and iterative diagonalization. The network always dies at the end of the second scf iteration (most likely at the end of lapw1). This did not occur in WIEN2k_08.3 (Release 18/9/2008) for the same case and compiler settings. I know that the iterative diagonalization has undergone some major changes between these two versions.
> 
> This does not actually interrupt the calculations and there is no sign of any error, but it causes the SGE daemon to die on the compute nodes, with all the consequences.
> 
> Has anyone experienced a similar problem? What is different in the behaviour of lapw1 in the 2nd iteration that may cause the problem?
> 
> Thank you in advance and Happy Holidays.
> 
> Oleg Rubel
> 
> --
> Thunder Bay Regional Research Institute
> 290 Munro St, Thunder Bay, ON, P7A 7T1, Canada
> Homepage: http://www.tbrri.com/~orubel/
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien

-- 
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
-----------------------------------------


