[Wien] problems in wien2k 10 run

Laurence Marks L-marks at northwestern.edu
Thu May 19 18:22:18 CEST 2011


For GPFS (and many NFS setups) '-assu buff' is needed; in fact I
persuaded the people here to make it the default via the compiler
environment variable, as many other clusters do. Is this GPFS from
IBM? If so, there is in fact a "bug" in it which Wien2k triggers by
writing to the same file from different cores/CPUs. This is fixed in
Wien2k 11, but maybe not in the version you have. Compare errflg.f in
lapw[0-2]; it should be something like

      if(myid.eq.0) then
         OPEN (99,FILE=FNAME,ERR=900)
         WRITE (99,9000) MSG
!         CLOSE (99)
      endif

in the recent versions.

There is also a potential bug in how you are running mpi jobs: Wien2k
uses ssh by default, torque does not know to kill it, and many
versions of ssh do not propagate kill signals to their children.
There is a discussion of this in a thread I started at
http://www.open-mpi.org/community/lists/users/2011/04/16085.php . I
have some meetings, but if this is similar to what you are seeing,
contact me offline as there is a modification to parallel_options
that seems to work.
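Without reproducing the actual parallel_options change, the underlying idea is to trap the queue system's TERM signal in the launching script and forward it to the child explicitly, rather than relying on ssh to do so. A minimal local sketch of that trap-and-forward pattern (using `sleep` as a stand-in for the real `ssh node command`; this is not the Wien2k modification itself):

```shell
#!/bin/sh
# Sketch: forward TERM to a child process, the same idea a
# queue-friendly ssh wrapper would use. `sleep` stands in for
# something like: ssh node "lapw1 ..." &
sleep 300 &
child=$!

# When torque sends TERM to this script, pass it on to the child
# so it does not linger after the job is deleted.
trap 'kill -TERM "$child" 2>/dev/null' TERM INT

# Send ourselves TERM to demonstrate the trap, then reap the child.
kill -TERM $$ &
wait "$child" 2>/dev/null
echo "child reaped"
```

The key point is that the wrapper, not ssh, owns the responsibility of killing the remote work when the batch system tears the job down.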

2011/5/19 saed alazar <q_saed74 at yahoo.com>:
> The code works just fine on the student cluster where we have compiled it
> using the extra '-assu buff' flag to alleviate NFS problems. I think we have
> seen that the optimisation jobs work well there.
>
> There are still problems with the optimisation jobs running on the planck
> cluster though.
>
> It seems that there are 2 main problems:
>
> 1   One node appears to be doing nearly all the work while the others do
> little... as seen in the dayfile. However if we login to the job nodes while
> the job is running and run the top command, all cores seem to be using ~100%
> cpu, load is normal. Also 'cat /proc/meminfo' shows there is plenty of free
> memory (there should be as these nodes each have 32GB RAM).
> 2   After some time it becomes impossible to login to your home directory
> and I cannot even login to the job node from the console on the machine. I
> also cannot delete the job from the queues. This means I then have to turn
> the queuing system off (qterm -t quick), remove the job files from the jobs
> directory (rm -rf /usr/local/torque/server_priv/jobs/2627.planck.*) and then
> restart the queuing system (/usr/local/torque/sbin/pbs_server -t warm). I
> then still have to reboot the nodes that were involved with that job. This
> is a problem.
>
> The main difference between the planck cluster and the student cluster is
> that the planck cluster has a GPFS parallel file system and does not use NFS
> (well actually, GPFS uses something like NFS). The problems we were seeing
> on the student cluster disappeared when we recompiled with the extra '-assu
> buff' flag. I am recompiling wien2k on the planck cluster with this flag but
> it does not fix the problems.
>
> Other than that, both machines are running RHEL5.3 Operating System, and
> openmpi, fftw have been compiled the same way, as has wien2k.
>
>
>
>
> Thanks
>
> Said
>
> _______________________________________________
> Wien mailing list
> Wien at zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>
>



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Györgyi

