[Wien] problems in wien2k 10 run

Laurence Marks L-marks at northwestern.edu
Fri May 20 01:45:36 CEST 2011


I am going to post my solution to similar issues, since it may help others.

In parallel_options on a big cluster that uses GPFS I have:

setenv USE_REMOTE 1
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
# pull the processor count of the first (non-lapw0) node out of .machines
set a = `grep -e "1:" .machines | grep -v lapw0 | head -1 | cut -f 3 -d: | cut -c 1-2`
setenv MKL_NUM_THREADS $a
setenv OMP_NUM_THREADS $a
setenv MKL_DYNAMIC FALSE
if ( -e local_options ) source local_options
# run remote commands through mpirun instead of ssh (see pbsh below)
set remote = "/bin/csh $WIENROOT/pbsh"
set delay = 0.25
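
To make the grep/cut pipeline above concrete: assuming a typical mpi
line in .machines of the form "1:node042:8" (the node name is
hypothetical), the pipeline keeps the first such line, takes the third
colon-separated field, and truncates it to two characters:

echo "1:node042:8" | cut -f 3 -d: | cut -c 1-2
# prints 8, which then becomes MKL_NUM_THREADS and OMP_NUM_THREADS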

and $WIENROOT/pbsh is just the line

mpirun -x LD_LIBRARY_PATH -x PATH -np 1 --host $1 /bin/csh -c " $2 "
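
So when a Wien2k script would normally issue "ssh host command", the
$remote setting above turns it into something like (host and command
here are hypothetical)

/bin/csh $WIENROOT/pbsh node042 "cd /scratch/case; lapw1 lapw1.def"

and mpirun then launches that command as a one-process job on the given
host, under torque's control rather than as a stray ssh child.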

This combination does several things:
a) It sets up the MKL threading environment correctly for the first
node, since with qsub/msub etc. you do not know in advance exactly what
allocation you will get. (I don't know how to do this for other
setups; maybe someone else does. See the sketch after this list.)
b) It runs everything through mpirun rather than ssh, which works
better with torque/openmpi and avoids the orphan processes that ssh
can leave behind when jobs go wrong.
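
A sketch of the idea mentioned in a), untested and assuming torque
(where $PBS_NODEFILE lists one line per allocated core), that derives
the thread count from the batch system instead of from .machines:

# count the cores torque allocated on the first node of the job
set node1 = `head -1 $PBS_NODEFILE`
set a = `grep -cx $node1 $PBS_NODEFILE`
setenv MKL_NUM_THREADS $a
setenv OMP_NUM_THREADS $a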

Caveat: Please don't start tweaking these things unless you really
know what you are doing!

N.B., Wien2k is one of the best OS/hardware fault finders that I know.
I have found problems on several different clusters, which left the
sysadmins slightly red-faced....

On Thu, May 19, 2011 at 11:22 AM, Laurence Marks
<L-marks at northwestern.edu> wrote:
> For GPFS (and many NFS setups) '-assu buff' is needed, and in fact I
> persuaded the people here to make it the default via the corresponding
> environment variable, as many other clusters do. Is this GPFS from IBM?
> If so, there is in fact a "bug" in it which Wien2k triggers by writing
> to the same file from different cores/CPUs. This is fixed in version 11,
> but maybe not in the version you have. Compare errflg.f in lapw[0-2];
> it should be something like
>
>      if(myid.eq.0) then
>         OPEN (99,FILE=FNAME,ERR=900)
>         WRITE (99,9000) MSG
> !         CLOSE (99)
>      endif
>
> in the recent versions.
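>
> A quick way to check the sources you actually compiled (assuming the
> usual $WIENROOT/SRC_lapw0, SRC_lapw1, SRC_lapw2 layout):
>
> grep -n -A3 "myid.eq.0" $WIENROOT/SRC_lapw[0-2]/errflg.f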
>
> There is also a potential problem in how you are running mpi jobs:
> Wien2k uses ssh by default, torque does not know to kill these ssh
> processes, and many versions of ssh do not propagate kill signals to
> their children. There is a discussion of this that I started at
> http://www.open-mpi.org/community/lists/users/2011/04/16085.php . I
> have some meetings, but if this sounds similar to what you have, contact
> me offline, as there is a modification to parallel_options that seems
> to work.
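>
> A hedged way to check whether orphans are the problem (the node name
> here is hypothetical): after torque thinks a job is dead, look for
> leftover processes on one of its compute nodes, e.g.
>
> ssh node042 'ps -u $USER -o pid,ppid,args | egrep "lapw|mpirun"'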
>
> 2011/5/19 saed alazar <q_saed74 at yahoo.com>:
>> The code works just fine on the student cluster where we have compiled it
>> using the extra '-assu buff' flag to alleviate NFS problems. I think we have
>> seen that the optimisation jobs work well there.
>>
>> There are still problems with the optimisation jobs running on the planck
>> cluster though.
>>
>> It seems that there are two main problems:
>>
>> 1. One node appears to be doing nearly all the work while the others do
>> little, as seen in the dayfile. However, if we log in to the job nodes
>> while the job is running and run top, all cores seem to be using ~100%
>> cpu and the load is normal. Also, 'cat /proc/meminfo' shows there is
>> plenty of free memory (there should be, as these nodes each have 32GB
>> RAM).
>> 2. After some time it becomes impossible to log in to your home
>> directory, and I cannot even log in to the job node from the console on
>> the machine. I also cannot delete the job from the queues. This means I
>> then have to turn the queuing system off (qterm -t quick), remove the
>> job files from the jobs directory
>> (rm -rf /usr/local/torque/server_priv/jobs/2627.planck.*), and then
>> restart the queuing system (/usr/local/torque/sbin/pbs_server -t warm);
>> the sequence is condensed below. I then still have to reboot the nodes
>> that were involved with that job. This is a problem.
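>>
>> Condensed, the recovery sequence is:
>>
>> qterm -t quick
>> rm -rf /usr/local/torque/server_priv/jobs/2627.planck.*
>> /usr/local/torque/sbin/pbs_server -t warm
>> # then reboot every node that was involved with the job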
>>
>> The main difference between the planck cluster and the student cluster
>> is that the planck cluster has a GPFS parallel file system and does not
>> use NFS (well, actually, GPFS uses something like NFS). The problems we
>> were seeing on the student cluster disappeared when we recompiled with
>> the extra '-assu buff' flag. I am recompiling wien2k on the planck
>> cluster with this flag, but it does not fix the problems.
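>>
>> (For what it's worth, '-assu buff' abbreviates ifort's '-assume
>> buffered_io'; assuming the Intel runtime, the same buffering can
>> usually be enabled without recompiling via
>>
>> setenv FORT_BUFFERED true
>>
>> which may be the environment-variable default mentioned earlier.)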
>>
>> Other than that, both machines are running the RHEL5.3 operating
>> system, and openmpi and fftw have been compiled the same way, as has
>> wien2k.
>>
>> Thanks
>>
>> Said



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Györgyi

