[Wien] A trick for mpi debugging

Laurence Marks L-marks at northwestern.edu
Wed Aug 21 21:11:03 CEST 2013


What you write makes complete sense. Too many times now I have seen
sys_admins set expectations for what users will do, and one then has to
fight around them. One of the worst cases was on a cluster, which will
remain nameless, where it was decided that users should not be allowed to
change their resource limits (the ulimit command) because only small
files, small stack sizes etc. would ever be needed. It broke Wien2k
completely.
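
For anyone wanting to check what their own cluster allows, here is a
minimal sketch. It assumes a bash-like login shell, and the function name
report_limits is only for illustration. It prints the soft and hard limits
for stack and file size, then tries to raise the soft stack limit to the
hard limit:

```shell
#!/bin/bash
# Print the soft and hard limits for stack size (-s) and file size (-f).
report_limits() {
    local flag
    for flag in s f; do
        printf 'ulimit -%s soft=%s hard=%s\n' \
            "$flag" "$(ulimit -S -"$flag")" "$(ulimit -H -"$flag")"
    done
}

report_limits

# Raising the soft limit up to the hard limit normally needs no privileges.
# If even this is refused, the restriction was imposed by the administrators.
if ! ulimit -S -s "$(ulimit -H -s)" 2>/dev/null; then
    echo "cannot raise the soft stack limit on this system" >&2
fi
```

If the hard limit itself is small, nothing in a user script can change it
and the sys_admins have to be asked.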

On Wed, Aug 21, 2013 at 1:32 PM, Luis Ogando <lcodacal at gmail.com> wrote:
> Dear Prof. Marks,
>
>    First of all, thank you very much for your help !
>    Unfortunately, your suggestions did not work on my SGI system. Despite
> this, I now have WIEN2k working in parallel even when more than one node is
> used. My solution was to install OpenMPI with ifort and icc on the SGI
> machine and use them to compile and run WIEN2k.
>    We saw that mpiexec-mpt does not allow the use of a "machinefile" built
> by the user (at least, this can not be done by a beginner like me). As the
> Intel MPI is installed by the vendor (SGI team), I believe that it is
> somehow configured in a similar way. As a result, when I tried the
> compilation and execution with Intel MPI, I got some error messages
> complaining about the -machinefile option. When I tried your suggestion of
> compiling with Intel MPI but using the hopen file to launch the job with
> OpenMPI, the error messages complained about the -bootstrap-exec option.
>    Well, it looks like the best option is to use compilers and MPI
> software not optimized for a specific system by others.
>    Thank you again !
>    All the best,
>                    Luis
> PS: in the parallel_options file, I had to set the complete path for the
> OpenMPI mpirun, despite defining it in my .bashrc
>
>
> 2013/8/3 Laurence Marks <L-marks at northwestern.edu>
>>
>> I am not sure if I can give you the right answer; my guess is to have
>> it as 1, but I do not know all the details of your system and, if I
>> remember right, you have an SGI system. Try both, then let us/me know
>> what works (or does not).
>>
>> For reference, I have it working fine with USE_REMOTE 1, and I don't
>> currently want to change to test (particularly as I am on travel).
>>
>> On Fri, Aug 2, 2013 at 8:36 AM, Luis Ogando <lcodacal at gmail.com> wrote:
>> > Dear Prof. Marks,
>> >
>> >    Just a quick question: in case the openmpi launcher replaces ssh,
>> > should I change USE_REMOTE to 0 on a cluster ?
>> >    Thank you one more time,
>> >                 Luis
>> >
>> >
>> >
>> > 2013/7/27 Laurence Marks <L-marks at northwestern.edu>
>> >>
>> >> WARNING 1: To be used with care, and customized as needed
>> >> WARNING 2: Valid for impi and perhaps other, but not all variants
>> >> WARNING 3: Please look at what these options mean...
>> >>
>> >> My parallel_options file with NU's supercomputer, which contains
>> >> various debug and other options (some recommended by Intel, some by
>> >> the local sys_admin):
>> >>
>> >> setenv USE_REMOTE 1
>> >> setenv MPI_REMOTE 0
>> >> setenv WIEN_GRANULARITY 1
>> >> setenv DAPL_DBG_TYPE 0
>> >> # Normal
>> >> #setenv WIEN_MPIRUN "mpirun -n _NP_ -machinefile _HOSTS_ _EXEC_ "
>> >>
>> >> # To turn on verbose
>> >> #setenv WIEN_MPIRUN "mpirun -bootstrap-exec ~/bin/hssh -n _NP_ -machinefile _HOSTS_ _EXEC_ "
>> >>
>> >> # To use more recent, privately compiled ssh
>> >> #setenv WIEN_MPIRUN "mpirun -bootstrap-exec $HOME/local/bin/ssh -n _NP_ -machinefile _HOSTS_ _EXEC_ "
>> >>
>> >> # To use openmpi to launch
>> >> setenv WIEN_MPIRUN "mpirun -bootstrap-exec $WIENROOT/hopen -n _NP_ -machinefile _HOSTS_ _EXEC_ "
>> >>
>> >> set sleepy = 0.2
>> >> set delay = 0.1
>> >> unset DAPL_DBG
>> >> #Turn on Hydra debug on Quest
>> >> #setenv I_MPI_HYDRA_DEBUG 1
>> >> #Turn on MPI DEBUG
>> >> #setenv I_MPI_DEBUG 1
>> >> #setenv I_MPI_DEBUG_OUTPUT mpi_debug%h_%r
>> >> setenv I_MPI_FABRICS_LIST dapl,tcp
>> >> setenv I_MPI_FALLBACK enable
>> >>
>> >>
>> >>
>> >>
>> >> On Sat, Jul 27, 2013 at 2:53 PM, Luis Ogando <lcodacal at gmail.com> wrote:
>> >> > Dear Prof. Marks,
>> >> >
>> >> >    Could you, please, send me a template for the parallel_options file
>> >> > where this implementation was done ?
>> >> >    I am sorry for that, but I am really far from being an expert.
>> >> >    All the best,
>> >> >                     Luis
>> >> >
>> >> >
>> >> > 2013/7/22 Laurence Marks <L-marks at northwestern.edu>
>> >> >>
>> >> >> A brief followup which may be useful (or not) for others in the future
>> >> >> with mpi problems. I have been able to work around a mysterious
>> >> >> impi/ssh bug on NU's supercomputer by replacing ssh by the
>> >> >> openmpi/mpirun launcher. The hack is gross, but very stable.
>> >> >>
>> >> >> Steps:
>> >> >> 1) Add "--bootstrap-exec=$WIENROOT/hopen" to the mpirun line in
>> >> >> $WIENROOT/parallel_options.
>> >> >> 2) Create the executable file $WIENROOT/hopen containing
>> >> >> #!/bin/bash
>> >> >> a=`echo $@ | sed -e 's/-x -q//'`
>> >> >> $OPENMPI/bin/mpirun -np 1 --host $a
>> >> >>
>> >> >> (change $OPENMPI to where it has been compiled).
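
Before relying on hopen it may be worth confirming that $OPENMPI points at
a real installation. A small sketch of such a check follows;
find_openmpi_mpirun is a made-up helper name, not part of WIEN2k or
OpenMPI, and the bin/mpirun layout is taken from the script above:

```shell
#!/bin/sh
# Print the launcher path if <prefix>/bin/mpirun is an executable file;
# otherwise explain the problem on stderr and return non-zero.
find_openmpi_mpirun() {
    prefix=$1
    if [ -z "$prefix" ]; then
        echo "OPENMPI is not set" >&2
        return 1
    fi
    if [ ! -f "$prefix/bin/mpirun" ] || [ ! -x "$prefix/bin/mpirun" ]; then
        echo "no executable mpirun under $prefix/bin" >&2
        return 1
    fi
    printf '%s\n' "$prefix/bin/mpirun"
}

# Report what hopen would call, without launching anything.
find_openmpi_mpirun "${OPENMPI:-}" || echo "fix OPENMPI before using hopen" >&2
```

Running this once on a compute node, and not only on the login node, shows
whether the same path is visible where the job actually starts.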
>> >> >>
>> >> >> On Thu, Jul 18, 2013 at 10:38 AM, Laurence Marks
>> >> >> <L-marks at northwestern.edu> wrote:
>> >> >> > On a cluster I am using I am having a problem with ssh connections as
>> >> >> > part of impi/mpirun about 0.1-0.2% of the time; what happens is that
>> >> >> > they fail to launch and become zombies (ps shows "[ssh] <defunct>").
>> >> >> > Since fiddling through all the options within mpirun can be hard
>> >> >> > (particularly for impi which is rather fast), I found (after a comment
>> >> >> > from someone on the openssh list) a useful hack. I am providing it
>> >> >> > here as it is a nice way around things, and might be useful to others
>> >> >> > in the future.
>> >> >> >
>> >> >> > The "trick" is to add --bootstrap-exec ~/bin/hssh or similar to the
>> >> >> > mpirun line in $WIENROOT/parallel_options, then create the executable
>> >> >> > ~/bin/hssh with something similar to:
>> >> >> >
>> >> >> > #!/bin/bash
>> >> >> > a=`echo $@ | sed -e 's/-q/-v/'`
>> >> >> > ssh $a
>> >> >> >
>> >> >> >
>> >> >> > The above allows me to turn verbose output on in the ssh command since
>> >> >> > impi insists on setting -q (quiet). For other cases something similar
>> >> >> > can be done.
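
The same idea can be sketched argument by argument instead of going through
echo and sed, which keeps the original word boundaries. This is only a
sketch: hssh_main and HSSH_LAUNCHER are names invented here, and the
argument list is assumed to look like what impi passes to ssh (something
like "-x -q <host> <command>"):

```shell
#!/bin/sh
# Replace every exact "-q" argument with "-v", leave all other arguments
# untouched, then hand the result to ssh (or to $HSSH_LAUNCHER if set).
hssh_main() {
    n=$#
    for arg in "$@"; do
        if [ "$arg" = "-q" ]; then
            arg="-v"
        fi
        # Append the (possibly rewritten) argument after the originals.
        set -- "$@" "$arg"
    done
    # Drop the n original arguments, keeping only the rewritten copies.
    shift "$n"
    "${HSSH_LAUNCHER:-ssh}" "$@"
}

# Only launch when this file is installed and invoked as "hssh".
if [ "${0##*/}" = "hssh" ]; then
    hssh_main "$@"
fi
```

Setting HSSH_LAUNCHER to something harmless such as echo is a quick way to
see the rewritten command line without opening a connection.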
>> >> >> >
>> >> >> > --
>> >> >> > Professor Laurence Marks
>> >> >> > Department of Materials Science and Engineering
>> >> >> > Northwestern University
>> >> >> > www.numis.northwestern.edu 1-847-491-3996
>> >> >> > "Research is to see what everybody else has seen, and to think what
>> >> >> > nobody else has thought"
>> >> >> > Albert Szent-Gyorgyi
>> >> >>
>> >> >> _______________________________________________
>> >> >> Wien mailing list
>> >> >> Wien at zeus.theochem.tuwien.ac.at
>> >> >> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> >> >> SEARCH the MAILING-LIST at:
>> >> >>
>> >> >> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>> >> >
>> >> >
>> >>
>> >
>> >
>>
>
>



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgyi

