[Wien] Cannot run k-point parallel jobs - only serial.

Laurence Marks L-marks at northwestern.edu
Wed May 29 23:37:42 CEST 2013


I strongly suggest that you break things into pieces and test each one
separately using an interactive queue, which will make many things
simpler.
a) Is fftw compiled using the mvapich mpicc?
b) Did you try just mvapich (I never got mvapich2 to work right)?
c) Try x lapw0 -p interactively, with a hand-created .machines file,
for a simple system.
d) Have you followed the Intel linker suggestions?
e) Are you using the Intel compiler (recommended)?
f) Are you using a version of mvapich compiled with icc? It may not
matter, but it is safer.
g) Don't use one core per k-point; do something simpler, such as:
12:n0509
12:n0510
12:n0520
11:n0523
lapw0: n0509:8
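Rather than writing the layout in item g by hand for every job, it can be derived from the PBS node file. This is a minimal sketch, not a WIEN2k tool: build_machines is a hypothetical helper, and it assumes that the node file named by PBS_NODEFILE lists one hostname per allocated core, that each line's weight should simply be that node's core count (as in the 12/12/12/11 example), and that lapw0 gets 8 MPI processes on the first node, mirroring the lapw0: n0509:8 line.

```python
import os
from collections import Counter


def build_machines(nodefile_lines, lapw0_procs=8):
    """Return .machines text laid out like item g: one weighted line per node.

    Hypothetical helper, not part of WIEN2k. Assumptions:
      * nodefile_lines follows a PBS node file: one hostname per allocated core;
      * each line's weight equals the number of cores PBS gave that node;
      * lapw0 runs with `lapw0_procs` MPI processes on the first node.
    """
    hosts = [line.strip() for line in nodefile_lines if line.strip()]
    if not hosts:
        raise ValueError("node file lists no hosts")
    cores = Counter(hosts)
    # dict.fromkeys keeps first-seen order, so nodes stay in the order PBS listed them
    nodes = list(dict.fromkeys(hosts))
    lines = [f"{cores[node]}:{node}" for node in nodes]
    lines.append(f"lapw0: {nodes[0]}:{lapw0_procs}")
    return "\n".join(lines) + "\n"


def main():
    nodefile = os.environ.get("PBS_NODEFILE")
    if not nodefile:
        print("PBS_NODEFILE is not set; run this inside a PBS job")
        return
    with open(nodefile) as fh:
        text = build_machines(fh)
    with open(".machines", "w") as out:
        out.write(text)
    print(text, end="")


if __name__ == "__main__":
    main()
```

For a node file that lists n0509, n0510 and n0520 twelve times each and n0523 eleven times, build_machines returns exactly the five lines shown in item g. Writing one weighted line per node, instead of one line per core, follows the advice in item g not to run one core per k-point.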


On Wed, May 29, 2013 at 2:58 PM, Robert Nichol
<nichol.18 at buckeyemail.osu.edu> wrote:
> Dear community,
>
> I've been using WIEN2k for a little while now and have spent considerable
> time trying to get the code to work on the Ohio Supercomputer Center's
> (OSC) Oakley cluster, which uses a PBS queue system.
>
> If I submit the script for k-point parallelization, lapw2 crashes.
>
> If I submit the script for mpi parallelization, either I get an error in
> lapw0, or lapw0 hangs indefinitely.
>
> It is worth mentioning that OSC uses MVAPICH2, which is an OSU creation
> derived from MPICH.
>
> **************************************
> [osu6811 at oakley02 ti-c-24]$ module avail mvapich
>
> ---------------- /usr/local/share/lmodfiles/intel-12.1 ----------------
>   mvapich2/1.7 (default)      mvapich2/1.9              mvapich2/1.9a2-profiling
>   mvapich2/1.7-r5140          mvapich2/1.9-profiling    mvapich2/1.9b
>   mvapich2/1.7-r5140-hwloc    mvapich2/1.9a             mvapich2/1.9b-profiling
>   mvapich2/1.8-r5668          mvapich2/1.9a2            mvapich2/1.9rc1
>   mvapich2/1.8.1              mvapich2/1.9a2-dbg
>
> ---------------- /usr/local/share/lmodfiles/modulefiles ----------------
>   hpctoolkit/5.3.2-r3950-mvapich2-1.7    hpctoolkit/5.3.2-r3950-mvapich2-1.9 (default)
>   hpctoolkit/5.3.2-r3950-mvapich2-1.8
> **************************************
>
> This will be a relatively long email, as I have tried to be thorough. I have
> attached my
>
> I have run serial jobs that give reasonable results.
>
> Below is my script for k-point parallel jobs (attached).
>
> An example .machines file generated by kp.pbs is also attached.
>
> There are 47 k-points in case.klist and 47 cores in the .machines file.
>
> The lapw0 and lapw1_1 through lapw1_47 error files are all empty.
>
> **************************************
> [osu6811 at oakley02 ti-c-24]$ cat lapw2.error
> Error in LAPW2
>  'LAPW2' - can't open unit: 30
>  'LAPW2' -        filename: ti-c-24.energy_11
> **  testerror: Error in Parallel LAPW2
> [osu6811 at oakley02 ti-c-24]$
> **************************************
>
> Not only is case.energy_11 missing, but so is case.energy_12, and the
> pattern repeats every 12 files: case.energy_11, _12, _23, _24, _35, _36
> and _47 are all missing.
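The missing-file pattern can be confirmed directly in the case directory. This is a minimal sketch, assuming Python 3 is available on the cluster and the script is run from the case directory; missing_energy_files and slot_in_block are hypothetical helpers, and the block size of 12 is an assumption (it matches the per-node weights of 12 in the reply above, i.e. jobs packed onto 12-core nodes in order), not something the logs establish.

```python
from pathlib import Path


def missing_energy_files(case, nkpt, directory="."):
    """Return every N in 1..nkpt for which <case>.energy_N is absent."""
    base = Path(directory)
    return [n for n in range(1, nkpt + 1)
            if not (base / f"{case}.energy_{n}").exists()]


def slot_in_block(n, block=12):
    """1-based position of parallel job N within consecutive blocks of `block` jobs."""
    return (n - 1) % block + 1


if __name__ == "__main__":
    # "ti-c-24" and the 47 k-points are taken from the report; adjust for other cases.
    for n in missing_energy_files("ti-c-24", 47):
        print(f"ti-c-24.energy_{n} missing (slot {slot_in_block(n)} of 12)")
```

For the indices reported (11, 12, 23, 24, 35, 36, 47), slot_in_block returns 11, 12, 11, 12, 11, 12, 11, so every missing file sits in the last two positions of a block of 12. If the jobs really are laid out 12 per node in order, that would point at the same two core slots on each node, which is worth comparing against the generated .machines file.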
>
> The contents of a case.dayfile are attached.
>
> Note this line in the dayfile: "running lapw0 in single mode". Apparently
> lapw0 is not run in parallel, and we may owe its successful termination to
> that fact.
>
> Information on the login node:
>
> **************************************
> [osu6811 at oakley01 ~]$ lsb_release -a
> LSB Version:    :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
> Distributor ID: RedHatEnterpriseServer
> Description:    Red Hat Enterprise Linux Server release 6.3 (Santiago)
> Release:        6.3
> Codename:       Santiago
> [osu6811 at oakley01 ~]$ uname -a
> Linux oakley01.osc.edu 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
> [osu6811 at oakley01 ~]$ cat /proc/version
> Linux version 2.6.32-220.7.1.el6.x86_64 (mockbuild at x86-002.build.bos.redhat.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Fri Feb 10 15:22:22 EST 2012
> [osu6811 at oakley01 ~]$ ifort -v
> ifort version 12.1.4
> [osu6811 at oakley01 ~]$ gcc -v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla
> --enable-bootstrap --enable-shared --enable-threads=posix
> --enable-checking=release --with-system-zlib --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-gnu-unique-object
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk
> --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
> --enable-libgcj-multifile --enable-java-maintainer-mode
> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
> --build=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
> [osu6811 at oakley01 ~]$
> **************************************
>
> I am running WIEN2k version 12.1.
>
> My .bash_profile:
>
> **************************************
> [osu6811 at oakley02 ~]$ cat .bash_profile
> [ -f .bashrc ] && source .bashrc
> [osu6811 at oakley02 ~]$
> **************************************
>
> I have attached my .bashrc.
>
> Thank you for taking the time, and I apologize in advance for the length of
> this email.
>
> Appreciatively,
>
> Robert Nichol



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Györgyi

