[Wien] MPI parallelization
Jorissen Kevin
Kevin.Jorissen at ua.ac.be
Thu Apr 29 18:19:23 CEST 2004
Strange. From your previous e-mail, we learn that lapw1para goes into the main k-loop (that's where it says 'now creating klist 1'), but either something goes wrong when it tries to execute the MPI command, or it never gets that far.
The cat error surprises me: the only cat commands in the script are for editing the definition file, and I don't see a reason for them to crash.
Some things you can do:
* Check that the definition file is okay (you know, with the _1 suffix for some files; in fact, it should look exactly as it does for k-point parallelization).
* Since the klist_1 and def_1 are there, nothing stops you from launching the job yourself. What lapw1para actually executes is:
set ttt=(`echo $mpirun | sed -e "s^_NP_^$number_per_job[$p]^" -e "s^_EXEC_^${exe}_mpi ${def}_$loop.def^" -e "s^_HOSTS_^.machine[$p]^"`)
(cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]) >>.time1_$loop &
Maybe from this you can work out the necessary command yourself.
* To see exactly where things fail, edit lapw1para (vi $WIENROOT/lapw1para) and, in the first line, change /bin/csh -f to /bin/csh -xf. Now run the script again, but capture all output with nohup (nohup x lapw1 -p). You'll get a large file, nohup.out, containing every instruction as executed by the script.
Check where it died, and which cat is letting you down.
* Wait for a user who knows more about MPI and can help you better. I confess that I've never actually used MPI parallelization myself - I'm just very familiar with lapw1para.
* There are other files you may consult, e.g. :parallel, .lapw1para, etc.
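To make the second suggestion above a bit more concrete, here is a minimal sketch of what that sed line does, written for plain sh rather than csh so it can be tried outside the script. The template string, the process count and the machine file name are assumptions for illustration only (check what $mpirun is really set to on your installation); lapw1c and uplapw1 are simply taken from your directory listing.

```shell
#!/bin/sh
# Sketch only: mimics the placeholder substitution done in lapw1para.
# ASSUMPTION: this template is a typical setting, not necessarily yours.
mpirun_template='mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_'

# ASSUMPTION: hypothetical values for one parallel job.
np=2              # number of MPI processes for this job
hosts=.machine1   # machine file for this job
exe=lapw1c        # executable name, without the _mpi suffix
def=uplapw1       # base name of the definition file
loop=1            # index of the k-list / def file

# Same three substitutions as in the script; '^' is the sed delimiter.
cmd=$(echo "$mpirun_template" | sed \
    -e "s^_NP_^$np^" \
    -e "s^_EXEC_^${exe}_mpi ${def}_$loop.def^" \
    -e "s^_HOSTS_^$hosts^")

# Print the command instead of running it, so it can be inspected first.
echo "$cmd"
```

If the printed line looks sensible, you can paste it into the case directory and run it by hand to see whether the MPI launch itself is what fails.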
Good luck, we'll work this out,
Kevin.
-----Original Message-----
From: Griselda Garcia [mailto:ggarcia at fis.puc.cl]
Sent: Thu 4/29/2004 2:42 PM
To: wien at zeus.theochem.tuwien.ac.at
Cc:
Subject: Re: [Wien] MPI parallelization
Hello Kevin,
Thanks for your reply ...
> As to why lapw1 would not run when lapw0 would ...
> * you're sure your input is correct? (ie, x lapw1 works?)
Yes, I am sure, because I ran the same calculation with the serial
version and it finished OK.
> * has the program been compiled correctly? As many recent e-mails
> show, it's lapw1 which is tricky ...
The compilation was correct; neither errors nor warnings were reported.
> * maybe something in the setup of your cluster affects lapw1 but not lapw0
> (can't think of anything, though)
I do not know ... I will talk to the sysadmin again.
> Could you confirm that lapw1 HAS actually crashed? ie, that the
> partial error files contain an error message, that the output is
> clearly not complete ... It seems it takes the machine about 2
> seconds to crash, which is not much but enough for a simple test case.
Yes, lapw1 has actually crashed ... I do not have partial error files
... If I run just the MPI version of lapw1, even with the machines that
I showed in my previous mail, I get:
[griselda@clustersvr sd_v2]$ lapw1c_mpi uplapw1.def
Using 1 processors, My ID = 0
If I run the script runsp_lapw -p and check the def and error files
in the work dir, I get:
[griselda@clustersvr sd_v2]$ runsp_lapw -p &
[1] 2871
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
FORTRAN STOP LAPW0 END
cat: No match.
[1]+ Exit 1 runsp_lapw -p
[griselda@clustersvr sd_v2]$ ls *.def *.error
lapw0.def lapw0.error uplapw1_1.def uplapw1.def uplapw1.error
The cluster is configured in such a way that each node mounts the server's
home directory using NFS. Is that the right way to do it?
What should the sysadmin verify in the setup of the cluster to
get WIEN running?
Thanks a lot!!
Griselda.