[Wien] Error in parallel execution

Marcos Veríssimo Alves marcos.verissimo.alves at gmail.com
Tue Jul 27 19:04:35 CEST 2010


Well, after bugging the poor sysadmin, we came up with a solution which, while
not very advisable, will work as a patch for the moment: opening up the write
permissions of my home directory on the AFS system. At least now I can run my
job on 45 processors :)

Cheers,

Marcos
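
One way to narrow down a problem like this is to check, from a process started
the same way the parallel scripts start theirs (for example through a
non-interactive ssh to a compute node), whether the case directory can be
entered and written to. The following is a minimal Python sketch, not part of
WIEN2k; the function name probe_directory is hypothetical. It attempts the
operations directly instead of inspecting permission bits, since on AFS access
is governed by directory ACLs and the caller's token, so mode bits changed with
chmod may not reflect what is actually allowed.

```python
import os
import sys
import tempfile


def probe_directory(path):
    """Return (can_enter, can_write) for `path` as seen by this process.

    Hypothetical helper for illustration: it tries a chdir and a temporary
    file creation, which is what a remote job effectively needs to do.
    """
    original = os.getcwd()
    try:
        os.chdir(path)  # same operation as the failing "cd" in the job log
    except OSError:
        return False, False
    finally:
        os.chdir(original)  # always return to where we started

    try:
        # Create and immediately remove a scratch file in the directory.
        with tempfile.NamedTemporaryFile(dir=path, prefix="probe_"):
            pass
    except OSError:
        return True, False
    return True, True


if __name__ == "__main__":
    # Directory to test: first argument, or the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else os.getcwd()
    can_enter, can_write = probe_directory(target)
    print(f"{target}: enter={can_enter} write={can_write}")
```

Running it once from an interactive login and once from inside the batch job
would show whether the two environments see different permissions on the same
directory, which is the symptom described in this thread.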

2010/7/27 Marcos Veríssimo Alves <marcos.verissimo.alves at gmail.com>

> Worst of all is that the disks are correctly mounted, and that from the
> command line I can do things like ls, even create and remove files. Only
> from within lapw1para it gives me an error. I am starting to insert lines
> with calls to unix utilities such as whoami in order to see what weird thing
> is going on there...
>
> Thanks all for the suggestions. If I track this bug down I'll let you know.
>
> Cheers,
>
> Marcos
>
>
> On Tue, Jul 27, 2010 at 5:26 PM, Laurence Marks <L-marks at northwestern.edu> wrote:
>
>> It is a system problem. Maybe the relevant disc is not mounted on the
>> remote node or something? Try doing a simple ssh to the node and test
>> things like ls, cd etc. Too many possibilities to list here. Good
>> luck, just try computer experiments until you track it down.....
>>
>> 2010/7/27 Marcos Veríssimo Alves <marcos.verissimo.alves at gmail.com>:
>> > Hi Laurence,
>> > I am not running mpi, only using rsh/ssh for the plain k-point
>> > parallelization. I couldn't really figure out how to make a .machines file
>> > to run parallel over k-points on mpi, with one processor per machine.
>> > However, I think Stefaan's tip has gone right to the point: in my job error
>> > file I get the following errors:
>> >  LAPW0 END
>> > .machinetmp222: No such file or directory
>> > bash: line 0: cd: /afs/atc.unican.es/u/m/mverissi/WIEN2k/sro1sto6:
>> > Permission denied
>> >  Cannot open error-file
>> > ERRFLG - couldn't open errorflag-file.
>> > The fact that from inside lapw1para the ssh command cannot cd to my home
>> > directory puzzles me... it seems to be a system problem, then. However, if
>> > you have any suggestions, they will be more than welcome!
>> > Thanks,
>> > Marcos
>> >
>> > On Tue, Jul 27, 2010 at 4:27 PM, Laurence Marks <L-marks at northwestern.edu> wrote:
>> >>
>> >> I doubt (although I may be wrong) that this has anything to do with
>> >> the OS. Do you have -traceback in your compile options? This will give
>> >> information as to which program this is happening in. Also, are you
>> >> running mpi or not?
>> >>
>> >> 2010/7/27 Marcos Veríssimo Alves <marcos.verissimo.alves at gmail.com>:
>> >> > Hi Stefaan and Laurence,
>> >> > @Stefaan: I will try it.
>> >> > @Laurence: it's the latest version, which I downloaded about two
>> >> > weeks ago. Hope this helps.
>> >> > Thanks,
>> >> > Marcos
>> >> > On Tue, Jul 27, 2010 at 3:47 PM, Laurence Marks <L-marks at northwestern.edu> wrote:
>> >> >>
>> >> >> Is this the latest version, or an older one? Some changes were made in
>> >> >> the error file access in the latest version for mpi reasons.
>> >> >>
>> >> >> 2010/7/27 Marcos Veríssimo Alves <marcos.verissimo.alves at gmail.com>:
>> >> >> > Hi all,
>> >> >> >
>> >> >> > I am experiencing a problem in the execution in parallel over
>> >> >> > k-points.
>> >> >> >
>> >> >> > I have compiled the code successfully on a cluster running Debian
>> >> >> > Linux, with SGEEE as the queue system, using ssh as the means to
>> >> >> > launch the instances on the remote nodes, with /bin/bash as the
>> >> >> > shell. My script successfully creates a .machines file, but when I
>> >> >> > run runsp_lapw -p -NI -cc 0.0001, the process dies. This is
>> >> >> > because, for some reason, lapw1para is not able to write to the
>> >> >> > up(dn)lapw1_*.error files:
>> >> >> >
>> >> >> > forrtl: severe (47): write to READONLY file, unit 99, file
>> >> >> > /afs/atc.unican.es/u/m/mverissi/WIEN2k/sro1sto6/uplapw1_1.error
>> >> >> >
>> >> >> > And the same happens to the dnlapw1_*.error files.
>> >> >> >
>> >> >> > lapw0, on the other hand, runs fine. I have set up parallel
>> >> >> > execution successfully on my dual-core desktop using ssh, using
>> >> >> > pretty much the same stuff, and it runs perfectly well.
>> >> >> >
>> >> >> > Now, I have changed the write permissions of the directory (and all
>> >> >> > the files) with chmod -R ugo+rw /afs/atc.unican.es/u..., but to no
>> >> >> > avail. Has anyone experienced any problem like this before? Could
>> >> >> > there be any known (but obscure) reason why lapw1para would not be
>> >> >> > able to write to its files, but lapw0para would?
>> >> >> >
>> >> >> > Best regards,
>> >> >> >
>> >> >> > Marcos
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > Wien mailing list
>> >> >> > Wien at zeus.theochem.tuwien.ac.at
>> >> >> > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Laurence Marks
>> >> >> Department of Materials Science and Engineering
>> >> >> MSE Rm 2036 Cook Hall
>> >> >> 2220 N Campus Drive
>> >> >> Northwestern University
>> >> >> Evanston, IL 60208, USA
>> >> >> Tel: (847) 491-3996 Fax: (847) 491-7820
>> >> >> email: L-marks at northwestern dot edu
>> >> >> Web: www.numis.northwestern.edu
>> >> >> Chair, Commission on Electron Crystallography of IUCR
>> >> >> www.numis.northwestern.edu/
>> >> >> Electron crystallography is the branch of science that uses electron
>> >> >> scattering and imaging to study the structure of matter.