[Wien] incorrect distribution over parallel nodes

Peter Blaha pblaha at zeus.theochem.tuwien.ac.at
Wed Nov 24 23:04:29 CET 2004


This is probably due to the attempted load balancing, which is fine and
desirable as long as the vector files are not on node-local scratch disks.
Include   granularity:1    in .machines
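
As a sketch, a .machines file with this setting could look like the
following. Only the first three of the 25 hosts from the quoted file are
shown, and placing the granularity line after the host list is an
assumption here, not something stated in this message.

```
1:compute-3-35.local
1:compute-3-34.local
1:compute-3-33.local
granularity:1
```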

This should fix the distribution: k-points 1+2 go to node 1, 3+4 to node 2, and so on.
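
The intended fixed distribution can be illustrated with a little
arithmetic. The sketch below is written in POSIX sh (not the csh used in
the quoted message) and assumes 50 k-points, 25 equally weighted nodes,
and a contiguous block split. It is not the logic of the WIEN2k parallel
scripts, and the helper name block_split is hypothetical.

```shell
#!/bin/sh
# Illustration only: contiguous block split of k-points over nodes.
# Assumes the number of k-points is an exact multiple of the node count.
# block_split is a hypothetical helper, not part of WIEN2k.
block_split() {
    nk=$1          # total number of k-points
    nnodes=$2      # number of nodes
    per_node=$((nk / nnodes))
    node=1
    while [ "$node" -le "$nnodes" ]; do
        first=$(( (node - 1) * per_node + 1 ))
        last=$(( node * per_node ))
        echo "node $node: k-points $first-$last"
        node=$((node + 1))
    done
}

block_split 50 25
```

For 50 k-points on 25 nodes this prints "node 1: k-points 1-2",
"node 2: k-points 3-4", and so on up to "node 25: k-points 49-50".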


> I found that with a relatively large number of nodes
> in simple parallelization the script sometimes
> distributes the k-points over the nodes incorrectly,
> which later results in a crash. Here is an example.
> 50 k-points were executed on 25 machines. This is the
> .machines file:
> 1:compute-3-35.local
> 1:compute-3-34.local
> 1:compute-3-33.local
> 1:compute-3-32.local
> <skipped>
> 1:compute-3-11.local
> 1:compute-3-10.local
> 
> Now see the result of running the following command:
> >foreach f ( `sed 's/1://' .machines` )
> > echo $f
> > ssh $f "ls -trs /scratch/mazin/RUN5*"
> > echo ""
> >end
> 
> Note the output below:
> 
> compute-3-35.local
> 1680 /scratch/mazin/RUN5.vectorup_1
> 1780 /scratch/mazin/RUN5.vectorup_26
> 1680 /scratch/mazin/RUN5.vectordn_1
> 1780 /scratch/mazin/RUN5.vectordn_26
> 6608 /scratch/mazin/RUN5.vectorsoup_1
> 6608 /scratch/mazin/RUN5.vectorsodn_1
> 7012 /scratch/mazin/RUN5.vectorsoup_26
> 7012 /scratch/mazin/RUN5.vectorsodn_26
> 
> compute-3-34.local
> 1716 /scratch/mazin/RUN5.vectorup_2
> 1800 /scratch/mazin/RUN5.vectorup_27
> 1716 /scratch/mazin/RUN5.vectordn_2
> 1800 /scratch/mazin/RUN5.vectordn_27
> 6748 /scratch/mazin/RUN5.vectorsoup_2
> 6748 /scratch/mazin/RUN5.vectorsodn_2
> 7092 /scratch/mazin/RUN5.vectorsoup_27
> 7092 /scratch/mazin/RUN5.vectorsodn_27
> 
> <skipped 1 "correct" node >
> 
> compute-3-32.local
> 1768 /scratch/mazin/RUN5.vectorup_4
> 1744 /scratch/mazin/RUN5.vectorup_35
> 1768 /scratch/mazin/RUN5.vectordn_4
> 1800 /scratch/mazin/RUN5.vectordn_29
> 6960 /scratch/mazin/RUN5.vectorsoup_4
> 6960 /scratch/mazin/RUN5.vectorsodn_4
>    0 /scratch/mazin/RUN5.vectorup_29
>    0 /scratch/mazin/RUN5.vectorsoup_29
>    0 /scratch/mazin/RUN5.vectorsodn_29
> 
> <skipped 6 "incorrect" nodes>
> 
> compute-3-25.local
> 1792 /scratch/mazin/RUN5.vectorup_11
> 1744 /scratch/mazin/RUN5.vectorup_36
> 1792 /scratch/mazin/RUN5.vectordn_11
> 1744 /scratch/mazin/RUN5.vectordn_36
> 7056 /scratch/mazin/RUN5.vectorsoup_11
> 7056 /scratch/mazin/RUN5.vectorsodn_11
> 6860 /scratch/mazin/RUN5.vectorsoup_36
> 6860 /scratch/mazin/RUN5.vectorsodn_36
> 
> <the rest is correct>


                                      P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-15671             FAX: +43-1-58801-15698
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------



