[Wien] Parallel execution in clusters

Wed Dec 7 12:57:02 CET 2016

Dear Rajiv,

It is impossible to answer your question, since we have no idea what kinds of jobs you are trying to run (number of k-points, matrix size, memory requirements, etc.) or what the capabilities of your cluster are (e.g. I guess your nodes do not have 56 cores). It is best to work with your cluster administrator on these kinds of performance questions. You will have to test the performance and scaling of WIEN2k on your cluster yourself, since we do not have access to it.

A few hints:

*         The case.dayfile will contain output from the unix ‘time’ command between the various steps in the SCF cycle, which can help you determine which of the job options you choose is faster. For example, after lapw0 runs:

-------- .machine0 : 12 processors

3.254u 0.122s 0:02.36 142.7%    0+0k 0+120io 0pf+0w

The :log file also gives you the time each command is started.

*         WIEN2k can look bad on cluster CPU efficiency metrics (depending on the job) because it is I/O bandwidth intensive, i.e. reading and writing large vectors to disk takes walltime but not cputime.

*         A good question to ask is how your job is scaling with the resources you give it. If you give a job 32 cores and it completes an SCF cycle in 5 minutes, but the same job on 64 cores takes 4 minutes, you are probably hitting a scaling limit (usually node I/O) and wasting resources.

*         From your screenshot, it looks like you are running more than one job simultaneously on node001, since lapw1 and lapw2c are running at the same time. Running a more controlled test will help you better measure job performance and scaling.
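
As a rough illustration of the first and third hints, here is a minimal Python sketch that reads one ‘time’ line of the kind shown above and compares two runs. It assumes the line has the same csh-style layout as the lapw0 example (user seconds, system seconds, elapsed time, CPU percentage); the function names parse_time_line and parallel_efficiency are hypothetical helpers for this example and are not part of WIEN2k.

```python
import re

# Matches a csh-style 'time' summary such as
# "3.254u 0.122s 0:02.36 142.7%    0+0k 0+120io 0pf+0w"
# (assumed layout, taken from the lapw0 example above).
TIME_LINE = re.compile(
    r"^\s*(?P<user>\d+(?:\.\d+)?)u\s+"
    r"(?P<sys>\d+(?:\.\d+)?)s\s+"
    r"(?P<elapsed>\d+(?::\d+)*(?:\.\d+)?)\s+"
    r"(?P<pct>\d+(?:\.\d+)?)%"
)


def elapsed_to_seconds(text):
    """Convert 'ss.ss', 'm:ss.ss' or 'h:mm:ss' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60.0 + float(part)
    return seconds


def parse_time_line(line):
    """Return user, system and wall time in seconds, or None if no match."""
    match = TIME_LINE.match(line)
    if match is None:
        return None
    return {
        "user": float(match.group("user")),
        "sys": float(match.group("sys")),
        "wall": elapsed_to_seconds(match.group("elapsed")),
        "cpu_percent": float(match.group("pct")),
    }


def parallel_efficiency(cores_a, wall_a, cores_b, wall_b):
    """Measured speedup from run A to run B, divided by the ideal speedup."""
    speedup = wall_a / wall_b
    ideal = cores_b / cores_a
    return speedup / ideal


sample = "3.254u 0.122s 0:02.36 142.7%    0+0k 0+120io 0pf+0w"
timing = parse_time_line(sample)
print(timing)

# The 32-core / 5-minute versus 64-core / 4-minute example from above.
print(parallel_efficiency(32, 5.0, 64, 4.0))
```

For the 32-core versus 64-core example this gives a speedup of 1.25 against an ideal of 2, i.e. an efficiency of 0.625, which is the kind of number that suggests the extra cores are mostly wasted. A wall time much larger than user plus system time divided by the number of processes would point towards time spent waiting on I/O rather than computing.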

Beyond that I can’t be much help. Good luck and remember to search the mailing list!

--

Dr. Eamon McDermott

CEA Grenoble

DRT/LETI/DTSI/SCMC

From: Wien [mailto:wien-bounces at zeus.theochem.tuwien.ac.at] On Behalf Of Rajiv Chouhan
Sent: Saturday, December 03, 2016 19:59
To: A Mailing list for WIEN2k users <wien at zeus.theochem.tuwien.ac.at>
Subject: Re: [Wien] Parallel execution in clusters

Hi All,

Please reply to my previous email. It is very important to understand the code running in parallel mode in the cluster. 

Thank you,

Rajiv

On Wed, Nov 30, 2016 at 10:53 AM, Rajiv Chouhan <chouhanrajiv14 at gmail.com> wrote:

Hi,

I am using WIEN2k on a cluster with MPI parallelization. I have written a PBS script to run the job on the Slurm platform of the same cluster. I have attached a link to a snapshot of the jobs running on two nodes of the cluster, node001 and node004. The allocations on these nodes are 52 and 56, respectively. The top command shows the utilization given in the attached snapshot. Can anyone tell me which of the nodes is utilizing its full resources and which execution is faster? Again, the jobs of "raji..+" (on node001) were submitted with the PBS script, and those of the other users (on node004) were submitted with a Slurm script.

 https://www.dropbox.com/s/a8ikape71f43cva/Capture-02.jpg?dl=0

Thank you,

Rajiv
