Linux Symposium

Optimizing Shared Resource Contention in HPC Clusters

Sergey Blagodurov

Contention for shared resources in HPC clusters occurs when jobs execute concurrently on the same multicore node (competing for CPU time, shared caches, the memory bus, memory controllers, etc.) and when jobs concurrently access the cluster interconnect as their processes exchange data with one another. The cluster network is also used by the cluster scheduler to migrate jobs across nodes. Contention for shared cluster resources causes severe degradation of workload performance and stability and hence must be addressed. We found that state-of-the-art HPC cluster schedulers are not contention-aware. The goal of this work is the design, implementation, and evaluation of a scheduling algorithm that mitigates shared resource contention in an HPC cluster environment under Linux. Depending on the needs of the particular cluster and workload, several optimization goals can be pursued: general performance improvement for the cluster workload, a performance boost for selected jobs, or a reduction in power consumption.
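To make the idea concrete, the following is a minimal sketch (illustrative only, not the scheduler described in this work) of one simple contention-aware placement heuristic: jobs are ranked by a per-job contention score and dealt out round-robin across nodes so that the most memory-intensive jobs do not end up co-located on the same multicore node. The job names, scores, and node count are hypothetical inputs.

    # Illustrative sketch: spread memory-intensive jobs across nodes so that
    # jobs with high contention scores do not share the same multicore node.
    def spread_by_contention(jobs, num_nodes):
        """jobs: list of (job_id, contention_score); returns {node: [job_id, ...]}."""
        assignment = {n: [] for n in range(num_nodes)}
        # Sort from most to least contention-intensive, then deal them out
        # round-robin, so the most aggressive jobs land on different nodes.
        ranked = sorted(jobs, key=lambda j: j[1], reverse=True)
        for i, (job_id, _) in enumerate(ranked):
            assignment[i % num_nodes].append(job_id)
        return assignment

    if __name__ == "__main__":
        workload = [("lu", 9.1), ("mg", 7.4), ("ep", 0.6), ("cg", 5.2)]  # hypothetical scores
        print(spread_by_contention(workload, num_nodes=2))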

We currently use the last-level cache (LLC) miss rate as a metric describing how intensely jobs compete for the memory hierarchy within a cluster compute node, and the total traffic passed between job processes as a metric for the level of interconnect contention. We are now working on improving the precision of these metrics. For example, the LLC miss rate could be combined with additional parameters that describe the memory access patterns of the workload. The network traffic metric can be refined to distinguish between the different jobs executing on a node that send and receive traffic. We are also working on identifying other sources of performance degradation in HPC clusters and on devising descriptive metrics for them. We would be very interested to hear expert opinions on our ideas and, possibly, suggestions about the direction in which our research should go next.
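As an illustration of how the LLC miss rate might be sampled on Linux, the sketch below attaches perf to a running process and computes misses per load. The event aliases LLC-loads and LLC-load-misses are generic perf names that are not available on every CPU or kernel, and the PID and sampling window are assumptions made for the example.

    # Illustrative sketch: estimate a process's LLC miss rate with Linux perf.
    import subprocess

    def llc_miss_rate(pid, seconds=5):
        # perf stat in CSV mode (-x,) prints: value,unit,event,...  to stderr.
        cmd = ["perf", "stat", "-x", ",", "-e", "LLC-loads,LLC-load-misses",
               "-p", str(pid), "--", "sleep", str(seconds)]
        out = subprocess.run(cmd, capture_output=True, text=True).stderr
        counts = {}
        for line in out.splitlines():
            fields = line.split(",")
            if len(fields) > 3 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        loads = counts.get("LLC-loads", 0)
        misses = counts.get("LLC-load-misses", 0)
        return misses / loads if loads else 0.0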
