k8s-cluster-simulator: A simulator for evaluating Kubernetes schedulers

Daisuke Taniwaki

Engineer

Overview

We’re happy to release an open-source Kubernetes cluster simulator called k8s-cluster-simulator. The simulator is currently in alpha. It was created by Hidehito Yabuuchi, a 2018 PFN intern and part-time employee, together with his mentors, Daisuke Taniwaki and Shingo Omura. It simulates both the workloads of a Kubernetes cluster and the clock, so you can evaluate your Kubernetes scheduler without deploying it to a production cluster.

Motivation

We have large on-premise GPU clusters, on which researchers run ML jobs of widely varying durations via Kubernetes. One of our goals is to maximize GPU utilization for cost-effectiveness while giving all researchers reasonable access. To do this, we developed our own private Kubernetes scheduler and extender (e.g. kube-throttler). However, it’s hard to evaluate new logic in production: researchers are running jobs, so we should not change the scheduling logic and its fairness too often, and we certainly cannot deploy a buggy scheduler that stops the researchers’ work. Nor is it desirable to halt research in order to test new scheduling logic on large clusters. Therefore, we started to develop a scheduler simulator for Kubernetes.

Design

We believe the simulator should have the following properties:

Require as few changes to the scheduler’s implementation and interface as possible.
Simulate the clock, both to accelerate evaluations and to evaluate scheduling logic without interference from system latencies such as network delays and internal processing.
Simulate workloads as flexibly as possible.
Support various output formats for further analysis.

Architecture

Here’s the simple flow diagram.

The idea is simple. The simulator maintains a simulated clock and advances it by one tick at each step of the loop. At each step, the simulator asks the submitters whether they have pods to submit or delete at the current clock time, and passes the submitted pods to the scheduler. The scheduler returns bind and delete events, which the simulator uses to simulate resource management. Finally, the simulator writes the metrics of the simulation through the metrics loggers.
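The loop described above can be sketched in a few dozen lines of Go. This is a minimal illustration, not the simulator's actual API: the `Pod`, `Submitter`, `Scheduler`, and `Metrics` types, the `Run` function, and the single pool of GPUs are all assumptions made for this example. Pod deletion and pluggable metrics loggers are left out for brevity.

```go
package main

import "fmt"

// Pod is a hypothetical, heavily simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name     string
	GPUs     int
	Duration int // simulated ticks the pod runs once it is bound
}

// Submitter is asked at every tick which pods it wants to submit (assumed interface).
type Submitter interface {
	Submit(tick int) []Pod
}

// Scheduler picks the pending pods to bind given the free GPUs (assumed interface).
type Scheduler interface {
	Schedule(pending []Pod, freeGPUs int) (bound, stillPending []Pod)
}

// Metrics is one record written per tick.
type Metrics struct{ Tick, Running, Pending, FreeGPUs int }

type runningPod struct {
	pod    Pod
	endsAt int
}

// Run advances a simulated clock; no wall-clock time is involved.
func Run(ticks, totalGPUs int, submitters []Submitter, s Scheduler) []Metrics {
	free := totalGPUs
	var pending []Pod
	var active []runningPod
	var records []Metrics
	for tick := 0; tick < ticks; tick++ {
		// 1. Release the resources of pods that have finished by this tick.
		kept := active[:0]
		for _, r := range active {
			if r.endsAt <= tick {
				free += r.pod.GPUs
			} else {
				kept = append(kept, r)
			}
		}
		active = kept
		// 2. Ask every submitter for the pods to submit at this tick.
		for _, sub := range submitters {
			pending = append(pending, sub.Submit(tick)...)
		}
		// 3. Let the scheduler bind pods, then account for their resources.
		var bound []Pod
		bound, pending = s.Schedule(pending, free)
		for _, p := range bound {
			free -= p.GPUs
			active = append(active, runningPod{pod: p, endsAt: tick + p.Duration})
		}
		// 4. Record metrics for this tick.
		records = append(records, Metrics{tick, len(active), len(pending), free})
	}
	return records
}

// firstFit binds every pending pod that still fits, in submission order.
type firstFit struct{}

func (firstFit) Schedule(pending []Pod, free int) (bound, rest []Pod) {
	for _, p := range pending {
		if p.GPUs <= free {
			bound = append(bound, p)
			free -= p.GPUs
		} else {
			rest = append(rest, p)
		}
	}
	return bound, rest
}

// burst submits one 2-GPU, 2-tick job at tick 0 and another at tick 1.
type burst struct{}

func (burst) Submit(tick int) []Pod {
	if tick > 1 {
		return nil
	}
	return []Pod{{Name: fmt.Sprintf("job-%d", tick), GPUs: 2, Duration: 2}}
}

func main() {
	for _, m := range Run(5, 2, []Submitter{burst{}}, firstFit{}) {
		fmt.Printf("%+v\n", m)
	}
}
```

With two GPUs in total, the second job has to wait until the first one finishes, which shows up as a non-zero pending count at tick 1. Because the clock is just a loop counter, a long workload trace can be replayed as fast as the scheduler logic itself runs.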

And here’s the high-level class diagram.

We provide the following two customization points for scheduling simulations.

Submitters

You can simulate multiple users by adding any number and combination of submitters; when pods are submitted, and how many, is fully customizable through the submitter interface. For example, assume user A tends to submit more pods in the morning and user B tends to submit more pods in the evening. You can create a submitter for each user and plug both into the simulator.
Moreover, because submitters receive metrics from the simulator, they can change their behavior based on the state of the cluster, such as whether it is crowded.
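As a rough sketch of the two-user example, the following Go code models each user as a submitter that is busier during a given window of the day, plus a wrapper that stops submitting when the cluster looks crowded. All names here (`ClusterMetrics`, `Submitter`, `timeOfDayUser`, `backoff`) and the one-tick-per-hour clock are hypothetical and chosen for illustration; the simulator's real interface may look different.

```go
package main

import "fmt"

// ClusterMetrics is a hypothetical snapshot the simulator passes to submitters.
type ClusterMetrics struct {
	PendingPods int
}

// Pod is a simplified stand-in for a Kubernetes pod.
type Pod struct {
	Owner string
	Name  string
}

// Submitter returns the pods to submit at the given simulated hour (assumed interface).
type Submitter interface {
	Submit(hour int, m ClusterMetrics) []Pod
}

// timeOfDayUser submits `busy` pods per hour inside [from, to) and `idle` pods otherwise.
type timeOfDayUser struct {
	name       string
	from, to   int
	busy, idle int
}

func (u timeOfDayUser) Submit(hour int, _ ClusterMetrics) []Pod {
	n := u.idle
	if h := hour % 24; h >= u.from && h < u.to {
		n = u.busy
	}
	pods := make([]Pod, 0, n)
	for i := 0; i < n; i++ {
		pods = append(pods, Pod{Owner: u.name, Name: fmt.Sprintf("%s-h%d-%d", u.name, hour, i)})
	}
	return pods
}

// backoff wraps another submitter and submits nothing while the cluster is crowded.
type backoff struct {
	inner      Submitter
	maxPending int
}

func (b backoff) Submit(hour int, m ClusterMetrics) []Pod {
	if m.PendingPods > b.maxPending {
		return nil // react to the cluster state reported by the simulator
	}
	return b.inner.Submit(hour, m)
}

func main() {
	users := []Submitter{
		// User A is busiest in the morning.
		timeOfDayUser{name: "user-a", from: 8, to: 12, busy: 3, idle: 1},
		// User B is busiest in the evening and backs off when the queue is long.
		backoff{
			inner:      timeOfDayUser{name: "user-b", from: 18, to: 22, busy: 3, idle: 1},
			maxPending: 10,
		},
	}
	for _, hour := range []int{9, 19} {
		for i, u := range users {
			pods := u.Submit(hour, ClusterMetrics{PendingPods: 0})
			fmt.Printf("hour %d, submitter %d: %d pods\n", hour, i, len(pods))
		}
	}
}
```

Keeping the metrics-aware behavior in a separate wrapper means the same back-off policy can be combined with any submission pattern.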

Scheduler

You have two options for scheduler extensions, depending on how you customize your Kubernetes scheduler. The first option mimics the standard Kubernetes scheduler (kube-scheduler) and can be extended with Prioritizers, Extenders, and Predicates. If you customize your scheduling logic through these kube-scheduler extension points, this is the best approach. However, because kube-scheduler is a queue-based scheduler, you may want to implement more complicated scheduling logic that doesn’t fit a queue-based design, for example, scheduling a new set of pods together immediately after receiving multiple pod submissions. For this case, the second option lets you evaluate a scheduler through the interface defined in Kubernetes, using a thin wrapper function.
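To make the first, queue-based option more concrete, here is a minimal sketch of the filter-then-score idea behind predicates and prioritizers: predicates reject nodes that cannot host a pod, and prioritizers score the remaining nodes. The `Predicate` and `Prioritizer` function types, `scheduleOne`, and the two example policies are simplified assumptions for this illustration and do not match the real kube-scheduler or simulator signatures.

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the Kubernetes objects.
type Pod struct {
	Name string
	GPUs int
}

type Node struct {
	Name     string
	FreeGPUs int
}

// Predicate reports whether a node can host a pod (hypothetical signature).
type Predicate func(p Pod, n Node) bool

// Prioritizer scores a feasible node; higher is better (hypothetical signature).
type Prioritizer func(p Pod, n Node) int

// scheduleOne handles a single pod taken from the queue: it filters nodes with
// the predicates, sums the prioritizer scores, and returns the index of the
// best node, or -1 if no node is feasible. Ties go to the earlier node.
func scheduleOne(p Pod, nodes []Node, preds []Predicate, prios []Prioritizer) int {
	best, bestScore := -1, 0
	for i, n := range nodes {
		feasible := true
		for _, pred := range preds {
			if !pred(p, n) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		score := 0
		for _, prio := range prios {
			score += prio(p, n)
		}
		if best == -1 || score > bestScore {
			best, bestScore = i, score
		}
	}
	return best
}

// fitsGPUs is an example predicate: the node must have enough free GPUs.
func fitsGPUs(p Pod, n Node) bool { return p.GPUs <= n.FreeGPUs }

// leastLeftover is an example prioritizer that favors tight packing:
// the fewer GPUs left free after placement, the higher the score.
func leastLeftover(p Pod, n Node) int { return -(n.FreeGPUs - p.GPUs) }

func main() {
	nodes := []Node{{Name: "node-a", FreeGPUs: 8}, {Name: "node-b", FreeGPUs: 2}}
	preds := []Predicate{fitsGPUs}
	prios := []Prioritizer{leastLeftover}
	for _, p := range []Pod{{"small", 2}, {"large", 4}, {"huge", 16}} {
		if i := scheduleOne(p, nodes, preds, prios); i >= 0 {
			fmt.Printf("%s -> %s\n", p.Name, nodes[i].Name)
		} else {
			fmt.Printf("%s -> unschedulable\n", p.Name)
		}
	}
}
```

A scheduler of this shape decides one pod at a time, which is why logic that needs to look at several submitted pods together fits the second option better.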

Roadmap

We plan to implement the following features before the beta phase to support more realistic simulations of cluster environments.

More isolation between components (e.g. supporting an RPC interface for schedulers and submitters)
Provide common submitter implementations (e.g. submission patterns drawn from typical probability distributions such as uniform, binomial, and Poisson)
Support various cluster events (node failures, accidental pod failures, node addition/removal, etc.)
Support output formats that can be plotted with popular plotting tools (matplotlib, gnuplot, etc.)