Introduction
My name is Toru, an engineer at PFN, and I am part of the Cluster Services team which develops and operates the machine learning infrastructure. Recently, in addition to our in-house infrastructure, we have also begun developing and operating Preferred Computing Platform.
At PFN, we use Kubernetes as the orchestrator for running containers on our machine learning platform and operate it on a daily basis. This article describes how feedback we received from internal users during day-to-day operation of the platform was turned into a contribution back to upstream Kubernetes.
PFN’s Cluster Team
The cluster team at PFN, which operates and maintains our machine learning infrastructure, is also committed to keeping up with Kubernetes version upgrades. We continuously update our Kubernetes clusters to incorporate bug fixes, stability improvements, and performance enhancements, and we strive to deliver new features to our users promptly. Specifically, we regularly upgrade to one version behind the latest. Furthermore, on our internal machine learning platform, the cluster team works closely with our in-house engineers and researchers, which results in frequent feedback.
Out-Of-Memory Issue After Upgrading to Kubernetes v1.28
One day, after upgrading to Kubernetes v1.28, we received feedback that jobs which had previously succeeded started failing due to OOM (Out-Of-Memory). Upon closer examination, it turned out that the job ran workloads in which the memory usage of each entry could not be known until it was actually processed, and the job was implemented so that if a child process was OOM-killed, the parent would move on and compute another entry instead.
Similarly, another team reported an issue with their interactive workload that uses make with the -k (--keep-going) option, which allows a build to continue even if part of it fails due to OOM (e.g., make -j 20 -k || make -j 10).
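To make the impact concrete, here is a rough Go sketch of the first kind of workload: a parent process runs each entry in a child process and, if the child appears to have been OOM-killed, simply skips it and moves on. The ./process-entry binary and the entry names are hypothetical and this is only an illustration, not PFN's actual job code. With memory.oom.group enabled, the parent in this loop is killed together with the child, so the fallback logic never gets a chance to run.

package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// runEntry processes one entry in a child process and reports whether the
// child appears to have been OOM-killed (i.e. terminated by SIGKILL).
func runEntry(entry string) (oomKilled bool, err error) {
	cmd := exec.Command("./process-entry", entry) // hypothetical worker binary
	err = cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		if status, ok := exitErr.Sys().(syscall.WaitStatus); ok &&
			status.Signaled() && status.Signal() == syscall.SIGKILL {
			// SIGKILL with no other explanation is most likely the kernel OOM killer.
			return true, nil
		}
	}
	return false, err
}

func main() {
	for _, entry := range []string{"entry-1", "entry-2", "entry-3"} {
		oom, err := runEntry(entry)
		switch {
		case oom:
			fmt.Printf("%s was OOM-killed; skipping and moving on\n", entry)
		case err != nil:
			fmt.Printf("%s failed: %v\n", entry, err)
		default:
			fmt.Printf("%s finished\n", entry)
		}
	}
}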
Through internal investigation, we found that the cause was a PR introduced in v1.28 (https://github.com/kubernetes/kubernetes/pull/117793). This PR uses memory.oom.group, available in cgroup v2, to forcibly terminate all processes in a container if any process inside it is OOM-killed. memory.oom.group itself is explained in more detail in a later section.
Previously, when an OOM occurred in a Pod, only one child process might be killed while the container as a whole could keep running. Mature multi-process software can handle such child-process OOMs correctly, but for much other software it is more robust to treat the whole container as OOMed and forcibly terminate it whenever any of its processes becomes unstable due to OOM.
Support for cgroup v2 has been Generally Available (GA) since Kubernetes v1.25. PFN switched to cgroup v2 in December 2022, at the same time as upgrading our Kubernetes version to v1.25. Consequently, the change in this PR was a disruptive one for PFN's machine learning platform, which had already transitioned to cgroup v2.
What is memory.oom.group?
In cgroup v2, memory.oom.group is one of the interface files of the memory controller. Writing 1 to this file causes the tasks in the cgroup and its descendants to be treated as a group and OOM-killed collectively. With the default value of 0, only the individual task selected by the OOM killer is killed.
To further understand the behavior of memory.oom.group, let’s use an actual example. First, prepare a cgroup named oom-test and set a memory limit of 128MB. We ensure OOM occurs by disabling the use of swap.
$ sudo mkdir /sys/fs/cgroup/oom-test
$ echo 0 | sudo tee /sys/fs/cgroup/oom-test/memory.swap.max
0
$ echo 128M | sudo tee /sys/fs/cgroup/oom-test/memory.max
128M
When memory.oom.group is disabled
Checking the current settings, memory.oom.group is at 0, meaning it’s disabled. Whether an OOM has occurred in the specific cgroup can be checked with the memory.events file:
$ cat /sys/fs/cgroup/oom-test/memory.oom.group
0
$ cat /sys/fs/cgroup/oom-test/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
oom_group_kill 0
Run stress-ng with two worker processes that attempt to use 200MB of memory. As expected, OOM occurs, and the logs show that only one of the two processes is OOM-killed at a time. Since stress-ng as a whole is never OOM-killed, terminate it with Ctrl+C:
$ bash -c 'echo $$ | sudo tee /sys/fs/cgroup/oom-test/cgroup.procs && stress-ng --vm 2 --vm-bytes 200M --vm-hang 0 -v | grep OOM'
2909100
stress-ng: debug: [2909108] vm: assuming killed by OOM killer, restarting again (instance 1)
stress-ng: debug: [2909107] vm: assuming killed by OOM killer, restarting again (instance 0)
...
stress-ng: debug: [2909107] vm: assuming killed by OOM killer, restarting again (instance 0)
stress-ng: debug: [2909108] vm: assuming killed by OOM killer, restarting again (instance 1)
^C
Checking memory.events shows multiple OOM and OOM kills have occurred:
$ cat /sys/fs/cgroup/oom-test/memory.events
low 0
high 0
max 8916
oom 470
oom_kill 30
oom_group_kill 0
When memory.oom.group is enabled
Next, let’s enable memory.oom.group and conduct a similar experiment.
$ echo 1 | sudo tee /sys/fs/cgroup/oom-test/memory.oom.group
1
Applying load with stress-ng as before, this time, unlike the previous run, both stress-ng processes are terminated due to OOM:
$ bash -c 'echo $$ | sudo tee /sys/fs/cgroup/oom-test/cgroup.procs && stress-ng --vm 2 --vm-bytes 200M --vm-hang 0 -v'
2909320
stress-ng: debug: [2909320] invoked with 'stress-ng --vm 2 --vm-bytes 200M --vm-hang 0 -v' by user 1000 'utam0k'
stress-ng: debug: [2909320] stress-ng 0.17.06
...
stress-ng: info: [2909320] dispatching hogs: 2 vm
stress-ng: debug: [2909320] starting stressors
stress-ng: debug: [2909320] 2 stressors started
stress-ng: debug: [2909325] vm: [2909325] started (instance 0 on CPU 14)
stress-ng: debug: [2909326] vm: [2909326] started (instance 1 on CPU 10)
stress-ng: debug: [2909325] vm: using method 'all'
zsh: killed     bash -c
Checking memory.events, you can see that the oom_group_kill counter has increased. You can also confirm from dmesg that the OOM kill occurred:
$ cat /sys/fs/cgroup/oom-test/memory.events
low 0
high 0
max 8952
oom 472
oom_kill 36
oom_group_kill 1
$ sudo dmesg | grep 2908280
[4745446.656277] [2908280] 1000 2908280 15858 799 351 448 0 81920 0 0 stress-ng-vm
[4745446.659536] Memory cgroup out of memory: Killed process 2908280 (stress-ng-vm) total-vm:63432kB, anon-rss:1404kB, file-rss:1792kB, shmem-rss:0kB, UID:1000 pgtables:80kB oom_score_adj:0
$ sudo rmdir /sys/fs/cgroup/oom-test
As the examples above show, the memory.oom.group setting significantly affects the behavior of containers and Pods. Starting with Kubernetes v1.28, when cgroup v2 is in use, memory.oom.group is unconditionally set to 1.
Introducing the memory.oom.group setting to Kubernetes
As previously discussed, the behavior of the OOM killer differs depending on whether memory.oom.group is enabled. Both as PFN and as members of the community, we believe that whether to enable memory.oom.group should be up to the users. Consequently, we provided feedback upstream and eventually got a Pull Request merged.
Internally, because this issue needed an urgent fix, we had applied a custom patch to kubelet. However, given the technical debt it creates, operating with such patches should only be temporary, and a fix in upstream Kubernetes is the ideal outcome. We therefore worked on getting these changes into upstream, and engaging with the Kubernetes community was essential in this process.
Taking Over the Pull Request
We initially commented that an option or flag to disable memory.oom.group was needed to address this issue. This was discussed in SIG Node, and it was decided to add an option to disable the feature. Following this, a Pull Request adding the actual flag was submitted and received /lgtm. However, there was an opinion that if this feature were to be added, it should be controllable at the container level, which led to a stalemate. Eventually, the Pull Request was closed due to the author's time constraints.
Around the same time, there was also a discussion in Kubernetes about putting cgroup v1 support into maintenance mode. Pointing out that the number of users migrating to cgroup v2 would soon increase, so more users would face this issue, and that while container-level control would be ideal, it would take time to deliver that functionality to users, we at PFN took over the closed Pull Request to get this singleProcessOOMKill setting into upstream. The Pull Request we took over is the following:
https://github.com/kubernetes/kubernetes/pull/126096
Since the previous Pull Request had already received /lgtm, I initially expected the remaining work to involve little implementation change and consist mostly of discussion. In reality, however, consideration of different environments and settings led to discussions, also involving SIG API Machinery, about how the flag should be designed, and that part had to be re-implemented several times.
Adding the singleProcessOOMKill Flag
This feature is specific to Linux's cgroup v2, so there was discussion about what to do, for instance, when the flag is set in kubelet-supported environments that still use cgroup v1, or what validation should be in place in Windows environments. With heterogeneous deployments in mind, various options were debated, such as ignoring the setting and emitting only a warning log. After much discussion, singleProcessOOMKill ended up behaving as follows (a rough sketch of the corresponding validation logic follows the table).
kubelet environment | singleProcessOOMKill | kubelet startup | Pod behavior | Note |
cgroup v2 | true | Succeeds | The OOM killer kills only the task that was OOMed. | None |
cgroup v2 | false or unset (default) | Succeeds | The OOM killer kills the entire cgroup containing the OOMed task. | None |
cgroup v1 | false | Fails and is marked as invalid | – | memory.oom.group does not exist in v1, hence the failure. |
cgroup v1 | true or unset (default) | Succeeds | The OOM killer kills only the task that was OOMed. | There is no corresponding setting in v1, but the default v1 behavior already matches, so nothing needs to be done and it is not invalid. |
Windows | true / false | Fails and is marked as invalid | – | kubelet supports Windows, but this feature is not applicable there. |
Other OS | true / false | Fails and is marked as invalid | – | The feature is not supported. |
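Read as validation rules, the table above can be sketched roughly as follows. This is an illustrative sketch under the assumptions summarized in the table, not the actual kubelet code; the function and helper names are made up.

package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// isCgroup2UnifiedMode reports whether the host uses the cgroup v2 unified
// hierarchy, by checking the filesystem type of /sys/fs/cgroup.
func isCgroup2UnifiedMode() bool {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false
	}
	return st.Type == unix.CGROUP2_SUPER_MAGIC
}

// validateSingleProcessOOMKill mirrors the rules in the table above:
// the flag may only be set explicitly on Linux, and false (group kill)
// additionally requires cgroup v2, because memory.oom.group does not exist
// in cgroup v1.
func validateSingleProcessOOMKill(cfg *bool) error {
	if cfg == nil {
		return nil // unset: the default is resolved later, at kubelet startup
	}
	if runtime.GOOS != "linux" {
		return fmt.Errorf("singleProcessOOMKill is only supported on Linux")
	}
	if !*cfg && !isCgroup2UnifiedMode() {
		return fmt.Errorf("singleProcessOOMKill=false requires cgroup v2 (memory.oom.group)")
	}
	return nil
}

func main() {
	off := false
	fmt.Println(validateSingleProcessOOMKill(nil), validateSingleProcessOOMKill(&off))
}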
How to determine the default value of singleProcessOOMKill when it is not explicitly set was also a topic of discussion. As the table above shows, the default depends on the cgroup version of the runtime environment: true for cgroup v1 and false for cgroup v2. In other words, the default value depends on the host environment, which raises the question of when the default should be decided. In the initial implementation of the Pull Request, the default was determined when the configuration was loaded. However, this approach can cause problems in heterogeneous clusters where different cgroup versions or different OSs are mixed. For example, there have been issues when kubelet configuration files generated on Linux were used on Windows nodes.
To avoid such problems, the default value is now decided when kubelet starts up. This allows the decision to be based on the environment kubelet actually runs in, although it also means the default value is not known until kubelet has started.
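Under that design, the defaulting itself becomes trivial once kubelet knows which cgroup version the node is actually running: a nil value becomes false on cgroup v2 (group kill, matching memory.oom.group) and true on cgroup v1. A minimal sketch, again with made-up names rather than the actual kubelet code:

package main

import "fmt"

// resolveSingleProcessOOMKill resolves the default at kubelet startup:
// an explicit value always wins; otherwise the default follows the cgroup
// version detected on the node kubelet actually runs on (for example via
// the isCgroup2UnifiedMode helper sketched above).
func resolveSingleProcessOOMKill(cfg *bool, cgroupV2 bool) bool {
	if cfg != nil {
		return *cfg
	}
	// cgroup v2 can use memory.oom.group, so group kill (false) is the default;
	// cgroup v1 keeps the kernel's native single-process OOM kill (true).
	return !cgroupV2
}

func main() {
	fmt.Println(resolveSingleProcessOOMKill(nil, true))  // false: whole-cgroup kill on cgroup v2
	fmt.Println(resolveSingleProcessOOMKill(nil, false)) // true: single-process kill on cgroup v1
}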
The implementation for toggling memory.oom.group is straightforward. Writing memory.oom.group to the cgroup is not something kubelet does itself; it is applied by the OCI runtime, whose role is to set up such values.
Therefore, kubelet simply follows the OCI Runtime Specification and sets memory.oom.group to 1 in the container's resource settings. With that in mind, the following is the implemented code as of v1.32:
if isCgroup2UnifiedMode() && !ptr.Deref(m.singleProcessOOMKill, true) {
    resources.Unified = map[string]string{
        // Ask the kernel to kill all processes in the container cgroup in case of OOM.
        // See memory.oom.group in https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html for
        // more info.
        "memory.oom.group": "1",
    }
}
Until v1.31, memory.oom.group was enabled whenever cgroup v2 was in use. The only difference from the code above is the conditional on the singleProcessOOMKill flag:
if isCgroup2UnifiedMode() {
    resources.Unified = map[string]string{
        // Ask the kernel to kill all processes in the container cgroup in case of OOM.
        // See memory.oom.group in https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html for
        // more info.
        "memory.oom.group": "1",
    }
}
Merged
The Pull Request was ultimately merged the day before the code freeze for Kubernetes v1.32. In practice, besides the flag design, the e2e tests from the previous Pull Request were also insufficient and needed revision.
Backporting to older versions was also considered, but it was decided against because the change did not meet the requirements of Kubernetes' cherry-pick guidelines. So if this issue is likely to be a problem for you, it is advisable to wait for Kubernetes v1.32 before enabling cgroup v2.
When we first started this effort, the Pull Request was not receiving much attention. However, as reactions began to appear, it became clear that users migrating to cgroup v2 were indeed struggling with this change in behavior, as we had initially feared. Eventually, the Pull Request gathered more reactions and references. There also appeared to be a similar issue with azuredisk related to memory.oom.group, which updating to Kubernetes v1.32 should resolve.
Conclusion
At PFN's Cluster Services, we actively make use of cutting-edge OSS features, provide feedback to the OSS communities, and contribute patches when necessary, as demonstrated in this case.
The Cluster Services team is currently looking for colleagues to contribute to OSS and develop and operate large-scale machine learning infrastructure together! You can apply for a casual interview here. Additionally, we are hiring for the following positions. If you are interested, please feel free to contact us!