What is the MN-Core runtime?
My name is Akira Kawata, and I am a member of the MN-Core™ compiler team. This article introduces the runtime of the MN-Core software stack. Here, “MN-Core” means the MN-Core series, including MN-Core 1 and MN-Core 2.
The MN-Core software stack is composed of two pieces of software: the compiler and the runtime. The compiler takes an ONNX file as input and emits MN-Core hardware instructions. In contrast, the runtime takes MN-Core hardware instructions and input data, sends the instructions to the MN-Core, and gets the result back from the MN-Core.
The runtime operates on a very different time scale from the compiler. Although the MN-Core compiler may spend more than 20 hours emitting optimized instructions, the runtime must work on the order of 1 ms to 100 ms. This is because, in machine learning workloads, the whole MN-Core runtime runs in every iteration, while the MN-Core compiler runs only once. This means that the MN-Core runtime may run more than one million times over the course of a single workload.
The part of the runtime trace that I optimized a few months ago is visualized below. The horizontal direction represents the passage of time, and the vertical direction represents the depth of function calls.
Structure of the MN-Core runtime
The MN-Core runtime is composed of three software layers.
The user’s Python code uses the Python layer. This layer lets users use MN-Core easily without caring about the differences between the MN-Core hardware and a CPU or GPU. In the ideal case, users can run their code on MN-Core simply by changing the device specifier from “cpu” to “mncore”.
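The idea of switching devices with a single string can be illustrated with a small sketch. This is not the MN-Core Python API: the names `register_backend`, `run`, and both backend functions are hypothetical, and the "mncore" backend here only imitates the CPU computation so that the example runs anywhere.

```python
# Hypothetical sketch: none of these names come from the MN-Core software stack.
from typing import Callable, Dict, List

Backend = Callable[[List[float]], List[float]]
_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str) -> Callable[[Backend], Backend]:
    """Register a function as the implementation for a device string."""
    def decorator(fn: Backend) -> Backend:
        _BACKENDS[name] = fn
        return fn
    return decorator


@register_backend("cpu")
def _run_on_cpu(xs: List[float]) -> List[float]:
    return [2.0 * x for x in xs]


@register_backend("mncore")
def _run_on_mncore(xs: List[float]) -> List[float]:
    # A real backend would hand the work to the lower runtime layers.
    # This stand-in computes the same result on the host.
    return [2.0 * x for x in xs]


def run(xs: List[float], device: str = "cpu") -> List[float]:
    """Run the same user-level computation on the selected device."""
    if device not in _BACKENDS:
        raise ValueError(f"unknown device: {device!r}")
    return _BACKENDS[device](xs)


# The user code stays the same; only the device string changes.
result_cpu = run([1.0, 2.0], device="cpu")
result_mncore = run([1.0, 2.0], device="mncore")
```

The point of the sketch is only that the device choice is isolated behind one argument, so the rest of the user code does not change.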
The Python layer depends on the C++ layer, which is the most complicated of the three. This layer abstracts the MN-Core hardware features into simple APIs and absorbs the differences between MN-Core 1 and MN-Core 2.
The C++ layer depends on the C layer. This layer comprises a Linux kernel module written in C and a thin wrapper around it. It exposes the MN-Core hardware as a Linux device file. There is no software under this layer.
Difficulty of abstracting MN-Core hardware
From here, I will introduce some of the difficulties in abstracting the MN-Core hardware. As I said above, the most complicated software in the MN-Core runtime is the C++ layer, and this is the layer where we implemented the techniques that overcome these difficulties.
To run applications on MN-Core 1, we must constantly feed instructions to the MN-Core board through the PCIe bus. MN-Core 1 has no memory or registers to hold the program or a program counter; it simply executes the instructions that flow in from the PCIe bus. This is possible because MN-Core has no branch instructions, and all processor elements execute the same instruction. This design simplifies the hardware and allows fine-grained control from the software.
However, because of this, we must supply instructions over the PCIe bus faster than the MN-Core chip consumes them while it runs. Otherwise, the MN-Core stalls, and the efficiency drops. MN-Core 2, which we announced recently, consumes instructions faster than MN-Core 1, and our runtime software could not supply instructions fast enough. As a result, we encountered a strange bug in which the efficiency of MN-Core 2 decreased when the host CPU was busy. Luckily, we were able to resolve this bug by rewriting the logic that copies instructions. It was an interesting bug, and one very specific to MN-Core.
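The effect of a slow instruction supply can be illustrated with a toy model. The following sketch is not a model of the real hardware or of PCIe: the function `simulate_stream`, the tick-based timing, and all the rates are assumptions made for illustration. It counts how many ticks a device spends stalled when the host supplies instructions more slowly than the device can execute them.

```python
# Toy model (assumption): time advances in ticks; each tick the host pushes
# up to `supply_per_tick` instructions and the device tries to execute up to
# `consume_per_tick` instructions from what has already arrived.
from typing import Tuple


def simulate_stream(total: int, supply_per_tick: int,
                    consume_per_tick: int) -> Tuple[int, int]:
    """Return (ticks, stalled_ticks) needed to execute `total` instructions."""
    if supply_per_tick <= 0 or consume_per_tick <= 0:
        raise ValueError("rates must be positive")
    sent = executed = buffered = 0
    ticks = stalled = 0
    while executed < total:
        ticks += 1
        # Host side: push the next batch of instructions.
        push = min(supply_per_tick, total - sent)
        sent += push
        buffered += push
        # Device side: execute as much as it wants, limited by what arrived.
        want = min(consume_per_tick, total - executed)
        done = min(want, buffered)
        buffered -= done
        executed += done
        if done < want:
            stalled += 1  # the device was starved during this tick
    return ticks, stalled


# Host faster than device: the device never waits.
fast_host = simulate_stream(total=100, supply_per_tick=10, consume_per_tick=5)
# Host slower than device: the device is starved in almost every tick.
slow_host = simulate_stream(total=100, supply_per_tick=5, consume_per_tick=10)
```

In this model, a device that could finish 100 instructions in 10 ticks takes 20 ticks when the host supplies only half of what it can consume, which mirrors the efficiency drop described above.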
Furthermore, we must send input data and instructions and get results from the device to run a machine-learning workload. We must send inputs before sending the instructions and get the results after the instructions. In addition to these procedures, we must also run preprocessing of the input data on the host computer. We must do all of these procedures in the specified order.
To run them in order, we implemented an OpenCL-like event API. We tuned this event API carefully so that it is fast enough in a multi-threaded environment.
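The ordering idea behind an event API can be sketched as follows. This is a minimal illustration and not the MN-Core runtime's implementation: the classes `Event` and `CommandQueue`, the `enqueue` signature, and the step names are all hypothetical, loosely modeled on how OpenCL commands return events that later commands can wait on.

```python
# Hypothetical sketch of an event-based ordering API; not the MN-Core API.
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


class Event:
    """Signals that one enqueued command has finished."""

    def __init__(self) -> None:
        self._done = threading.Event()

    def set_complete(self) -> None:
        self._done.set()

    def wait(self) -> None:
        self._done.wait()


class CommandQueue:
    """Runs commands on worker threads, honoring event dependencies."""

    def __init__(self, workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def enqueue(self, fn: Callable[[], None],
                wait_for: Sequence[Event] = ()) -> Event:
        event = Event()

        def task() -> None:
            for dep in wait_for:  # block until every dependency finished
                dep.wait()
            try:
                fn()
            finally:
                # Mark complete even on failure so dependents do not hang;
                # a real API would also propagate the error.
                event.set_complete()

        self._pool.submit(task)
        return event

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


log: List[str] = []
log_lock = threading.Lock()


def step(name: str) -> Callable[[], None]:
    def run_step() -> None:
        with log_lock:
            log.append(name)
    return run_step


cq = CommandQueue()
preprocessed = cq.enqueue(step("preprocess_input"))
input_sent = cq.enqueue(step("send_input"), wait_for=[preprocessed])
executed = cq.enqueue(step("send_instructions"), wait_for=[input_sent])
result_read = cq.enqueue(step("get_result"), wait_for=[executed])
result_read.wait()
cq.shutdown()
```

Because each command waits only on the events it names, independent commands are free to overlap on different threads, while the chain above always runs in the required order.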
The fun of MN-Core runtime
Lastly, I’ll introduce the fun parts of developing the MN-Core runtime.
What is exciting about the MN-Core runtime is that we can run the actual physical MN-Core device in front of us using our runtime. I visited the MN-Core 2 trial cluster the other day, and it was a moving experience to see that we are handling all the instructions and data in this cluster. The following picture shows MN-Core 2 working with our MN-Core runtime.
In this blog post, I introduced the MN-Core runtime, which sends instructions and input data to MN-Core and gets results back. There are still many other interesting aspects of MN-Core and its runtime that I, unfortunately, could not fit into this article. I particularly like its binary format and event API implementation, which we will present in another article. If you are interested in these topics, please apply for the position of compiler engineer for MN-Core.
Links to related articles