Exploring a oneAPI-based GPU-powered backend for scikit-learn

Context

At the soda-inria team, we experimented with developing a scikit-learn computational backend, called sklearn-numba-dpex, using the oneAPI-based numba_dpex JIT compiler, so that the backend can run on CPUs, integrated GPUs (iGPUs), and discrete GPUs alike. It comes along with the development of a plugin system in scikit-learn that opens up scikit-learn estimators to alternative backends.

The backend is written in Python using the Intel-led, numba-like numba_dpex JIT compiler, which translates Python functions into compiled kernels. numba_dpex is built on top of dpnp, a oneAPI-powered NumPy drop-in replacement, and dpctl, a lower-level library that exposes Python bindings for SYCL concepts such as queues and kernels, and in particular an Array-API-compatible array library, dpctl.tensor.

Goals

The goal of our study is to evaluate the potential of the oneAPI-based software stack in terms of ease of installation, interoperability, and performance on GPUs for popular scikit-learn estimators.

GPUs are known to be the preferred hardware for deep-learning applications, but they have also proved relevant for a wide range of other algorithms: k-means, nearest neighbors search, gradient-boosted trees… CPU-based implementations can be outshined, especially when the data is plentiful enough that the time needed to train an estimator becomes a bottleneck. We want to evaluate the point where GPUs really start to matter, and whether that point is a concern for typical scikit-learn use cases, either because it unlocks working with very large datasets, or because it restores interactive productivity on datasets that require tens of seconds or minutes of compute on CPU.

Targeted hardware, platform and OS

The main targeted hardware for this project is CPUs, Intel iGPUs (integrated graphics chipsets that mostly come embedded in laptops), and Intel’s latest discrete GPU series, namely the Flex and Max series for servers and the Arc series for consumer-grade computers.

While we obviously don’t expect performance on iGPUs to be comparable to discrete GPUs, it is particularly interesting to see how they compare to the CPUs they are embedded with. Many scikit-learn users might just want to run compute on their personal laptop, so any significant speed-up there is very valuable.

While the latest oneAPI releases do have CUDA and HIP support, numba_dpex is not yet able to compile and run kernels for NVIDIA GPUs and is not tested on AMD GPUs. Hence, choosing numba_dpex did not let us extend the scope of our benchmarks to other vendors. Preliminary investigations have shown potential for extending compatibility.

We ran the software on Linux distributions, on personal laptops as well as in the cloud, using the Intel developer edge cloud and the beta version of the new Intel Developer Cloud.

Installation

Installing a working numba_dpex runtime consists of three steps:

  • installing low-level drivers
  • installing low-level runtime libraries
  • installing high-level runtime libraries

For the first two steps, we used the official instructions that are available for Ubuntu on Intel dev cloud servers. It just works.

Note that this requires a sequence of somewhat unusual steps (like editing the GRUB configuration file), and it includes installing a specific version of the Linux kernel, which, while acceptable for a server dedicated to a single task, might be troublesome for general-purpose workstations. There are alternative installers that don’t require pinning the kernel version, but we have found those generally hard to use.

The high-level runtime libraries, including the oneAPI-based runtimes and the Python libraries, can be installed with conda by following our guide, which again has some complexity. For instance, it requires the use of vendor-specific channels instead of the more popular, community-maintained conda-forge channel.

We also provide a docker image that has proved to be stable and enables quickly spinning up ready-to-go environments. We’ve found that using Intel GPUs from within a docker container (using the --device=/dev/dri option) works well for all the GPU architectures we tested (iGPU and Max series).
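
For illustration, starting a container looks like the following (the image name here is a placeholder, refer to the repository’s README for the actual image and tag):

    # expose the host's Intel GPU(s) to the container through the DRI device nodes
    docker run --rm -it --device=/dev/dri -v $PWD:/workspace <sklearn-numba-dpex-image> bash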

Interoperability

numba_dpex and the higher-level libraries built on top of it are not yet compatible with AMD and NVIDIA GPUs, but the stack already showcases its potential by offering seamless compatibility with CPUs, Intel iGPUs and Intel discrete GPUs: the same code that we wrote and optimized for GPUs can be compiled and executed on CPU, and also works on iGPUs. This is awesome since it means:

  • I can start writing code on my usual workstation, be it a GPU-less computer, and only start benchmarking on high-performance hardware after having ensured that the code works (returns the expected outputs) on my daily-use machine.
  • a first level of continuous integration can easily be set up on GitHub to run unit tests on CPU. This is much cheaper, more accessible and simpler than requiring cloud access to GPU-powered VMs, which is typically necessary to set up continuous integration for software projects based on CUDA. For sklearn_numba_dpex, the automated unit-test pipeline is started with pytest:
    pytest -v sklearn_numba_dpex/
    and since the default GitHub Actions runners do not provide access to GPUs, the tests will run on CPU. Running the same command in a local environment that provides a GPU or an iGPU will instead automatically run the tests on the GPU. SYCL also has built-in environment variables that force execution on a specific device (as long as it is available):
    # force all SYCL instructions to run on CPU
    SYCL_DEVICE_FILTER=cpu pytest -v sklearn_numba_dpex/
    if one wants, for instance, to reproduce the GPU-less environment of a CPU-only CI runner.
  • being able to leverage iGPUs unlocks performance improvements on ordinary personal laptops, which will benefit an even wider part of the user base.

We’ve found that code that works on CPU does translate to code that works on GPU/iGPU, and vice versa. Except for minor quantitative differences in device parameters, the oneAPI concepts that programmatically describe devices are abstracted in a way that translates into working instructions for both GPU and CPU architectures.
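
For instance, the devices visible to the runtime, along with the few parameters that kernels typically need to adapt to, can be inspected with dpctl (a minimal sketch; attribute availability may vary slightly across dpctl versions):

    # List the SYCL devices visible to the runtime along with a few
    # parameters that kernels typically adapt to.
    import dpctl

    for device in dpctl.get_devices():
        print(
            device.name,
            "- max work-group size:", device.max_work_group_size,
            "- local memory (bytes):", device.local_mem_size,
        )

    # The device the runtime would pick by default (a GPU if one is available):
    print("default device:", dpctl.select_default_device().name)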

Note, however, that while the same code will always work on devices of a different nature, algorithm optimization generally specializes for one type of device architecture, so that good performance on a given device type will not translate into good performance on other types. For sklearn_numba_dpex we seek the best possible performance on GPUs; running the algorithms on CPU is really useful for testing, but the performance will be inferior, and we don’t include CPU runs of the sklearn_numba_dpex engine in the benchmarks. We rather compare against the scikit-learn and scikit-learn-intelex implementations, which are well optimized for CPU.

Performance of k-means on Intel iGPUs, Flex and Max series

We implemented a production-ready numba_dpex engine for k-means and devised a scikit-learn branch that exposes a plugin interface the engine can plug into. We can then ensure that our plugin implements the exact same specification as scikit-learn’s vanilla k-means, with the same quality control, by running the scikit-learn unit tests for k-means with the engine activated.

From both an arithmetic-throughput and a memory-usage point of view, the k-means implementation on GPU benefits from fusing the pairwise distance computation and the weight updates into the same set of parallel task dispatches, using computational tricks that are very specific to GPU architectures. This is typically not possible to reproduce with the high-level primitives available in array libraries such as NumPy or PyTorch. A low-level GPU implementation framework like numba_dpex is thus well adapted for k-means, and we expect this example to showcase both the performance improvement one can get from a GPU implementation of k-means and the power of the numba_dpex framework.
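
To make the point concrete, here is what the unfused, high-level formulation of one Lloyd iteration looks like with NumPy (a schematic sketch, not the backend’s actual code): the full n_samples × n_clusters distance matrix has to be materialized and written to memory before the assignment and the centroid update can happen, which is precisely the intermediate result that a fused GPU kernel keeps in registers and shared memory.

    import numpy as np

    def lloyd_iteration_unfused(X, centroids, sample_weight):
        """One k-means iteration with high-level array primitives (schematic).

        The (n_samples, n_clusters) distance matrix is fully materialized,
        which is the round trip to global memory that a fused kernel avoids.
        """
        # pairwise squared distances: ||x||^2 - 2 x.c + ||c||^2
        distances = (
            (X ** 2).sum(axis=1)[:, None]
            - 2.0 * (X @ centroids.T)
            + (centroids ** 2).sum(axis=1)[None, :]
        )
        labels = distances.argmin(axis=1)

        # weighted centroid update (empty clusters ignored for brevity)
        new_centroids = np.zeros_like(centroids)
        weight_in_clusters = np.zeros(centroids.shape[0], dtype=X.dtype)
        np.add.at(new_centroids, labels, X * sample_weight[:, None])
        np.add.at(weight_in_clusters, labels, sample_weight)
        return new_centroids / weight_in_clusters[:, None], labels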

Methodology:

Per-device tuning of performance parameters

GPU implementations expose tunable performance parameters, such as the size of the allocated shared memory, the width of contiguous memory read/write steps, the number of iterations per thread, or the size of groups of threads… We found that the performance on a given device can be very sensitive to variations of these parameters, and that the set of best parameters for one device can yield bad performance on another device.

The performance reported hereafter for each device was measured after tuning the performance parameters, by automatically grid-searching the parameter space on small-ish inputs and keeping the combination that gave the best benchmark.
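
In spirit, the tuning procedure boils down to the following (a simplified sketch with illustrative parameter names, not the actual search space):

    import itertools
    import time

    # Illustrative performance parameters; real names and ranges differ.
    PARAM_GRID = {
        "work_group_size": [64, 128, 256, 512],
        "sub_group_size": [8, 16, 32],
        "samples_per_work_item": [1, 2, 4],
    }

    def tune(run_once):
        """Return the parameter combination with the lowest walltime.

        run_once(**params) is expected to run the kernel(s) once on a
        small-ish input with the given performance parameters.
        """
        best_params, best_time = None, float("inf")
        for values in itertools.product(*PARAM_GRID.values()):
            params = dict(zip(PARAM_GRID, values))
            run_once(**params)  # warm-up, excludes JIT compilation
            start = time.perf_counter()
            run_once(**params)
            walltime = time.perf_counter() - start
            if walltime < best_time:
                best_params, best_time = params, walltime
        return best_params, best_time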

In practice, our repository sklearn_numba_dpex currently uses a set of parameters that we have found optimal for the Max series GPUs, but that can translate into worse performance on the other tested devices.

To unlock best performance for all devices and all algorithms, further work should be considered:

  • either maintaining a catalogue of best performance parameters for all supported devices
  • or equipping the JIT compiler with an autotuner that runs a parameter search on the user’s hardware right after compilation.

JIT compilation time

The JIT compilation time is subtracted from the total walltime in the benchmarks. In practice it is not negligible but remains reasonable: for k-means, expect a few seconds of compilation time on the first calls to the fit and predict methods; once compiled, the binaries are cached for the remainder of the session.
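
Measuring that overhead is straightforward: time a first call, which triggers compilation, against a second identical call that hits the binary cache (a small sketch, where fit_once stands for whatever compiled entry point is being measured):

    import time

    def compilation_overhead(fit_once):
        """Estimate JIT compilation time as first call minus second call."""
        start = time.perf_counter()
        fit_once()  # triggers compilation of the kernels
        first = time.perf_counter() - start

        start = time.perf_counter()
        fit_once()  # binaries are now cached for the session
        second = time.perf_counter() - start
        return first - second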

The following figures display benchmark walltimes on CPU (with the scikit-learn-intelex and scikit-learn CPU-optimized implementations), on iGPU, and on discrete GPUs (using numba_dpex), for k=127 clusters, on a dataset with a small number of dimensions (d=14), in line with typical k-means use cases, and 50 million samples. We measure performance over 100 k-means iterations. The benchmark can be reproduced using our benchmark script.

Data source, configuration details and instructions (1)
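
For a rough idea of the CPU side of this benchmark, the measurement boils down to timing 100 Lloyd iterations of scikit-learn’s KMeans on data of the same shape (a simplified sketch on synthetic data rather than the actual benchmark script):

    import time

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Same shape as the benchmark: 50 million samples, 14 features, float32
    # (scale n_samples down if this does not fit comfortably in RAM).
    X = rng.random((50_000_000, 14), dtype=np.float32)

    kmeans = KMeans(
        n_clusters=127,
        init="random",
        n_init=1,
        max_iter=100,
        tol=0.0,  # disable tolerance-based early stopping
        algorithm="lloyd",
    )

    start = time.perf_counter()
    kmeans.fit(X)
    print(f"{kmeans.n_iter_} iterations in {time.perf_counter() - start:.1f}s")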

With this setting, discrete GPUs complete within seconds while CPU implementations complete within minutes, with up to 10 times faster compute on discrete GPUs. This is in line with, and even faster than, what the rapids.ai benchmarks with CUDA devices suggest. GPU-backed implementations will noticeably improve interactive productivity with k-means on datasets with similar characteristics, as long as the whole dataset fits into GPU memory (the 50 million × 14 float32 dataset weighs about 2.8 GB, and discrete GPUs could hold up to about 5 times more data than used in our benchmark). Going beyond that dataset size would require carefully loading slices of the dataset back and forth between system memory and device memory, which calls for additional development.

The iGPU performance is not as impressive, but for the cost and accessibility of an iGPU it still offers a decent speed-up over the CPU it is embedded with, up to twice as fast, which has the potential to address a large user base.

As expected, discrete GPU performance is leaps and bounds ahead, with the Max Series GPU, the most performant GPU in the Intel cloud offering, yielding about a 20% speed-up over the Flex Series.

Conclusion and future work

We were really pleased with the performance we got for our k-means implementation and are now discussing moving forward with merging the plugin API into scikit-learn, so that the plugin can be released to a large pool of users. We hope that numba_dpex and its dependencies will become easier to install, either with pip from the default pypi.org repository or with conda from the community-maintained conda-forge channel, so that users are not discouraged by the environment-management issues that can occur with the current installers.

Our next target is implementing a k-nearest neighbors estimator with performance optimized for GPU and good interoperability. Unlike k-means, this estimator does not benefit from fusing kernels, and we intend to explore using the high-level oneAPI-based primitives (topk/partition and matmul/cdist) that are exposed in PyTorch through the Intel Extension for PyTorch. Those kernels are key building blocks of many algorithms, so we expect them to be optimized by specialized teams and to yield a lot more performance than a re-implementation with numba_dpex would. Moreover, using the PyTorch front-end natively will enable interoperability with NVIDIA and AMD devices through the native PyTorch binaries.
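
To illustrate the idea, the brute-force query step of k-nearest neighbors can be written with just those two primitives (a sketch; it assumes a PyTorch build where the Intel Extension for PyTorch registers Intel GPUs under the "xpu" device, while "cuda" would be used on NVIDIA hardware):

    import torch
    # import intel_extension_for_pytorch  # registers the "xpu" device on Intel GPUs

    def kneighbors(X_train, X_query, k, device="xpu"):
        """Brute-force k-nearest neighbors built on cdist + topk (sketch)."""
        X_train = X_train.to(device)
        X_query = X_query.to(device)
        # full pairwise distance matrix between query and training samples
        distances = torch.cdist(X_query, X_train)
        # k smallest distances per query row, with the corresponding indices
        dist_k, idx_k = torch.topk(distances, k, dim=1, largest=False)
        return dist_k, idx_k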


More context

SYCL is an open specification maintained by the Khronos Group, a consortium that also has to its credit pivotal specifications such as OpenCL, OpenGL, WebGL and Vulkan. SYCL is a programming model that aims at making it simple and flexible to write code that dispatches data and tasks on heterogeneous hardware, including CPUs and GPUs from all manufacturers.

Practically speaking, interoperability with devices from all manufacturers is a work in progress. Intel’s bleeding-edge llvm project embeds an open-source implementation of the SYCL specification whose first target has been Intel-branded hardware, but it also contains proofs of concept of interoperability for all major manufacturers, including NVIDIA and AMD, with CUDA and HIP/ROCm backends. The proprietary compiler dpcpp shipped with Intel’s oneAPI Base Toolkit builds on top of this project, and can also be extended with plugins developed by Codeplay Software that extend compatibility to AMD and NVIDIA GPUs. (Other alternative implementations are also under active development, such as OpenSYCL.)

As opposed to the CUDA programming model, which targets NVIDIA devices specifically, developers can use SYCL-based programming front-ends to write their application without specializing the stack for a specific device type or manufacturer, and ensure it will run on any device, provided the low-level runtime drivers are installed on the user’s machine. This appeals both to developers, who can easily target a wider range of users, and to users, who can seamlessly run the same software on whatever compute power they have access to.

Software accessibility is a strong principle in the design philosophy of the scikit-learn library. The scikit-learn project aims at developing and distributing user-friendly software that is easy to use and easy to install. Scikit-learn serves users who expect their software to be installable with a single command, and to then provide a ready-to-use data science environment with the best performance on any computer they have access to. On paper, SYCL-based backends have the potential to unlock performance improvements without adding user-side complexity, because they can detect and use any available device.

(1) Data source: spoken-arabic-digit (augmented)

Configuration details and workload setup: OneAPI 2023.0.0

– Laptop CPU + iGPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz + iGPU TigerLake-LP GT2 + docker 24.0.2 runtime

– Cloud CPU / Flex Series setup: Intel(R) Xeon(R) D-2796NT CPU @ 2.00GHz + Intel® Data Center GPU Flex 170 + conda runtime

– Max Series setup: 4th Generation Intel® Xeon® Scalable processors 0x806f2 + Intel® Data Center GPU Max 1100 + docker 24.0.2 runtime

Reproducible from instructions and code provided at https://github.com/soda-inria/sklearn-numba-dpex/tree/9c17d140620fb4e138faa25ba165f3a4c4954051
