Soda announces Intel oneAPI Center of Excellence to improve the performance of the scikit-learn machine learning library

Faster and More Efficient Machine Learning across Architectures

[29 March, 2022] – Soda announces the establishment of an Intel oneAPI Center of Excellence for scikit-learn acceleration as part of a collaboration with Intel. In the information age, machine learning algorithms must efficiently manage large data sets. This requires scalable algorithms and efficient implementations running on heterogeneous hardware. The Intel oneAPI Center of Excellence will focus on developing hardware-optimized performance boosters for one of the most widely used machine learning libraries: scikit-learn. This extension of scikit-learn will deliver more efficient machine learning by leveraging oneAPI software components. oneAPI is an open, simplified, standards-based programming model for heterogeneous hardware that delivers performance and productivity.

Olivier Grisel, member of the scikit-learn team at Inria

“Heterogeneous computing is inevitable. It happens when a host schedules computational tasks to different processors and accelerators like CPUs and GPUs. This partnership will make scikit-learn more performant and energy-efficient on multi-architecture systems,” says Olivier Grisel, scikit-learn maintainer at Inria.

The problem: hardware-optimized implementations are typically not portable

Scikit-learn provides easy-to-maintain and portable implementations of standard machine learning algorithms with the help of the NumPy and SciPy libraries. When the overhead of the Python interpreter doesn’t allow efficient computation with tight loops, those implementations are typically written in Cython. Scikit-learn maintainers ship binary packages for a variety of platforms and CPU architectures, (Linux, macOS, Windows) × (x86_64, arm64, ppc64le), from a single codebase with no external runtime dependencies beyond Python, NumPy, and SciPy.

Recently, GPU hardware has proven to be very competitive for many machine-learning-related workloads, either from a pure latency standpoint or from the standpoint of an improved computation/energy trade-off. However, hardware-optimized implementations mandate additional runtime dependencies.

The solution: an extension point for pluggable hardware-optimized computational routines

This project aims to extend scikit-learn with user-configurable computational engines for specific compute-intensive routines used by several scikit-learn estimators. It will enable leveraging hardware-specific implementations as an alternative to the default hardware-agnostic NumPy/SciPy/Cython implementation of scikit-learn.

The low-level routines to be implemented as part of this collaboration will boost popular workhorse algorithms such as k-nearest neighbors, k-means, logistic regression, t-SNE, and DBSCAN among others.

In addition to calling the optimized routines, the extension API will make it possible to delegate the input validation checks (such as checking for invalid values in the input data) to the extension package, avoiding unnecessary back-and-forth data transfers between host and device.
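The benefit of delegating validation can be sketched as follows. This is a toy illustration under stated assumptions, not the project's design: `DefaultEngine`, `DeviceEngine`, `fit_column_means`, and the `transfers` counter are hypothetical, and the "device copy" is simulated with an ordinary Python list copy.

```python
import math


class DefaultEngine:
    """Hypothetical host-side engine: validates and computes on the host."""

    def validate(self, X):
        # Reject NaN and infinite values, one kind of invalid input.
        for row in X:
            for value in row:
                if not math.isfinite(value):
                    raise ValueError("Input contains NaN or infinity.")
        return X

    def column_means(self, X):
        n_rows = len(X)
        return [sum(column) / n_rows for column in zip(*X)]


class DeviceEngine(DefaultEngine):
    """Hypothetical extension engine that owns validation as well as compute."""

    def __init__(self):
        self.transfers = 0  # counts simulated host-to-device copies

    def _to_device(self, X):
        # Assumption: stands in for a real host-to-device transfer.
        self.transfers += 1
        return [list(row) for row in X]

    def validate(self, X):
        # Copy once, then validate the device-resident data in place,
        # so the data never travels back to the host for checking.
        return super().validate(self._to_device(X))


def fit_column_means(X, engine):
    """Estimator-side code: validation and compute both go through the engine."""
    X_checked = engine.validate(X)
    return engine.column_means(X_checked)


engine = DeviceEngine()
means = fit_column_means([[1.0, 2.0], [3.0, 4.0]], engine)
```

Because the engine both validates and computes, the data crosses the simulated host/device boundary a single time in this sketch.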

The implementation: relying on oneAPI

The project will develop highly optimized implementations of those routines using oneAPI numba_dppy, DPC++ components wrapped with Cython, or both. The device introspection and scheduling logic will rely on dpctl. This oneAPI implementation will be packaged in an independently managed project potentially co-maintained by scikit-learn core developers, Intel engineers, and other community members interested in this project.

While existing runtime and build-time dependencies will remain unchanged in the main scikit-learn package, new hardware-specific dependencies will be part of specific extension packages installable from PyPI.org via pip (or from conda-forge via conda), alongside the usual scikit-learn package.

“The Intel oneAPI Center of Excellence’s scikit-learn performance optimizations using oneAPI will speed machine learning across heterogeneous systems, including those using multiple vendors’ architectures. This exemplifies the value of oneAPI together with an open ecosystem in accelerating AI beyond proprietary programming limits,” says Wei Li, vice president and general manager of Intel’s Artificial Intelligence and Analytics group.

For more technical details and implementation, review this GitHub issue. The Center of Excellence projects will be distributed under the open-source license used by the scikit-learn project, namely the 3-Clause BSD license. The development of this new extension point will follow the standard community-driven contribution workflow on GitHub.

About scikit-learn
Scikit-learn is the most popular machine learning library (source: Kaggle data science survey 2021) with more than one million unique monthly visitors on the main documentation site and 25 million monthly downloads on PyPI.org. This open-source library provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, via a consistent interface. It is supported by well-written documentation with many examples and a strong community of users, contributors, and instructors.

About oneAPI
oneAPI is an open, unified, and cross-architecture programming model for CPUs and accelerator architectures (GPUs, FPGAs, and others). Based on standards, the programming model simplifies software development and delivers uncompromised performance for accelerated compute without proprietary lock-in, while enabling the integration of legacy code. With oneAPI, developers can choose the best architecture for the specific problem they are trying to solve without rewriting software for the next architecture and platform.

About Soda and Inria
Soda is a world-class research team developing mathematical and computational methods to understand health and society with data at Inria. Inria is the French national research institute for digital science and technology. World-class research, technological innovation and entrepreneurial risk are its DNA. In 200 project teams, most of which are shared with major research universities, more than 3,900 researchers and engineers explore new paths, often in an interdisciplinary manner and in collaboration with industrial partners to meet ambitious challenges. As a technological institute, Inria supports the diversity of innovation pathways: from open source software publishing to the creation of technological startups (Deeptech).

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.