ANR SAFE

Abstract

When applied to communication networks, traditional approaches to control and decision-making require comprehensive knowledge of system and user behaviours, which becomes unrealistic in practice as scale and complexity increase. Data-driven AI approaches do not have this drawback, but offer no safety bounds and are difficult to interpret. The SAFE project aims to design an innovative approach that combines the best of both worlds. In this new approach, intelligence is distributed in the network between a global AI (at the central level) and local AIs (at the edge level), which collaborate by integrating traditional models with graph neural networks and reinforcement learning. The approach, developed for partially or completely observable/controllable environments, will natively integrate safety bounds and interpretability, and provide self-adaptive systems for routing, traffic engineering and scheduling.

Today, computer networks are undergoing important architectural transformations. During the last decade, Software Defined Networking (SDN) introduced the idea of decoupling the control and data planes [Bradai15]. On the system side, key advances include compute-intensive control planes, network virtualisation, data plane programmability, in-network processing with Artificial Intelligence (AI) accelerators, and programmable pipelines [Lee20], all of which enable more agility and intelligence inside networks. Most of this transformation is supported by the long-term vision of self-driving networks, in which automation and intelligence are two essential ingredients. Powerful and programmable devices are fostering the use of advanced Machine Learning (ML) and optimisation techniques to control networks intelligently, for a better user experience and better utilisation of resources.

Looking at current deployments, we see that network and service providers face increasing network complexity combined with the need to support an ever-growing variety of traffic and applications. At the same time, users ask for better Quality of Experience (QoE) with an increasing variety of requirements, especially from new applications. Thus, there is a need for agile, flexible and fully autonomous networks to accommodate a plethora of new services. Costs (OPEX/CAPEX) also need to be contained to stay competitive. Networks should largely manage themselves and deal with issues such as routing, resource allocation, QoE and traffic engineering. This requires new algorithms for decision-making and control.

Much existing research relies on networking domain knowledge to model and optimise networks. However, such methods, for example those based only on optimisation or control theory, require full knowledge of and information about the system, which is not available in practice. In addition, non-linearity, uncertainty and intractability often force simplifications. On the other hand, while data-driven ML approaches have obtained good results, most of them are black boxes: interpreting their results and knowing when they will work or fail is difficult. To get the best of both, we believe that networking knowledge-based approaches and data-driven approaches should work in synergy. Networking knowledge-based models can guide and control the learning of data-driven ML approaches, which in turn can learn about new situations from new data. These models can also help avoid exploring the huge search space of infeasible solutions.

ML has been successful for some classification and monitoring problems, including in our own work [Saffar19]. We plan to go further and solve control and decision-making problems in networking by using ML in synergy with networking knowledge-based approaches. Control and decision-making algorithms are critical for the operation of networks, hence we believe that the solutions should be safety bounded and interpretable. This is a scientific barrier that needs to be lifted: network operators are reluctant to use ML in production networks because of their critical and sensitive nature, as outages and performance degradations can be very costly.

Hierarchical architecture: Assuming modern network architectures, we will design an ML architecture based on a global AI (running at the central controller level) and local AIs (running at the edge device level) for decision-making in partially as well as fully observable and controllable environments. The global AI will be able to control, configure and install policies on the local AIs.
Algorithms for partially observable environments: We will design new safety bounded and interpretable algorithms for intelligent path selection, automatic queueing and scheduling in partially observable and controllable environments. As shown in Figure 1, these methods find use cases in SD-WAN (Software-Defined Wide Area Networks), where edge devices located at customer premises need to operate collaboratively as an overlay on top of uncontrollable and only partially observable core networks.
Algorithms for fully observable environments: We will investigate the application of the global and local AI architecture to fully observable and controllable environments. Specifically, we will design new safety bounded and interpretable algorithms for flow scheduling, software-defined routing and traffic engineering. These find use cases in data centers as well as in private WANs connecting multiple data centers, and perform closed-loop actions to improve network utilisation and optimise network metrics such as QoS and QoE.
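The division of roles between the global and local AI can be illustrated with a minimal Python sketch. It is not the SAFE design: the class names, the `max_delay_ms` bound and the `fallback_path` field are hypothetical, chosen only to show a central controller installing a policy that constrains the decisions taken at the edge.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy pushed by the global AI: a hard delay bound plus a fallback path."""
    max_delay_ms: float
    fallback_path: str


@dataclass
class LocalAgent:
    """Edge-level agent: chooses a path from local measurements, within the installed policy."""
    name: str
    policy: Policy | None = None

    def select_path(self, measured_delay_ms: dict[str, float]) -> str:
        if self.policy is None:
            raise RuntimeError("no policy installed by the global controller")
        # Keep only the paths that respect the safety bound.
        safe = {p: d for p, d in measured_delay_ms.items()
                if d <= self.policy.max_delay_ms}
        if not safe:
            # No path satisfies the bound: fall back to the path named in the policy.
            return self.policy.fallback_path
        # Among safe paths, pick the one with the lowest measured delay.
        return min(safe, key=safe.get)


@dataclass
class GlobalController:
    """Central-level controller: installs the same policy on every local agent."""
    agents: list[LocalAgent] = field(default_factory=list)

    def install(self, policy: Policy) -> None:
        for agent in self.agents:
            agent.policy = policy


edge = LocalAgent("cpe-1")
controller = GlobalController([edge])
controller.install(Policy(max_delay_ms=50.0, fallback_path="mpls"))
choice_ok = edge.select_path({"mpls": 40.0, "internet": 20.0})
choice_degraded = edge.select_path({"mpls": 80.0, "internet": 90.0})
```

In this sketch the local decision rule is a simple minimum; a learned policy could replace it, as long as its output is still filtered by the bound installed from the central level.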

AI is a large field, of which ML is just one branch. ML approaches can learn about new situations from new data. We focus here on the ML approaches that SAFE will consider for global and local AI to solve problems such as scheduling, path selection and traffic engineering.

Reinforcement Learning (RL): RL-based approaches learn an optimal policy by interacting with the environment. Deep Reinforcement Learning (DRL) extends RL with neural networks to enhance the learning capabilities of intelligent agents. Such work includes projects such as “6G Brains” (H2020-101017226) for partially observable environments. DRL has also been used for traffic engineering in SD-WAN. In the domain of fully observable and controllable environments, DRL has been applied inside data center networks.
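As a minimal illustration of learning a policy by interaction, the Python sketch below uses a stateless, bandit-style simplification of Q-learning for path selection. It is not DRL and not a SAFE algorithm: the path names, the delay values and the Gaussian noise model are assumptions made purely for the example.

```python
import random


def learn_path_values(mean_delay_ms, episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Epsilon-greedy value learning over paths; the reward is the negative observed delay."""
    rng = random.Random(seed)
    q = {path: 0.0 for path in mean_delay_ms}
    for _ in range(episodes):
        if rng.random() < epsilon:
            path = rng.choice(list(q))      # explore: try a random path
        else:
            path = max(q, key=q.get)        # exploit: use the best current estimate
        # Simulated environment (assumption): mean delay plus unit-variance Gaussian noise.
        observed_delay = mean_delay_ms[path] + rng.gauss(0.0, 1.0)
        reward = -observed_delay
        # Incremental update of the value estimate towards the observed reward.
        q[path] += alpha * (reward - q[path])
    return q


q_values = learn_path_values({"mpls": 30.0, "internet": 20.0, "lte": 60.0})
best_path = max(q_values, key=q_values.get)
```

The agent is never given the mean delays directly; it only sees noisy rewards, which is the setting that makes exploration necessary.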

One problem is that such approaches do not consider different network metrics or the graph nature of the overall network topology. Both are required for global planning, routing and traffic engineering problems, such as inter data center connectivity over private WANs. These problems are combinatorial in nature, with increasing scale and complexity. Metrics such as QoE, link delays, end-to-end delay over multiple hops and losses involve non-linearities, and estimating them accurately with traditional approaches, without the help of ML, can be hard. Traffic matrices are also complex to estimate. In our previous contributions, we used ML-based QoE estimation and combined it with optimisation, but with simple delay and loss models. Thus, the first scientific challenge will be to account for such non-linear network characteristics, as well as the graph nature of networking, using data-driven approaches. In the following, we discuss an appropriate approach.

Graph neural networks (GNNs): Because they do not account for the graph-based topology of networks, initial ML-based routing approaches [Valadarsky17] may not outperform models based on networking domain knowledge. To move forward, emerging ML techniques such as graph representation learning and GNNs are expected to perform well on many networking problems. Indeed, GNNs can capture key properties of the topology and network conditions. Importantly, learnt GNN models can generalise to new topologies. In addition, other approaches such as RL can be combined with GNNs. In a previous contribution, we explored the capability of GNNs to solve resource allocation in networking. SAFE will study GNNs to account for non-linear network metrics and combine them with optimisation to cope with the combinatorial nature of routing and traffic engineering. SAFE will extend existing GNN-based approaches by (i) integrating extensive network characteristics (different link capacities, priorities, etc.), which are lacking in existing work that assumes, for instance, homogeneous links, and (ii) combining DRL with GNNs to handle the dynamic nature of routing and traffic engineering. System dynamics can be considered using a model predictive control approach, but designing an accurate traffic predictor is a challenging task.
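The core operation of a GNN, message passing over the topology, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the fixed scalar weights `w_self` and `w_neigh` stand in for the learned weight matrices of a real GNN layer, and the node features are arbitrary example values.

```python
def message_passing_round(features, edges, w_self=0.5, w_neigh=0.5):
    """One round of mean-aggregation message passing on an undirected graph.

    features: dict mapping node -> list of floats (all of the same length)
    edges:    list of (u, v) pairs
    """
    neighbours = {node: [] for node in features}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    updated = {}
    for node, h in features.items():
        neigh = neighbours[node]
        if neigh:
            # Average the neighbours' feature vectors, component by component.
            mean = [sum(features[m][i] for m in neigh) / len(neigh)
                    for i in range(len(h))]
        else:
            mean = [0.0] * len(h)
        # Combine own state and aggregated messages, then apply a ReLU.
        updated[node] = [max(0.0, w_self * a + w_neigh * b)
                         for a, b in zip(h, mean)]
    return updated


# Three routers in a line: a - b - c, with one feature per node (e.g. a load value).
node_features = {"a": [1.0], "b": [3.0], "c": [5.0]}
links = [("a", "b"), ("b", "c")]
new_features = message_passing_round(node_features, links)
```

Because the same update rule and the same weights are applied at every node, the function runs unchanged on any graph, which is the property behind the generalisation to new topologies mentioned above.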

Index Terms

Reinforcement Learning.
SDN, 5G, Internet, CPE.
Resource Allocation.
Cloud, Software-Defined Wide Area Networks (SD-WAN).
Safety bounded and interpretable.
Quality of Service and Quality of Experience.
