\chapter{Introduction}
\label{cha:introduction}
Learning the fundamental behaviour of networks from packet traces is an extremely hard problem. While machine learning (ML) algorithms have proven effective at learning from raw data, adapting them to the general network domain has been so difficult that the community rarely attempts it. In this project, we argue that all is not lost: with specific ML architectures such as the Transformer, it is indeed possible to develop methods that learn from such data in a general manner.
\section{Motivation}
\label{sec:motivation}
Modelling network dynamics is a \emph{sequence modelling} problem. From a sequence of past packets, the goal is to estimate the current state of the network (\eg is there congestion? will the packet be dropped?), then to predict how this state evolves and what happens to future traffic. Concretely, this also lets us decide which action to take next, \eg should the next packet be sent on a different path? Given the successes of ML at learning from data, it is becoming increasingly popular to apply such algorithms to this modelling problem, but the task is notoriously complex. There has been some success in using ML for specific networking applications, including congestion control\cite{classic,jayDeepReinforcementLearning2019,dynamic,exmachina},
video streaming\cite{oboe,maoNeuralAdaptiveVideo2017,puffer},
traffic optimization\cite{auto},
routing\cite{learnroute},
flow size prediction\cite{flow,onlineflow},
MAC protocol optimization\cite{oneproto,heterowire},
and network simulation\cite{zhangMimicNetFastPerformance2021}. However, a good framework for general-purpose learning on network data still does not exist.
Today's ML models are trained for specific tasks and do not generalize well; \ie they often fail to deliver outside of their original training environments\cite{puffer, datadriven, blackbox}. As a result, generalizing across different tasks is not even considered. Recent work argues that, rather than hoping for generalization, one obtains better results by training in situ, \ie using data collected in the deployment environment\cite{puffer}.
Today, we tend to design and train models from scratch using model-specific datasets~(Figure \ref{fig:vision}, top). This process is arduous, expensive, repetitive, and time-consuming: every training effort starts over from scratch and never exploits objectives shared across tasks. Moreover, the growing resources required to even attempt training such models are increasing inequalities in networking research and, ultimately, hindering collective progress.
In other fields, ML algorithms (especially certain deep learning architectures) have shown strong generalization capabilities\cite{generalizingdnn}: an initial \emph{pre-training} phase trains a model on a large dataset in a task-agnostic manner, and a subsequent \emph{fine-tuning} phase refines it on smaller, task-specific datasets. The general pre-trained model can thus be reused across multiple tasks, saving both resources and time. This kind of transfer learning or generalization\cite{transferng} works because pre-training learns the overall structure of the data, while fine-tuning focuses on task-specific features. As long as the data's structure is sufficiently similar across pre-training and fine-tuning, this approach can be extremely effective.
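To make the two phases concrete, the sketch below shows the pre-train-then-fine-tune workflow in PyTorch. Everything in it (the tiny linear ``backbone'', the random placeholder data, the loss choices) is a hypothetical illustration of the pattern, not our actual model:
\begin{verbatim}
import torch
import torch.nn as nn

# Toy stand-ins: a real backbone would be a Transformer encoder.
backbone = nn.Sequential(nn.Linear(8, 32), nn.ReLU())
pretrain_head = nn.Linear(32, 8)  # generic objective, e.g. reconstruction
task_head = nn.Linear(32, 1)      # task output, e.g. delay prediction

# Phase 1: task-agnostic pre-training on a (placeholder) large dataset.
opt = torch.optim.Adam(list(backbone.parameters())
                       + list(pretrain_head.parameters()))
for _ in range(100):                  # stands in for many batches
    x = torch.randn(64, 8)            # placeholder "generic" samples
    loss = nn.functional.mse_loss(pretrain_head(backbone(x)), x)
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: reuse the pre-trained backbone; fine-tune a fresh head
# on a (placeholder) small task-specific dataset.
opt = torch.optim.Adam(list(backbone.parameters())
                       + list(task_head.parameters()))
for _ in range(10):                   # far fewer batches needed
    x, y = torch.randn(64, 8), torch.randn(64, 1)
    loss = nn.functional.mse_loss(task_head(backbone(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()
\end{verbatim}
The key point is that the expensive first loop runs once, while the cheap second loop can be repeated for each new task.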
\section{Tasks, Goals and Challenges}
\label{sec:task}
Inspired by ML models that generalize on data in several other fields, it should be possible to design a similar model that learns and generalizes in networking. Even though networking contexts (topology, network configuration, traffic, etc.) can be very diverse, the underlying dynamics of networks remain essentially the same; \eg when buffers fill up, queuing disciplines delay or drop packets. These dynamics can be learned with ML and should generalize, so it should not be necessary to re-learn this fundamental behaviour every time a new model is trained. Building such a generic model for network data is challenging, but the effort would benefit the entire community. Starting from such a model, one would only need to collect a small task-specific dataset to fine-tune it (Figure \ref{fig:vision}, bottom), assuming the pre-trained model generalizes well. This could even allow modelling rare events (\eg drops after link failures) for which, due to their infrequency, only little data is available in today's real network traces.
\begin{figure}
\centering
\includegraphics[scale=1.3]{figures/vision}
\caption{Can we collectively learn general network traffic dynamics \emph{once} and focus on task-specific data collection and learning for \emph{many future models}? Credits: Alexander Dietmüller}
\label{fig:vision}
\end{figure}
While research shows some generalization in specific networking contexts\cite{jayDeepReinforcementLearning2019}, truly ``generic'' models, able to perform well on a wide range of tasks and networks, remain unavailable. One reason is that we usually train only on small, task-specific datasets rather than on datasets large enough to allow generalization. For a long time, large-scale sequence modelling was infeasible even with dedicated architectures such as recurrent neural networks (RNNs), which can only handle short sequences and are inefficient to train\cite{factor}. However, a few years ago, a new architecture was proposed: the \emph{Transformer}\cite{vaswaniAttentionAllYou2017}, which proved ground-breaking for sequence modelling. It is designed to train efficiently, enabling learning from massive datasets and unprecedented generalization. In a \emph{pre-training phase}, the Transformer learns sequential ``structures'', \eg the structure of a language from a large corpus of texts. Then, in a much quicker \emph{fine-tuning phase}, the final stages of the model are adapted to a specific prediction task (\eg text sentiment analysis). Today, Transformers are among the state of the art in natural language processing (NLP)\cite{recentnlp} and computer vision (CV)\cite{cvsurvey}.
The generalization power of the Transformer stems from its ability to learn ``contextual information'': for a given element in a sequence, it uses context from the neighbouring elements of the same sequence\cite{devlinBERTPretrainingDeep2019}.\footnote{Consider the word \emph{left} in two different contexts: I \emph{left} my book on the table. Turn \emph{left} at the next crossing. The Transformer outputs for the word \emph{left} differ between the two sequences, as they encode the word's context.}
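To illustrate where this contextual behaviour comes from, the following NumPy sketch implements scaled dot-product self-attention, the core operation of the Transformer\cite{vaswaniAttentionAllYou2017}. The toy dimensions and random weights are placeholder assumptions; only the formula follows the original paper:
\begin{verbatim}
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X (n x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise element similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ V   # each output mixes the *whole* sequence

rng = np.random.default_rng(0)
d = 4                                 # toy embedding dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
X = rng.normal(size=(3, d))           # a sequence of 3 element embeddings

out = self_attention(X, Wq, Wk, Wv)
# out[0] depends on all of X: changing X[2] changes out[0] as well,
# which is exactly the contextual behaviour described above.
\end{verbatim}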
We can draw parallels between networking and NLP. In isolation, packet metadata (headers, etc.) provides limited insight into the network state; we also need the \emph{context}, which we can get from the recent packet history.\footnote{Increasing latency over history indicates congestion.} Based on these parallels, we propose that a Transformer-based architecture can also be designed to generalize on network packet data, as sketched below.
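As a purely hypothetical illustration of this parallel (the feature choice is ours; the actual NTT input design is described in Chapter \ref{cha:design}), a packet trace can be turned into a sequence of per-packet feature vectors, analogous to a sentence of word embeddings:
\begin{verbatim}
import numpy as np

# Hypothetical per-packet features:
# [inter-arrival time (s), size (bytes), observed delay (s)].
trace = np.array([
    [0.000, 1500, 0.010],
    [0.002, 1500, 0.015],
    [0.001,  500, 0.023],  # delay rising over history hints at congestion
])

# Normalising each feature column yields a sequence of "packet
# embeddings" that a sequence model can consume, much like word
# embeddings in NLP.
embeddings = (trace - trace.mean(axis=0)) / (trace.std(axis=0) + 1e-9)
print(embeddings.shape)  # (sequence length, features) = (3, 3)
\end{verbatim}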
Naively transposing NLP or CV Transformers to networking fails, as the fundamental structure and biases\cite{biases} in the data are different. Generalizing over the complex interactions in networks is not a trivial problem. We expect the following challenges for our Transformer design:
\begin{itemize}
\item
How do we adapt Transformers for learning on networking data?
\item
How do we assemble a dataset large and diverse enough to allow useful generalization?
\item
Which pre-training task would allow the model to generalize, and how far can we push generalization?
\item
How do we scale such a Transformer to arbitrarily large amounts of network data from extremely diverse environments?
\end{itemize}
\section{Overview}
\label{sec:overview}
In this thesis, we present the Network Traffic Transformer (NTT), a first step towards designing a Transformer model that learns from network packet data. We make the following main technical contributions:
\begin{itemize}
\item We present the required background on Transformers, which guides our design, in Chapter \ref{cha:background}.
\item We present the detailed architectural design behind our proof-of-concept NTT in Chapter \ref{cha:design}.
\item We present a detailed evaluation of pre-training and fine-tuning our first NTT models in Chapter \ref{cha:evaluation}.
\item We present several future research directions to improve our NTT in Chapter \ref{cha:outlook}.
\item We summarise our work and provide concluding remarks in Chapter \ref{cha:summary}.
\item Supplementary technical details and supporting results are presented in Appendices \ref{app:a}, \ref{app:b} and \ref{app:c}.
\end{itemize}
Part of the work conducted for this thesis has been submitted as a paper\cite{newhope} to HotNets~'22; consequently, parts of this thesis build upon the writing of that paper.