Early TCP Flow Length Regression with Machine Learning

TL;DR: We predict TCP flow length early from only the first 400 ms of packets using an LSTM ensemble with Mixture Density Network regression, nearly halving mean absolute error on CAIDA and MAWI traces and enabling proactive SDN traffic engineering.

  • Tokenisation aggregates TCP/IP packets into fixed 10 ms bins, preserving temporal dynamics with low overhead.
  • LSTM ensemble captures complementary temporal patterns across different flow-length regimes.
  • Mixture Density Network outputs a probabilistic flow-length distribution, modelling uncertainty and multi-modal behaviour.
  • Early prediction supports latency-aware routing and scheduling decisions before congestion develops.

Overview of our early TCP flow-length prediction framework. A Flow-Length Prediction Server receives tokenised TCP flows, processes them with an LSTM ensemble and Mixture Density Network (MDN), and returns probabilistic flow-length estimates to the SDN controller for proactive routing and scheduling.

Abstract

We present a scalable data-driven machine learning approach for early and continuous TCP flow-length prediction, enabling Software-Defined Networking (SDN) controllers to make proactive, latency-aware routing decisions. Conventional elephant-flow versus mice-flow classification relies on static thresholds and several seconds of observation, delaying control decisions and degrading performance. In contrast, our method performs regression-based estimation using only the first 400 ms of traffic. TCP/IP packets are tokenised into fixed 10 ms bins, preserving temporal dynamics while limiting monitoring overhead. An ensemble of Long Short-Term Memory (LSTM) networks extracts temporal features, which are fused by a Mixture Density Network (MDN) that outputs a full conditional distribution over flow length. Experiments on real-world CAIDA and MAWI traces show that the proposed framework achieves a mean absolute error of around 1.74 s, nearly halving the error of state-of-the-art baselines, while remaining computationally lightweight enough for in-network deployment.

Method Overview

Flows are tokenised into fixed 10 ms time bins and processed by an ensemble of LSTMs that specialise in different flow-length time domains. Their concatenated features feed a Mixture Density Network (MDN), which regresses the full flow-length distribution for each flow.

Architecture of tokenised LSTM ensemble and MDN for flow-length prediction

Our framework predicts final TCP flow length using only early packets while remaining scalable to backbone traffic. It comprises a tokeniser, an LSTM ensemble for feature extraction, and a Mixture Density Network for probabilistic regression.

1. TCP flow tokenisation

Raw packet traces are decoded into TCP flows and aggregated into fixed-size 10 ms time bins. This tokenisation preserves temporal dynamics while regularising irregular arrival patterns and reducing data volume, exposing each flow as a short sequence of compact feature vectors.
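As a concrete illustration, the binning step can be sketched in a few lines. The per-bin features (packet count, byte volume) and the helper name `tokenise` are our assumptions for illustration; the paper's exact feature set may differ.

```python
BIN_MS = 10  # fixed bin width used throughout the paper

def tokenise(packets, window_ms=400):
    """Aggregate one flow's (timestamp_ms, size_bytes) packets into
    fixed 10 ms bins covering the first `window_ms` of the flow.

    Returns a list of (packet_count, byte_count) tokens, one per bin.
    """
    if not packets:
        return []
    t0 = min(t for t, _ in packets)          # flow start time
    n_bins = window_ms // BIN_MS
    bins = [[0, 0] for _ in range(n_bins)]
    for t, size in packets:
        idx = int((t - t0) // BIN_MS)
        if 0 <= idx < n_bins:                # drop packets past the window
            bins[idx][0] += 1
            bins[idx][1] += size
    return [tuple(b) for b in bins]

# toy flow: three packets land in the first bin, one in the third
flow = [(0.0, 1500), (2.5, 1500), (9.9, 40), (25.0, 1500)]
tokens = tokenise(flow)
print(tokens[0], tokens[2], len(tokens))  # (3, 3040) (1, 1500) 40
```

Empty bins are kept as zero tokens so every 400 ms window yields a fixed-length sequence of 40 tokens, which is what a recurrent model expects.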

2. LSTM ensemble feature extractor

To capture diverse temporal characteristics across flow-length ranges, we partition flows into stratified temporal domains and train an ensemble of LSTMs, each specialising in a distinct regime. During inference, the tokenised sequence is passed through all LSTMs; their hidden representations are concatenated into a single feature vector, providing a rich description of both short- and long-term behaviour.
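The ensemble idea can be sketched in bare numpy: each member runs an independent LSTM forward pass over the token sequence, and the final hidden states are concatenated. Dimensions, initialisation, and the untrained weights below are toy assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_hidden(seq, params):
    """Run a single-layer LSTM over `seq` (T x d) and return the final
    hidden state. `params` is (W, b) with W of shape (4h, d+h) and b of
    shape (4h,); gate order i, f, g, o. A bare-numpy sketch only.
    """
    W, b = params
    h_dim = b.shape[0] // 4
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    for x in seq:
        z = W @ np.concatenate([x, h]) + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # cell state update
        h = o * np.tanh(c)                   # hidden state update
    return h

def ensemble_features(seq, ensemble):
    """Concatenate the final hidden states of all ensemble members."""
    return np.concatenate([lstm_final_hidden(seq, p) for p in ensemble])

rng = np.random.default_rng(0)
d, h, K = 2, 8, 3  # token dim, hidden size, ensemble size (toy values)
ensemble = [(rng.normal(scale=0.1, size=(4 * h, d + h)), np.zeros(4 * h))
            for _ in range(K)]
seq = rng.normal(size=(40, d))  # 40 tokens = 400 ms of 10 ms bins
feat = ensemble_features(seq, ensemble)
print(feat.shape)  # (24,)
```

In the real system each member is trained on a different flow-length time domain, so the concatenated vector mixes features tuned to short, medium, and long flows.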

3. Mixture Density Network regression

The concatenated features are fed to a Mixture Density Network that outputs the parameters of a Gaussian mixture over flow length. This probabilistic regression captures uncertainty and possible multi-modal futures, avoiding the limitations of single-point estimates from conventional regression models.
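The mixture head can be made concrete as follows: the raw network outputs are mapped to valid Gaussian-mixture parameters, and the mixture's mean and variance then give a point estimate with an attached uncertainty. The two-component split and parameterisation here are illustrative assumptions.

```python
import numpy as np

def mdn_params(raw):
    """Map raw MDN head outputs to valid mixture parameters:
    softmax mixing weights, unconstrained means, positive std devs."""
    logits, mu, log_sigma = np.split(raw, 3)
    pi = np.exp(logits - logits.max())  # numerically stable softmax
    pi /= pi.sum()
    return pi, mu, np.exp(log_sigma)

def mixture_mean_var(pi, mu, sigma):
    """Mean and variance of the Gaussian mixture: a point estimate plus
    an uncertainty measure from the full predictive distribution."""
    mean = np.sum(pi * mu)
    var = np.sum(pi * (sigma**2 + mu**2)) - mean**2
    return mean, var

# toy 2-component head output (illustrative values, not a trained model)
raw = np.array([0.0, 0.0, 1.0, 5.0, np.log(0.5), np.log(1.0)])
pi, mu, sigma = mdn_params(raw)
mean, var = mixture_mean_var(pi, mu, sigma)
print(round(mean, 2))  # 3.0
```

The variance term makes multi-modality visible: two well-separated components with equal weight yield a large variance even when each component alone is confident.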

4. Early, proactive traffic engineering

By operating on only the first 400 ms of each flow, our model provides early flow-length predictions suitable for SDN controllers. On CAIDA and MAWI datasets the approach substantially reduces mean absolute error compared with linear regression, Bayesian regression, random forest, and SGD-based baselines, enabling proactive routing and scheduling decisions that reduce congestion, packet loss, and latency.
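One way a controller could consume the predictive distribution (an illustrative policy of ours, not one specified in the paper) is to compute the tail probability that a flow will exceed an elephant-flow threshold and reroute when that probability is high, rather than thresholding a single point estimate.

```python
import math

def prob_exceeds(pi, mu, sigma, threshold):
    """P(flow length > threshold) under a Gaussian mixture, via the
    normal CDF expressed with erf."""
    p = 0.0
    for w, m, s in zip(pi, mu, sigma):
        p += w * 0.5 * (1.0 - math.erf((threshold - m) / (s * math.sqrt(2))))
    return p

# toy bimodal prediction: probably a mouse, possibly an elephant
pi, mu, sigma = [0.5, 0.5], [0.5, 10.0], [0.2, 2.0]
p_long = prob_exceeds(pi, mu, sigma, threshold=5.0)
reroute = p_long > 0.4  # hypothetical controller policy
print(round(p_long, 2), reroute)  # 0.5 True
```

A point estimate of the same mixture (mean ≈ 5.25 s) would sit right at the threshold; the tail probability makes the genuine 50/50 ambiguity explicit.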

Main Results

We evaluate our LSTM–MDN framework on CAIDA and MAWI traces, comparing against linear regression, Bayesian regression, random forest, and SGD-based baselines. We vary the observation window from 400 ms to 1200 ms to study the impact of additional early traffic information.

The tables below are schematic summaries corresponding to Table I and Figures 3–5 in the paper; please refer to the paper for the exact numerical values.

Table I – Mean Absolute Error (MAE) vs. Baselines

Dataset  Method                 MAE (s) at observation window        Best (↓)
                                400 ms     800 ms     1200 ms
MAWI     LSTM–MDN (ours)        ≈1.74      ≈1.70      ≈1.66         1.66
         SGD Regression         ≈5.32      ≈5.33      ≈5.34         ≈5.32
         Random Forest          ≈3.32      ≈3.33      ≈3.40         ≈3.32
         Linear Regression      ≈5.27      ≈5.24      ≈4.99         ≈4.99
         Bayesian Regression    ≈5.00      ≈4.90      ≈4.73         ≈4.73
CAIDA    LSTM–MDN (ours)        ≈1.78      ≈1.75      ≈1.69         1.69
         SGD Regression         ≈5.56      ≈5.45      ≈5.22         ≈5.22
         Random Forest          ≈3.89      ≈3.91      ≈3.97         ≈3.89
         Linear Regression      ≈5.67      ≈5.51      ≈5.23         ≈5.23
         Bayesian Regression    ≈5.31      ≈5.14      ≈4.90         ≈4.90

Our LSTM–MDN method consistently outperforms all baselines across both datasets and all observation windows, with particularly strong gains at 400 ms where early information is most scarce.

Figure 3 – LSTM–MDN vs. NN Regression

Flow prediction time and mean absolute error as a function of observation window for the LSTM feature extractor combined with either a standard neural network (NN) or an MDN regressor.

20-LSTM feature extractor with NN vs MDN regression across MAWI and CAIDA datasets

Figure 4 – Varying Ensemble Size

Flow prediction time versus ensemble size for the LSTM–MDN framework on MAWI, demonstrating improved accuracy and stability as the number of LSTMs increases up to twenty.

Flow prediction time using LSTM–MDN regression as the number of LSTM layers varies

Figure 5 – MCC during Pretraining

Average Matthews correlation coefficient (MCC) for the LSTM ensemble classification pretraining task as a function of observation window, showing improved discrimination of flow-length time domains with longer windows.

Average MCC values during LSTM ensemble classification pretraining across MAWI and CAIDA datasets
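For reference, the MCC reported above is computed from binary confusion counts as follows; the toy counts in the example are illustrative, not values from the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts,
    the metric used to assess the ensemble's pretraining classifier.
    Returns 0.0 when any marginal is empty (undefined denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

score = mcc(tp=40, tn=45, fp=5, fn=10)  # toy confusion counts
print(round(score, 3))
```

Unlike accuracy, MCC stays informative when flow-length classes are heavily imbalanced, which is the norm in backbone traces dominated by mice flows.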

Practical Deployment and Scalability

Beyond offline evaluation, we study whether the proposed LSTM–MDN framework is practical for in-network deployment. Experiments are conducted on a workstation equipped with an NVIDIA Quadro RTX 5000 GPU and an Intel Xeon Silver 4210R CPU. Under this setup, the model processes the first 400 ms of each TCP flow in approximately 3 ms, including tokenisation, LSTM inference, and MDN regression. This low end-to-end latency allows flow-length predictions to be generated well within the timescale required for SDN control decisions.

The framework is designed to operate under realistic SDN deployment constraints. Packets are aggregated into fixed 10 ms bins at switches or mirror ports and forwarded as compact tokenised flows to a centralised Flow-Length Prediction Server. Each token contains only a small set of integer-valued features (for example, packet counts and byte volumes), resulting in an approximate telemetry footprint of a few hundred bytes per flow. This is substantially smaller than raw NetFlow or IPFIX records, which typically require several kilobytes per flow and often arrive too late to inform proactive control.
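A minimal sketch of how a "few hundred bytes per flow" footprint could arise, assuming a hypothetical 6-byte wire format per token (the field widths and `TOKEN_FMT` layout are our assumptions, not a specification from the paper):

```python
import struct

# Hypothetical wire format: per-token packet count (uint16) and byte
# volume (uint32), network byte order, no padding.
TOKEN_FMT = "!HI"

def encode_flow(tokens):
    """Serialise one flow's tokens for export to the prediction server."""
    return b"".join(struct.pack(TOKEN_FMT, pkts, byts)
                    for pkts, byts in tokens)

# 40 bins = 400 ms window; mostly-empty bins are still encoded
tokens = [(3, 3040), (0, 0), (1, 1500)] + [(0, 0)] * 37
payload = encode_flow(tokens)
print(len(payload))  # 240 bytes per flow
```

At 6 bytes per bin, a full 400 ms window costs 240 bytes per flow, comfortably in the "few hundred bytes" regime and well below typical multi-kilobyte NetFlow/IPFIX records.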

At the server, tokenised flows are batched and processed on commodity GPUs, enabling predictions for on the order of 10⁴ flows per second while keeping computational overhead predictable. Because aggregation and feature extraction occur close to the data plane, the approach scales with link rate rather than packet rate, and remains robust under high-speed backbone conditions.

These properties make the LSTM–MDN framework suitable for early and continuous flow-length prediction in production SDN deployments. By providing accurate estimates within the first few hundred milliseconds of a flow's lifetime, SDN controllers can proactively reroute incipient elephant flows, protect short latency-sensitive flows from interference, and reduce packet loss and queueing delay. As a result, flow-length prediction becomes a practical building block for fine-grained traffic engineering rather than a purely offline analysis tool.

Flow-Length Prediction Server

Illustration of the Flow-Length Prediction Server integrated with the SDN controller. Tokenised TCP flows are forwarded from the data plane, processed by the LSTM–MDN model, and the resulting flow-length predictions are returned as telemetry for proactive SDN control decisions.

Flow-Length Prediction Server and SDN controller integration

BibTeX

@inproceedings{Orme2026TCPFlowLength,
  author    = {Anthony Orme and Anthony Adeyemi-Ejeye and Andrew Gilbert},
  title     = {Early TCP Flow Length Regression with Machine Learning},
  booktitle = {Proc. IEEE Wireless Communications and Networking Conference (WCNC), Track 4: Emerging Technologies, Network Architectures, and Applications},
  year      = {2026},
  month     = {April}
}