ArXiv TLDR

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

arXiv:2604.21930

Nicolae Filat, Ahmed Hussain, Konstantinos Kalogiannis, Elena Burceanu

cs.LG

TLDR

Temporal taskification in streaming continual learning is a critical evaluation variable, as different splits of the same stream can drastically alter benchmark conclusions.

Key contributions

  • Identifies temporal taskification, not just the learner or the data stream, as a structural component of streaming CL evaluation.
  • Shows that different valid splits of the same stream can lead to different benchmark conclusions.
  • Introduces a taskification-level framework with plasticity/stability profiles, a profile distance between taskifications, and Boundary-Profile Sensitivity (BPS).
  • Demonstrates that varying the taskification alone materially alters forecasting error, forgetting, and backward transfer.

Why it matters

This paper uncovers a critical, overlooked source of instability in streaming continual learning evaluations: how data streams are temporally taskified. It demonstrates that benchmark conclusions depend heavily on this preprocessing step, challenging current practices. This motivates treating temporal taskification as a first-class, explicitly reported evaluation variable, enabling more robust and comparable CL research.

Original Abstract

Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions. To study this effect, we introduce a taskification-level framework based on plasticity and stability profiles, a profile distance between taskifications, and Boundary-Profile Sensitivity (BPS), which diagnoses how strongly small boundary perturbations alter the induced regime before any CL model is trained. We evaluate continual finetuning, Experience Replay, Elastic Weight Consolidation, and Learning without Forgetting on network traffic forecasting with CESNET-Timeseries24, keeping the stream, model, and training budget fixed while varying only the temporal taskification. Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation. We further find that shorter taskifications induce noisier distribution-level patterns, larger structural distances, and higher BPS, indicating greater sensitivity to boundary perturbations. These results show that benchmark conclusions in streaming CL depend not only on the learner and the data stream, but also on how that stream is taskified, motivating temporal taskification as a first-class evaluation variable.
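To make the core idea concrete, here is a minimal illustrative sketch of temporal taskification: the same stream is partitioned into tasks of different lengths, each task is summarized by a toy distribution profile, and a crude distance compares the two resulting splits. The `task_profile` (mean/std) and `profile_distance` (padded L1) definitions below are stand-in assumptions, not the paper's actual plasticity/stability profiles or profile distance, which are not specified in this summary.

```python
from statistics import mean, stdev

def taskify(stream, task_len):
    """Partition a daily stream into consecutive tasks of task_len days
    (the last task may be shorter)."""
    return [stream[i:i + task_len] for i in range(0, len(stream), task_len)]

def task_profile(task):
    """Toy per-task distribution summary (mean, std); a placeholder for
    the paper's plasticity/stability profiles."""
    return (mean(task), stdev(task) if len(task) > 1 else 0.0)

def profile_distance(split_a, split_b):
    """Crude L1 distance between per-task profiles, zero-padded to equal
    length; an illustrative stand-in for the paper's profile distance."""
    pa = [task_profile(t) for t in split_a]
    pb = [task_profile(t) for t in split_b]
    n = max(len(pa), len(pb))
    pa += [(0.0, 0.0)] * (n - len(pa))
    pb += [(0.0, 0.0)] * (n - len(pb))
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(pa, pb))

# 90 days of synthetic "traffic" with a level shift at day 45
stream = [10.0] * 45 + [20.0] * 45
short = taskify(stream, 9)    # 10 short tasks
long = taskify(stream, 45)    # 2 tasks, boundaries aligned with the shift
print(len(short), len(long), profile_distance(short, long))
```

Even in this toy setting, both splits are "valid" partitions of the identical stream, yet they induce structurally different task sequences; the paper's BPS statistic extends this idea by measuring how much the induced profiles change under small boundary perturbations.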
