TSRBENCH: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

Overview

Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve complex problems. However, current benchmarks for generalist models largely overlook this dimension.

To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: (i) a diverse set of 4,125 problems from 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and (ii) 15 tasks evaluating essential reasoning capabilities (e.g., numerical reasoning, causal discovery, abductive reasoning).

TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models.

Figure 1. Overview of TSRBench. — **Figure 1.** Overview of TSRBench. TSRBench evaluates generalist models across four core capabilities: Perception, Reasoning, Prediction, and Decision-Making.

TSRBench is designed with several key features that set it apart from existing benchmarks: (i) Comprehensive Coverage. TSRBench spans 4 major dimensions with 15 carefully designed tasks covering fundamental to advanced time series reasoning capabilities. (ii) Multi-modal Evaluation. TSRBench supports both textual and visual representations of time series data, enabling evaluation of models' ability to process and fuse information from different modalities. (iii) Diverse Real-world Domains. With data from 13 domains including energy, traffic, finance, and healthcare, TSRBench ensures evaluation on practical, real-world scenarios.

Benchmark Design

Four Key Perspectives

1. Time Series Perception: This dimension evaluates the ability to extract meaningful patterns and insights from time series data, covering four tasks: (i) Pattern Analysis, (ii) Noise Understanding, (iii) Anomaly Detection, and (iv) Similarity Analysis.

2. Time Series Reasoning: This dimension evaluates the ability to derive conclusions from temporal patterns and prior knowledge, covering seven tasks: (i) Etiological Reasoning, (ii) Causal Discovery, (iii) Abductive Reasoning, (iv) Temporal Relation Reasoning, (v) Numerical Reasoning, (vi) Deductive Reasoning, and (vii) Inductive Reasoning.

3. Time Series Prediction: We evaluate predictive capabilities through two tasks: (i) Time Series Forecasting and (ii) Event Prediction.

4. Time Series Decision-Making: This dimension assesses the ability of models to make decisions based on the understanding of both time series and context. We evaluate this through two aspects: (i) Qualitative Decision-Making and (ii) Quantitative Decision-Making.

Figure 2. Statistics of tasks in TSRBench. — **Figure 2.** Statistics of tasks in TSRBench.

Leaderboard

Model performance on TSRBench across 15 tasks in 4 dimensions. For proprietary models, "(T)" denotes textual time series, "(V)" denotes visualized time series, and "(T+V)" denotes both modalities.

Key Findings

Through extensive experiments evaluating over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs, we uncovered several key findings:

Finding 1. The scaling law still holds for most of the time series reasoning tasks on both LLMs and VLMs, except for time series prediction.

Figure 3. Scaling law analysis. — **Figure 3.** The scatter plots related to overall accuracy and model sizes. Each plot illustrates the relationship between the log-scaled size of parameter numbers and the performance across all models. The left corresponds to LLMs and the right is VLMs.

Finding 2. Perception, Reasoning, and Decision tasks are highly correlated, and have weak correlation with the Prediction task.

Figure 4. Task correlation analysis. — **Figure 4.** Spearman's rank correlation (ρ) between tasks. "(*)" marks correlations with p-values ≤ 0.05.

Finding 3. Although the textual and visual representations of time series achieve similar performance, they are strongly complementary.

Figure 5. Modality complementarity analysis. — **Figure 5.** Analysis of modality complementarity. Left: Comparison between textual and visual time series representations. Right: Ratio of model (T+V) answers identical to model (T) or model (V).

Finding 4. Tasks with high variance, highlighted in red in Figure 6, indicate that poorer-performing models can be improved through distillation from stronger ones, while low-accuracy, low-variance tasks shown in blue reveal shared weaknesses requiring better training data.

Figure 6. Performance variance analysis. — **Figure 6.** Performance distribution of evaluated models across TSRBench tasks.

Case Study

Use arrow keys (← →) or buttons to navigate through cases

BibTeX

@article{yu2026tsrbench,
  title={TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models},
  author={Yu, Fangxu and Guo, Xingang and Yuan, Lingzhi and Kang, Haoqiang and Zhao, Hongyu and Qin, Lianhui and Huang, Furong and Hu, Bin and Zhou, Tianyi},
  journal={arXiv preprint arXiv:2601.18744},
  year={2026}
}