KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Preprint

1ShanghaiTech University 2Ant Group
{fengkun2025,shanzw2022,renkan}@shanghaitech.edu.cn
*Equal contribution, Work done during internship at Ant Group
Corresponding author: renkan@shanghaitech.edu.cn

Figure 1. Unlike LLM-only models that suffer numerical hallucinations and TSFM-only models that lack semantic reasoning, KairosAgent bridges semantic reasoning and numerical forecasting.

Abstract

Cross-domain multimodal time series forecasting is challenging because models must integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and cannot perform future-oriented semantic reasoning, while LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting that comprises an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools that strengthen the numerical understanding and semantic reasoning of the LLM. The reasoning results are then fused into the TSFM pipeline, enabling more accurate and reliable predictions. To further improve reasoning, we curate a large-scale corpus of high-quality trajectories and introduce a reinforcement-learning-from-forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents.

Overview

Figure 2. KairosAgent bridges semantic reasoning and numerical forecasting through an LLM reasoner, tool-grounded morphology analysis, and a TSFM forecaster.

Methodology

KairosAgent follows a modular reason-then-forecast design: the LLM reasoner handles semantic pattern analysis, while the TSFM forecaster preserves precise numerical prediction.

Overall Framework

Stage I: Tool-Augmented Morphological Reasoning

Given historical observations and textual context, the LLM reasoner interacts with statistical tools over multiple turns to inspect trend, periodicity, volatility, and regime changes. It then synthesizes a compact morphology description that captures anticipated future patterns without committing to exact numeric values.
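
The kind of statistical tools the reasoner might call can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual KairosAgent tool set: the function names, the `TOOLS` registry, and the `call_tool` helper are all hypothetical, and each tool uses a deliberately simple estimator (least-squares slope, autocorrelation peak, standard deviation of first differences, and a single mean-shift split).

```python
import math
from statistics import mean, pstdev

def trend_slope(series):
    """Least-squares slope of the series against its time index."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = mean(series)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def dominant_period(series):
    """Lag (>= 2) with the highest positive autocorrelation, or None."""
    n = len(series)
    y_mean = mean(series)
    centered = [y - y_mean for y in series]
    var = sum(c * c for c in centered)
    if var == 0:
        return None
    best_lag, best_acf = None, 0.0
    for lag in range(2, n // 2 + 1):
        acf = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / var
        if acf > best_acf:
            best_lag, best_acf = lag, acf
    return best_lag

def volatility(series):
    """Standard deviation of first differences."""
    return pstdev([b - a for a, b in zip(series, series[1:])])

def mean_shift_point(series, min_segment=3):
    """Split index that maximizes the gap between left and right segment means."""
    candidates = range(min_segment, len(series) - min_segment + 1)
    return max(candidates, key=lambda k: abs(mean(series[:k]) - mean(series[k:])))

# Hypothetical registry: the reasoner picks a tool by name on each turn.
TOOLS = {
    "trend": trend_slope,
    "periodicity": dominant_period,
    "volatility": volatility,
    "regime_change": mean_shift_point,
}

def call_tool(name, series):
    """Return a short textual observation that can be fed back to the LLM."""
    return f"{name}: {TOOLS[name](series)}"

# Synthetic series with a 12-step cycle, used only for illustration.
seasonal = [math.sin(2 * math.pi * t / 12) for t in range(48)]
observation = call_tool("periodicity", seasonal)
```

Returning plain-text observations keeps the tool feedback in the same modality as the reasoner's context, which is one plausible way to support the multi-turn interaction described above.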

Stage II: Morphology-Conditioned Forecasting

The morphology description is encoded as a semantic prior and fused into the TSFM decoder through lightweight gated cross-modal fusion. This keeps numerical generation inside the native time series model while injecting future-oriented semantic reasoning into the forecasting pipeline.
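
One way to picture gated cross-modal fusion is a residual cross-attention step whose contribution is scaled by a learnable gate. The sketch below is an assumption-laden illustration, not the KairosAgent architecture: the text above does not specify the gate or the attention form, so the tanh gate, the identity query/key/value projections, and all function names are hypothetical choices made to keep the example small.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, text_tokens):
    """Scaled dot-product attention of one time series state over text embeddings.

    Assumption: identity projections, so query and tokens share one dimension.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in text_tokens]
    weights = softmax(scores)
    return [sum(w * tok[d] for w, tok in zip(weights, text_tokens))
            for d in range(len(query))]

def gated_fusion(hidden_states, text_tokens, gate_param):
    """Residual fusion: h + tanh(gate_param) * Attention(h, text)."""
    gate = math.tanh(gate_param)
    fused = []
    for h in hidden_states:
        context = cross_attend(h, text_tokens)
        fused.append([hi + gate * ci for hi, ci in zip(h, context)])
    return fused

# Toy decoder states and morphology embeddings (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0]]
morphology = [[0.5, 0.5]]
unchanged = gated_fusion(hidden, morphology, gate_param=0.0)
injected = gated_fusion(hidden, morphology, gate_param=100.0)
```

With the gate parameter at zero the decoder states pass through untouched, so a fusion module of this shape can be added to a pretrained forecaster without disturbing its initial numerical behavior.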

T-STAR Corpus

KairosAgent is trained with T-STAR, a 40k-trajectory time series reasoning corpus with tool augmentation. The corpus covers diverse domains and provides process-level supervision for multi-turn analytical reasoning.

Figure 3. Overview of the T-STAR corpus and its tool-augmented trajectory generation pipeline.

Training Strategy

Stage I: SFT of the Reasoner

Supervised fine-tuning (SFT) warms up the LLM reasoner on T-STAR trajectories, teaching it when to invoke tools, how to interpret tool feedback, and how to write structured morphology descriptions for downstream forecasting.

Stage II: Multimodal Alignment of the Forecaster

The TSFM forecaster is trained to consume morphology descriptions as semantic priors. A text encoder maps morphology descriptions into compact semantic embeddings, while cross-modal fusion modules inject these priors into the Kairos decoder under the quantile forecasting objective.
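
A quantile forecasting objective is commonly implemented as the pinball loss averaged over quantile levels and horizon steps. The sketch below assumes that standard form; whether Kairos uses exactly this loss, and which quantile levels it predicts, is not stated here, so the function names and the example levels are illustrative.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for one value at quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1) * diff)

def quantile_loss(targets, quantile_forecasts, quantiles):
    """Mean pinball loss; quantile_forecasts[i][t] is the forecast for quantiles[i] at step t."""
    total, count = 0.0, 0
    for q, forecast in zip(quantiles, quantile_forecasts):
        for y, y_hat in zip(targets, forecast):
            total += pinball_loss(y, y_hat, q)
            count += 1
    return total / count

# Made-up two-step horizon with three quantile levels.
targets = [10.0, 12.0]
forecasts = [[8.0, 11.0], [10.0, 12.0], [13.0, 14.0]]
loss = quantile_loss(targets, forecasts, quantiles=[0.1, 0.5, 0.9])
```

Because the loss penalizes under- and over-prediction asymmetrically, minimizing it at several levels yields a predictive interval rather than a single point forecast.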

Stage III: RL with Fine-Grained Credit Assignment

Group Relative Policy Optimization (GRPO) refines the reasoner, using the frozen Stage II forecaster as the reward module. Rather than assigning a single reward to the final trajectory, turn-level credit assignment scores the marginal forecasting utility of each reasoning turn and tool call.
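
Turn-level credit assignment can be illustrated with a small sketch. It assumes, hypothetically, that the frozen forecaster can be scored after every reasoning turn, so that each turn's reward is the reduction in forecast error it brings, and that rewards are then normalized within a sampled group in the usual group-relative fashion. The function names, the flattening across turns, and the error values are illustrative assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev

def turn_level_rewards(prefix_errors):
    """prefix_errors[t] is the forecaster error given the first t reasoning turns.

    The reward of a turn is the error reduction it brings (its marginal utility).
    """
    return [prev - cur for prev, cur in zip(prefix_errors, prefix_errors[1:])]

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within a group (population standard deviation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def turn_advantages(group_prefix_errors):
    """Normalize turn rewards across every turn of every trajectory in the group."""
    per_traj = [turn_level_rewards(errs) for errs in group_prefix_errors]
    flat = [r for rewards in per_traj for r in rewards]
    flat_adv = iter(group_relative_advantages(flat))
    return [[next(flat_adv) for _ in rewards] for rewards in per_traj]

# Made-up errors for two sampled trajectories with two turns each.
group = [
    [1.00, 0.80, 0.50],  # both turns reduce the forecast error
    [1.00, 1.10, 0.90],  # the first turn hurts, the second recovers
]
advantages = turn_advantages(group)
```

The per-turn rewards of a trajectory telescope to its total error reduction, so an outcome-level reward is recovered by summing them; the turn-level view only redistributes that credit across turns.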

Zero-Shot Reasoning Results

KairosAgent improves future morphology reasoning by grounding LLM analysis in statistical tool observations. With turn-level RL, the 4B reasoner achieves the best accuracy among comparable-scale models across all evaluated Time-MMD domains.

Model                         Climate  Energy  Traffic
Advanced Models (reference only)
GPT-5.2                       97.80    84.12   34.95
DeepSeek-R1                   99.18    79.51   37.76
Comparable-Scale Models
Llama-3.1-8B-Instruct         52.47    38.72   43.62
DeepSeek-R1-Distill-Qwen-7B   42.86    5.08    14.29
KairosAgent-4B (SFT-Only)     97.80    45.21   40.56
+ Outcome-Level Reward RL     96.70    43.33   38.27
+ Turn-Level Reward RL        98.08    50.47   43.88

Morphology reasoning accuracy (%) on Time-MMD. Red bold and blue underline mark best and second-best among open-source models.

Zero-Shot Forecasting Results

KairosAgent achieves strong zero-shot forecasting on both regular Time-MMD and irregular Time-IMM benchmarks. The Time-MMD table summarizes domain-level MSE and MAE against zero-shot TSFMs and full-shot baselines, while the Time-IMM radar chart shows robustness under temporal irregularities.

Regular Forecasting Evaluation

Zero-shot models: KairosAgent, Aurora, Sundial, Moirai, ChronosBolt. Full-shot multimodal models: T3Time, TimeCMA, CALF. Full-shot unimodal models: PatchTST, DLinear. Each cell reports MSE / MAE.

Domain       KairosAgent     Aurora          Sundial         Moirai          ChronosBolt     T3Time          TimeCMA         CALF            PatchTST        DLinear
Agriculture  0.194 / 0.282   0.282 / 0.356   0.327 / 0.366   0.239 / 0.306   0.218 / 0.302   0.229 / 0.303   0.318 / 0.360   0.241 / 0.311   0.248 / 0.308   0.377 / 0.396
Climate      0.863 / 0.739   0.863 / 0.747   0.920 / 0.765   0.982 / 0.792   0.948 / 0.788   1.206 / 0.894   1.282 / 0.926   1.199 / 0.895   1.176 / 0.891   1.036 / 0.807
Economy      0.186 / 0.335   0.275 / 0.412   0.216 / 0.348   0.198 / 0.345   0.192 / 0.342   0.239 / 0.384   0.262 / 0.412   0.223 / 0.370   0.223 / 0.380   0.218 / 0.370
Energy       0.217 / 0.330   0.251 / 0.370   0.234 / 0.337   0.261 / 0.347   0.263 / 0.355   0.266 / 0.378   0.351 / 0.447   0.258 / 0.373   0.243 / 0.353   0.233 / 0.346
Environment  0.378 / 0.435   0.276 / 0.379   0.379 / 0.443   0.412 / 0.446   0.427 / 0.462   0.489 / 0.507   0.536 / 0.533   0.537 / 0.509   0.496 / 0.513   0.591 / 0.627
Security     76.658 / 4.340  72.763 / 4.085  83.403 / 4.836  74.249 / 4.129  73.977 / 4.117  72.113 / 4.070  72.011 / 4.113  73.267 / 4.040  76.105 / 4.445  82.521 / 4.891
Social Good  0.769 / 0.376   0.828 / 0.506   0.819 / 0.377   0.868 / 0.391   0.951 / 0.388   0.998 / 0.432   1.092 / 0.578   0.890 / 0.416   0.959 / 0.475   0.891 / 0.448
Traffic      0.151 / 0.231   0.162 / 0.289   0.228 / 0.292   0.186 / 0.263   0.222 / 0.249   0.289 / 0.368   0.297 / 0.412   0.227 / 0.305   0.209 / 0.316   0.219 / 0.315
1st Count    6 / 6           2 / 1           0 / 0           0 / 0           0 / 0           0 / 0           1 / 0           0 / 1           0 / 0           0 / 0

Time-MMD performance across diverse domains. Lower MSE and MAE are better. Red bold and blue underline mark best and second-best results.

Irregular Forecasting Evaluation

Figure 4. Time-IMM MAE comparison across irregular multimodal time series forecasting tasks.

Tool Usage Analysis

The agent learns a data-dependent tool selection policy rather than a fixed calling pattern. Tool usage shifts across reasoning turns and adapts to dataset-specific temporal properties.

Figure 5. Tool usage distributions over reasoning turns and datasets in T-STAR trajectories.
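
A per-turn tool usage distribution of the kind plotted in such heatmaps can be tallied with a few lines of code. The sketch below assumes, hypothetically, that each trajectory is stored as a list holding one tool name per reasoning turn; the trajectory format, the tool names, and the example data are illustrative and not taken from T-STAR.

```python
from collections import Counter, defaultdict

def tool_usage_by_turn(trajectories):
    """Map each turn index (1-based) to the share of calls made to each tool."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for turn, tool in enumerate(trajectory, start=1):
            counts[turn][tool] += 1
    return {
        turn: {tool: n / sum(counter.values()) for tool, n in counter.items()}
        for turn, counter in counts.items()
    }

# Made-up trajectories: one tool call per turn.
trajectories = [
    ["trend", "periodicity"],
    ["trend", "volatility"],
]
usage = tool_usage_by_turn(trajectories)
```

Grouping the same tally by dataset instead of by turn would give the dataset-level view of tool usage.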