ArXiv TLDR

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

arXiv:2604.18576

Kevin Murphy

cs.AI

TLDR

BLF is an agentic system for binary forecasting that achieves state-of-the-art performance on ForecastBench by maintaining a compact Bayesian linguistic belief state and applying hierarchical aggregation and calibration.

Key contributions

  • Proposes BLF, an agentic system achieving state-of-the-art binary forecasting on ForecastBench, outperforming major LLMs.
  • Introduces a Bayesian linguistic belief state for iterative evidence integration, avoiding ever-growing context windows.
  • Develops hierarchical multi-trial aggregation with logit-space shrinkage for combining K independent forecasting trials.
  • Presents hierarchical Platt scaling for calibration, preventing over-shrinking of extreme predictions for sources with skewed base rates.
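The logit-space shrinkage idea in the third bullet can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the paper uses a data-dependent prior, whereas here `prior_logit` and the fixed `shrink` weight are hypothetical stand-ins.

```python
import math

def logit(p, eps=1e-6):
    """Map a probability to log-odds, clipping away exact 0/1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

def aggregate_trials(probs, prior_logit=0.0, shrink=0.3):
    """Combine K independent trial forecasts: average in logit space,
    then shrink the mean toward a prior log-odds (logit 0 = p of 0.5).
    `shrink` in [0, 1] controls the pull toward the prior."""
    mean_logit = sum(logit(p) for p in probs) / len(probs)
    combined = (1 - shrink) * mean_logit + shrink * prior_logit
    return sigmoid(combined)
```

Averaging in logit space rather than probability space keeps extreme but consistent forecasts extreme, while the shrinkage term tempers overconfident runs.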

Why it matters

BLF advances the state of the art in binary forecasting by replacing the common ever-growing retrieval context with a compact, structured linguistic belief state, then combining independent trials and calibrating the result with hierarchical statistical methods. Ablations show that each component contributes measurably, making the system a practical recipe for reliable probabilistic forecasting with LLMs.

Original Abstract

We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running $K$ independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust back-testing framework with a leakage rate below 1.5%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.
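Idea (3), Platt scaling with a hierarchical prior, can be sketched as below. Standard Platt scaling applies an affine map in logit space; the hierarchical twist is that sparsely observed sources fall back toward global parameters instead of extreme per-source fits. The precision-weighted shrinkage scheme and the `strength` parameter here are hypothetical simplifications; the paper's actual prior may differ.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds with clipping away exact 0/1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def platt_calibrate(p, a, b):
    """Standard Platt scaling: affine map in logit space."""
    return sigmoid(a * logit(p) + b)

def hierarchical_params(source_a, source_b, n_source,
                        global_a=1.0, global_b=0.0, strength=20):
    """Shrink per-source Platt parameters toward global ones in
    proportion to how much calibration data the source has.
    With n_source = 0 this returns the global (identity) map,
    so extreme predictions from thin sources are not over-shrunk."""
    w = n_source / (n_source + strength)
    a = w * source_a + (1 - w) * global_a
    b = w * source_b + (1 - w) * global_b
    return a, b
```

With `a = 1, b = 0` the map is the identity, so a source with no calibration data keeps its raw forecasts; as data accumulates, the fitted per-source correction takes over.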
