ArXiv TLDR

V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction

2605.10896

Marcin Kostrzewa, Sebastian Tomczak, Roman Furman, Anna Poberezhna, Michał Furgała + 2 more

cs.LG

TLDR

V4FinBench introduces a new large-scale dataset for corporate bankruptcy prediction, benchmarking tabular foundation models and LLMs against standard methods.

Key contributions

  • Introduces V4FinBench, a 1M+ record dataset with 131 features for multi-horizon bankruptcy prediction.
  • Benchmarks tabular models (TabPFN) and LLMs (Llama-3-8B) against standard methods on imbalanced data.
  • Shows finetuned TabPFN matches or exceeds gradient boosting at longer horizons, and transfers to an external US dataset.

Why it matters

This paper addresses the need for larger, realistic datasets in corporate bankruptcy prediction. V4FinBench provides a vital resource, enabling robust evaluation of foundation models and LLMs under real-world, imbalanced conditions. Findings highlight TabPFN's potential and current LLM limitations in this complex financial task.

Original Abstract

Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain between 6,000 and 80,000 company-year observations, while larger resources are behind subscription paywalls. To address this gap, we introduce V4FinBench, a benchmark of over one million company-year records from the Visegrád Group (V4) economies (2006-2021), with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion jointly capturing solvency, profitability, and liquidity deterioration. V4FinBench is designed to support the evaluation of tabular and foundation-model methods under realistic class imbalance, with positive rates between 0.19% and 0.36%. We provide reference evaluations of standard tabular baselines, finetuned TabPFN, and QLoRA-finetuned Llama-3-8B. With imbalance-aware finetuning, TabPFN matches or exceeds gradient boosting at longer time horizons on both $F_1$-score and ROC-AUC. In contrast, Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on $F_1$-score, with the gap widening sharply beyond the immediate horizon. In an external evaluation on the American Bankruptcy Dataset, the V4FinBench-finetuned TabPFN checkpoint improves over vanilla TabPFN, suggesting that adaptation captures transferable financial-distress structure rather than only V4-specific patterns. V4FinBench is publicly released to support further evaluation and development of prediction methods on realistic financial data.
