When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift

April 21, 2026 | arXiv:2604.19514

cs.LG, cs.AI, cs.CR, cs.SI

TLDR

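On the Elliptic Bitcoin Dataset, GNNs underperform feature-only models for fraud detection under a strictly inductive evaluation. Earlier results were inflated by data leakage, and the real graph topology can mislead models under temporal distribution shift.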

Key contributions

GNNs underperform feature-only baselines for Bitcoin fraud detection under strict inductive evaluation.
Data leakage from training-time exposure to test-period adjacency accounts for a 39.5-point F1 gap in a paired controlled experiment.
Real transaction graph topology can be misleading under temporal distribution shift: in edge-shuffle ablations, randomly wired graphs outperformed the real graph.
Hybrid models that combine GNN embeddings with raw features offer only marginal gains and remain substantially below feature-only baselines.
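
The leakage described above comes from the training graph containing edges that touch test-period transactions. The following is a minimal sketch of how such edges might be excluded, not the authors' released protocol. The function name training_edges, the time_step mapping, and the last_train_step cutoff are hypothetical names introduced here for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def training_edges(
    edges: List[Edge],
    time_step: Dict[int, int],
    last_train_step: int,
) -> List[Edge]:
    """Keep only edges whose two endpoints both fall in the training period.

    Any edge touching a node with time_step > last_train_step is dropped,
    so the model never sees test-period adjacency while training.
    """
    return [
        (u, v)
        for u, v in edges
        if time_step[u] <= last_train_step and time_step[v] <= last_train_step
    ]


# Toy data (assumed for illustration): node id -> time step.
time_step = {0: 1, 1: 1, 2: 2, 3: 3}
edges = [(0, 1), (1, 2), (2, 3)]

# Node 3 belongs to the test period, so edge (2, 3) is removed.
train_graph = training_edges(edges, time_step, last_train_step=2)
print(train_graph)  # [(0, 1), (1, 2)]
```

A transductive setup would instead pass the full edge list to the model during training and only mask the test labels, which is the kind of exposure the paper identifies as leakage.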

Why it matters

This paper challenges the assumed superiority of GNNs for Bitcoin fraud detection on the Elliptic dataset and traces earlier favorable results to data leakage. It shows that graph structure can be a liability under temporal distribution shift, which makes rigorous, leakage-free evaluation essential.

Original Abstract

The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 +/- 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset's topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.
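
The edge-shuffle ablation mentioned in the abstract can be illustrated with a small sketch. This is one possible randomization, assumed here for illustration; the abstract does not specify the exact shuffling scheme. The function name shuffle_edges and the seed parameter are hypothetical.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


def shuffle_edges(edges: List[Edge], seed: int = 0) -> List[Edge]:
    """Randomly rewire a graph by permuting the target endpoints of its edges.

    The edge count and each node's out-degree and in-degree are preserved,
    but which source connects to which target becomes random. Self-loops and
    duplicate edges may appear; this sketch does not filter them out.
    """
    rng = random.Random(seed)
    sources = [u for u, _ in edges]
    targets = [v for _, v in edges]
    rng.shuffle(targets)
    return list(zip(sources, targets))


# Toy edge list (assumed for illustration).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
rewired = shuffle_edges(edges, seed=42)
print(rewired)
```

Training the same GNN on the rewired edges and on the original edges, with node features held fixed, separates what the model gains from the real topology from what it gains from features alone.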

