Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training

May 11, 2026 | arXiv:2605.10835

cs.CV, cs.LG

TLDR

Transcoda is a zero-shot OMR system that combines an advanced synthetic data pipeline, a normalized **kern encoding, and grammar-based decoding to outperform state-of-the-art baselines.

Key contributions

Introduces an advanced synthetic data generation pipeline for OMR.
Normalizes **kern encoding to create unique, easier-to-learn representations.
Uses grammar-based decoding to ensure syntactically correct music transcriptions.
Achieves state-of-the-art OMR performance with a compact 59M-parameter model.
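To make the grammar-based decoding idea concrete, here is a minimal sketch of constrained greedy decoding. It is an illustration only: the toy grammar (duration, then pitch, then separator or end), the `VOCAB` table, and all function names are assumptions made for this example, not the **kern grammar and not Transcoda's implementation. At each step, tokens the grammar forbids are skipped, so the output is always syntactically valid even when the model's top-scoring token is not.

```python
import math

# Hypothetical toy grammar, NOT the real **kern grammar:
# a note is a duration followed by a pitch; notes are joined by a separator.
ALLOWED_NEXT = {
    "START": {"DUR"},
    "DUR": {"PITCH"},
    "PITCH": {"SEP", "END"},
    "SEP": {"DUR"},
}

# Hypothetical vocabulary mapping each token to its grammar class.
VOCAB = {"4": "DUR", "8": "DUR", "c": "PITCH", "e": "PITCH", " ": "SEP", "<eos>": "END"}


def constrained_greedy_step(state, logits):
    """Return the best-scoring token whose class is allowed in `state`."""
    allowed = ALLOWED_NEXT[state]
    best_token, best_score = None, -math.inf
    for token, score in logits.items():
        if VOCAB[token] in allowed and score > best_score:
            best_token, best_score = token, score
    return best_token, VOCAB[best_token]


def decode(step_logits):
    """Greedy decoding over per-step logit dicts, masked by the toy grammar."""
    state, out = "START", []
    for logits in step_logits:
        token, state = constrained_greedy_step(state, logits)
        if state == "END":
            break
        out.append(token)
    return "".join(out)


# The raw argmax at each step ("c", then "4", then "4") would be invalid;
# the grammar mask forces a well-formed sequence instead.
steps = [
    {"4": 0.1, "8": 0.0, "c": 2.0, " ": 0.0, "<eos>": 0.0},
    {"4": 3.0, "c": 1.0, "e": 0.5, " ": 0.0, "<eos>": 0.0},
    {"4": 5.0, "c": 0.0, " ": 0.2, "<eos>": 1.0},
]
result = decode(steps)
```

In a real system the mask would be applied to the model's logits over its full vocabulary, with the grammar state tracked by a parser for the target format.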

Why it matters

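Optical Music Recognition (OMR) is held back by a shortage of annotated real-world scans and by music encodings in which several different texts can describe the same score. Transcoda addresses both problems with richer synthetic training data and a normalized **kern encoding, and adds grammar-based decoding to keep its output syntactically valid. The result is a compact 59M-parameter model, trained in 6 hours on a single GPU, that outperforms billion-parameter baselines.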

Original Abstract

Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state of the art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).
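The scores above are reported as OMR-NED, where lower is better. The abstract does not define the metric, so the following is only a sketch of a generic normalized edit distance over token sequences, assuming Levenshtein distance divided by the length of the longer sequence. The function names and the example tokens are illustrative, and the actual OMR-NED computation may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1] by the longer sequence length (assumed normalization)."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))


# One wrong token out of three gives a score of 1/3.
score = normalized_edit_distance(["4c", "4d", "4e"], ["4c", "4e", "4e"])
```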


