BMdataset: A Musicologically Curated LilyPond Dataset
Matteo Spanio, Ilay Guler, Antonio Rodà
TL;DR
BMdataset and LilyBERT bring music understanding to LilyPond scores, showing that a small, expertly curated dataset can outperform large, noisy corpora for symbolic music.
Key contributions
- Presents BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) from Baroque manuscripts.
- Introduces LilyBERT, a CodeBERT-based encoder adapted to symbolic music by extending the vocabulary with 115 LilyPond-specific tokens and continuing masked language model pre-training (see the sketch after this list).
- Demonstrates that a small, expertly curated dataset (BMdataset, ~90M tokens) can outperform a large, noisy corpus (PDMX, ~15B tokens) for music understanding.
- Achieves the best overall results (84.3% composer-classification accuracy) by combining broad pre-training with domain-specific fine-tuning.
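The vocabulary-extension step lends itself to a short illustration. Below is a minimal sketch, assuming the standard Hugging Face transformers API and the public microsoft/codebert-base checkpoint; the listed LilyPond tokens are illustrative stand-ins, not the paper's actual 115-token list.

```python
# Sketch: extend a CodeBERT tokenizer with LilyPond tokens and resize the
# encoder's embedding matrix before masked-language-model pre-training.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

# Illustrative subset of LilyPond syntax; the real 115-token list is defined
# by the released tokenizer, not by this summary.
lilypond_tokens = [r"\relative", r"\clef", r"\time", r"\key", r"\score"]
num_added = tokenizer.add_tokens(lilypond_tokens)

# New tokens get fresh embedding rows in the model.
model.resize_token_embeddings(len(tokenizer))
```

The new embedding rows start untrained and only acquire meaning during the subsequent masked-language-model pass over the LilyPond corpus.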
Why it matters
Symbolic music research has largely ignored text-based engraving formats like LilyPond. This paper fills that gap with a curated dataset (BMdataset) and a purpose-built encoder (LilyBERT), and its experiments highlight the value of small, expertly curated datasets over large, noisy corpora for effective music understanding.
Original Abstract
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
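Since the weights are public, the released model can be exercised directly. The following is a minimal sketch of loading the checkpoint and running a linear probe on frozen embeddings, assuming the csc-unipd/lilybert repository loads through the standard transformers API; the mean-pooling choice, probe classifier, and toy scores are illustrative assumptions, not the paper's exact evaluation protocol.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csc-unipd/lilybert")
encoder = AutoModel.from_pretrained("csc-unipd/lilybert")
encoder.eval()  # frozen encoder: linear probing trains only the classifier

def embed(scores):
    """Mean-pool the frozen encoder's last hidden states into one vector per score."""
    with torch.no_grad():
        batch = tokenizer(scores, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state            # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # masked mean pooling

# Toy placeholder scores and composer labels; in practice these would come
# from a labeled corpus such as Mutopia.
train_scores = [r"\relative c' { c4 d e f g1 }", r"\relative c'' { g4 f e d c1 }"]
train_labels = [0, 1]

probe = LogisticRegression(max_iter=1000).fit(embed(train_scores), train_labels)
print(probe.predict(embed([r"\relative c' { c4 e g c' }"])))
```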