ArXiv TLDR

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

arXiv:2605.04998

Jinju Lee

cs.SD · cs.IR · cs.LG

TLDR

This paper investigates optimal data mix ratios for fine-tuning a pop-trained chord generation model to jazz, balancing new genre acquisition with old genre retention.

Key contributions

  • Treats chord generation as a standalone task rather than a conditioning component inside a larger pipeline.
  • Fine-tunes a pop-pretrained Music Transformer on jazz while mixing in varying amounts of pop "rehearsal" data.
  • Shows pop accuracy recovers to baseline with ~2.5K pop rehearsal samples (1.65× the jazz data volume).
  • Observes that the metric-best model is not always the perceptually preferred one, highlighting the role of stylistic identity.
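The rehearsal-mix setup in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function name, the pop pool size, and the sampling scheme are assumptions; only the 1,513 jazz sequences and the swept rehearsal volumes {0, 1K, 2.5K, 5K, 10K} come from the paper.

```python
import random

def build_finetune_mix(jazz_seqs, pop_seqs, n_rehearsal, seed=0):
    """Combine the full jazz training set with a sampled pop 'rehearsal' subset.

    jazz_seqs   -- all new-domain sequences (every run uses all of them)
    pop_seqs    -- old-domain pool to draw rehearsal samples from
    n_rehearsal -- swept rehearsal volume, e.g. one of {0, 1000, 2500, 5000, 10000}
    """
    rng = random.Random(seed)
    rehearsal = rng.sample(pop_seqs, min(n_rehearsal, len(pop_seqs)))
    mix = jazz_seqs + rehearsal  # jazz kept in full, pop subsampled
    rng.shuffle(mix)             # interleave domains before training
    return mix

# Hypothetical corpora: 1,513 jazz sequences (as in the paper) and a
# much larger pop pool (size here is illustrative).
jazz = [("jazz", i) for i in range(1513)]
pop = [("pop", i) for i in range(60000)]

# The ~2.5K setting, where pop accuracy recovers to baseline.
train_set = build_finetune_mix(jazz, pop, n_rehearsal=2500)
print(len(train_set))  # 4013 = 1513 jazz + 2500 pop rehearsal
```

The ratio the paper identifies, 2,500 / 1,513 ≈ 1.65, is the point where the old-domain accuracy returns to its pre-fine-tuning baseline; larger mixes saturate rather than improve further.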

Why it matters

This paper offers empirical insight into how to mix old- and new-domain data when adapting a music generation model across genres, here for chord progressions. It gives developers a concrete reference point for balancing retention and acquisition, with practical implications for music co-creation tools.

Original Abstract

Chord progression generation is practically important but understudied. Most large-scale symbolic music systems target melody, multi-track arrangement, or audio synthesis, and chord-only models tend to be relegated to conditioning components inside larger pipelines. This paper treats chord generation as a standalone task and addresses a question that arises whenever such a model is adapted across genres: how much old-domain data must be retained during fine-tuning to acquire a new domain without forgetting the old? I study jazz fine-tuning starting from a pop-pretrained 25M-parameter Music Transformer (84.24% top-1 chord accuracy on a held-out pop test set). The available jazz corpus is an order of magnitude smaller than the pop corpus, so every fine-tune run uses all 1,513 jazz training sequences. The swept variable is the volume of pop "rehearsal" data mixed alongside, taking values in {0, 1K, 2.5K, 5K, 10K}. Every fine-tuned model gains 7 to 9 points of jazz top-1. Pop accuracy collapses by 2.14 points under jazz-only fine-tuning, recovers to baseline at approximately 2.5K rehearsal samples (1.65x the jazz volume), and saturates beyond that point. A complementary observation: the metric-best run (F3, 2.5K mix) is not always the perceptually preferred one. The pop-leaning (10K) and jazz-leaning (1K) endpoints carry more committed stylistic identities that the author more often selects as finished output in informal listening. I discuss what this suggests for music co-creation tools but make no perceptual claim, since no formal listening study has been conducted. All six checkpoints are released on the HuggingFace Hub at https://huggingface.co/PearlLeeStudio.
