Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
Mohamad Khajezade, Fatemeh H. Fard, Mohamed Sami Shehata
TLDR
This paper introduces a knowledge distillation framework with response stabilization to make compact open-source models reliable for cross-language code clone detection.
Key contributions
- Proposes a knowledge distillation framework to transfer reasoning from large LLMs to compact models for X-CCD.
- Constructs reasoning-oriented synthetic training data using cross-language code pairs from Project CodeNet.
- Introduces response stabilization methods like forced prompting and classification heads to improve model reliability.
- Demonstrates improved reliability and predictive performance, and shows that classification-head variants substantially reduce inference time for X-CCD.
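The response-stabilization idea in the third bullet can be sketched as a label parser plus a forced-conclusion fallback prompt. This is an illustrative sketch only; the paper's exact prompt wording and parsing rules are not given in this digest, so the regexes and the `FORCED_CONCLUSION` text below are assumptions.

```python
import re

# Forced-conclusion re-prompt (illustrative wording; the paper's
# exact prompt is not specified in this digest).
FORCED_CONCLUSION = (
    "Based on your reasoning above, answer with exactly one word, "
    "'yes' or 'no': are the two programs semantic clones?"
)

def parse_clone_label(response):
    """Map free-form model output to a binary clone label.

    Returns True (clone), False (not a clone), or None when no
    verdict can be extracted -- the failure mode that makes compact
    models unreliable without stabilization.
    """
    tail = response.strip().lower()[-200:]  # verdict is usually at the end
    if re.search(r"\b(not\s+clones?|no)\b", tail):
        return False
    if re.search(r"\b(clones?|yes)\b", tail):
        return True
    return None

def stabilized_label(generate, prompt):
    """Try the reasoning-oriented prompt first; if the output cannot
    be mapped to a binary label, force a one-word conclusion."""
    label = parse_clone_label(generate(prompt))
    if label is None:
        label = parse_clone_label(generate(prompt + "\n" + FORCED_CONCLUSION))
    return label
```

With any text-generation callable plugged in as `generate`, every code pair yields a usable binary label, which is what the paper's "response rate" metric measures.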
Why it matters
Cross-language code clone detection is crucial but challenging, especially with compact models. This work makes open-source models practical and reliable for this task by transferring advanced reasoning capabilities. It addresses key concerns like cost and reproducibility, enabling wider adoption of semantic code analysis.
Original Abstract
Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels. To address these limitations, we propose a knowledge distillation framework that transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models for X-CCD. Using cross-language code pairs derived from Project CodeNet, we construct reasoning-oriented synthetic training data and fine-tune Phi3 and Qwen-Coder with LoRA adapters. We further introduce response stabilization methods, including forced conclusion prompting, a binary classification head, and a contrastive classification head, and evaluate model behavior using both predictive metrics and response rate. Experiments on Python–Java, Rust–Java, Rust–Python, and Rust–Ruby show that knowledge distillation consistently improves the reliability of compact models and often improves predictive performance, especially under distribution shift. In addition, classification-head variants substantially reduce inference time compared to generation-based inference. Overall, our results show that reasoning-oriented distillation combined with response stabilization makes compact open-source models more practical and reliable for X-CCD.
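The classification-head variants the abstract describes can be sketched as a single forward pass over pooled hidden states instead of free-form generation, which is where the inference-time savings come from. The shapes, mean pooling, and random inputs below are illustrative assumptions; the paper's actual head architecture is not detailed in this digest.

```python
import numpy as np

rng = np.random.default_rng(0)

def classification_head(hidden_states, W, b):
    """Binary clone classifier over a (seq_len, d) matrix of
    last-layer hidden states: mean-pool over the sequence, then a
    linear layer with a sigmoid, yielding P(clone) directly --
    no token-by-token generation, hence a single forward pass.
    """
    pooled = hidden_states.mean(axis=0)      # (d,)
    logit = pooled @ W + b                   # scalar
    return 1.0 / (1.0 + np.exp(-logit))      # P(clone) in (0, 1)

d = 16
h = rng.standard_normal((10, d))             # stand-in hidden states
W = rng.standard_normal(d)                   # trained head weights (random here)
p = classification_head(h, W, 0.0)
label = p >= 0.5                             # binary clone label
```

In a real setup the hidden states would come from the frozen, LoRA-adapted student model and `W`, `b` would be trained on the synthetic clone-pair labels; only the head's forward pass replaces the generation loop at inference time.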