Deep Graph-Language Fusion for Structure-Aware Code Generation
Mert Tiftikci, Amir Molzam Sharifloo, Mira Mezini
TLDR
CGFuse deeply fuses graph features into LLMs at the token level, significantly boosting structure-aware code generation performance.
Key contributions
- Introduces CGFuse, a framework for token-level integration of graph-derived features into PLMs.
- Combines GNNs and LLMs to explicitly preserve fine-grained structural code information.
- Achieves up to 10-16% BLEU and 6-11% CodeBLEU improvements on code generation tasks.
- Addresses the architectural mismatch between sequential LLMs and graph-structured code.
Why it matters
Current LLMs struggle with the structured nature of code due to architectural limitations. CGFuse introduces a novel deep fusion method, enabling LLMs to explicitly leverage graph-based structural information. This significantly advances the field toward more robust and capable AI-driven software development.
Original Abstract
Pre-trained Language Models (PLMs) have the potential to transform software development tasks. However, despite significant advances, current PLMs struggle to capture the structured and relational attributes of code, such as control flow and data dependencies. This limitation is rooted in an architectural mismatch: whereas code structure is best represented by graphs, transformer-based LLMs process input as sequential token patterns and therefore lack explicit structural awareness. While recent research has explored integrating graph-based code representations using techniques like graph feature extraction, retrieval-augmented generation, and prompt engineering, existing approaches suffer from information loss during dense feature extraction or prompt encoding; notably, the potential of deep, token-level fusion of graph features within model internals has not been systematically explored. In this paper, we initiate such an exploration by introducing CGFuse, a novel framework that enables token-level integration of graph-derived representations by infusing learned graph features directly into the intermediate layers of pre-trained language models. CGFuse combines a graph neural network (GNN) with a language model to explicitly preserve and exploit fine-grained structural information from code graphs, including abstract syntax trees and data-flow graphs. We systematically evaluate CGFuse across multiple LLMs, demonstrating up to 10-16% BLEU and 6-11% CodeBLEU improvements in code generation performance. These results highlight the potential of deep graph-PLM integration to advance the field toward more robust, capable AI-driven software development.
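The abstract's core mechanism is infusing learned graph features into a PLM's intermediate layers at the token level: a GNN encodes the code graph (e.g. AST or data-flow graph), and each node's feature is merged into the hidden state of the token(s) it aligns with. Below is a minimal numpy sketch of that idea under stated assumptions; the function names, the mean-aggregation GNN layer, the additive fusion rule, and the node-to-token alignment map are all illustrative choices, not CGFuse's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_layer(node_feats, adj, w):
    """One round of mean-aggregation message passing over the code graph
    (a simplified GCN-style update; the real GNN is unspecified here)."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
    return np.tanh(((adj @ node_feats) / deg) @ w)

def fuse_graph_into_hidden(hidden, graph_feats, node_to_token, proj):
    """Project graph features into the model dimension and add them to the
    hidden states of their aligned tokens (additive token-level fusion)."""
    fused = hidden.copy()
    projected = graph_feats @ proj          # (num_nodes, d_model)
    for node, tok in node_to_token.items():
        fused[tok] += projected[node]       # inject structure at this token
    return fused

# Toy sizes: 4 graph nodes, 6 tokens, graph dim 8, model dim 16.
num_nodes, seq_len, d_graph, d_model = 4, 6, 8, 16
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], dtype=float)   # undirected toy AST edges
node_feats = rng.normal(size=(num_nodes, d_graph))
w = rng.normal(size=(d_graph, d_graph))
proj = rng.normal(size=(d_graph, d_model))
hidden = rng.normal(size=(seq_len, d_model))  # stand-in for one PLM layer's output

graph_feats = gnn_layer(node_feats, adj, w)
node_to_token = {0: 1, 1: 2, 2: 4, 3: 5}      # hypothetical node-token alignment
fused = fuse_graph_into_hidden(hidden, graph_feats, node_to_token, proj)
```

In a real transformer this fusion would sit inside the forward pass of a chosen intermediate layer, so later layers attend over structure-enriched hidden states; tokens with no aligned graph node pass through unchanged.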