Deep Graph-Language Fusion for Structure-Aware Code Generation
Mert Tiftikci, Amir Molzam Sharifloo, Mira Mezini
TLDR
CGFuse deeply fuses graph features into LLMs at the token level, significantly boosting structure-aware code generation performance.
Key contributions
- Introduces CGFuse, a framework for token-level integration of graph-derived features into PLMs.
- Combines GNNs and LLMs to explicitly preserve fine-grained structural code information.
- Achieves up to 10-16% BLEU and 6-11% CodeBLEU improvements on code generation tasks.
- Addresses the architectural mismatch between sequential LLMs and graph-structured code.
Why it matters
Current LLMs struggle with the structured nature of code due to architectural limitations. CGFuse introduces a novel deep fusion method, enabling LLMs to explicitly leverage graph-based structural information. This significantly advances the field toward more robust and capable AI-driven software development.
Original Abstract
Pre-trained Language Models (PLMs) have the potential to transform software development tasks. However, despite significant advances, current PLMs struggle to capture the structured and relational attributes of code, such as control flow and data dependencies. This limitation is rooted in an architectural mismatch: whereas code structure is best represented by graphs, transformer-based LLMs process input as sequential token patterns and therefore lack explicit structural awareness. While recent research has explored integrating graph-based code representations using techniques like graph feature extraction, retrieval-augmented generation, and prompt engineering, existing approaches suffer from information loss during dense feature extraction or prompt encoding; notably, the potential of deep, token-level fusion of graph features within model internals has not been systematically explored. In this paper, we initiate such an exploration by introducing CGFuse, a novel framework that enables token-level integration of graph-derived representations by infusing learned graph features directly into the intermediate layers of pre-trained language models. CGFuse combines a graph neural network (GNN) with a language model to explicitly preserve and exploit fine-grained structural information from code graphs, including abstract syntax trees and data-flow graphs. We systematically evaluate CGFuse across multiple LLMs, demonstrating up to 10-16% BLEU and 6-11% CodeBLEU improvements in code generation performance. These results highlight the potential of deep graph-PLM integration to advance the field toward more robust, capable AI-driven software development.
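The abstract's core mechanism is infusing learned graph features into a PLM's intermediate layers at the token level: a GNN encodes the code graph (e.g. AST or data-flow graph), and each node's feature is merged into the hidden state of the token(s) it aligns with. Below is a minimal numpy sketch of that idea under stated assumptions; the function names, the mean-aggregation GNN layer, the additive fusion rule, and the node-to-token alignment map are all illustrative choices, not CGFuse's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_layer(node_feats, adj, w):
    """One round of mean-aggregation message passing over the code graph
    (a simplified GCN-style update; the real GNN is unspecified here)."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
    return np.tanh(((adj @ node_feats) / deg) @ w)

def fuse_graph_into_hidden(hidden, graph_feats, node_to_token, proj):
    """Project graph features into the model dimension and add them to the
    hidden states of their aligned tokens (additive token-level fusion)."""
    fused = hidden.copy()
    projected = graph_feats @ proj          # (num_nodes, d_model)
    for node, tok in node_to_token.items():
        fused[tok] += projected[node]       # inject structure at this token
    return fused

# Toy sizes: 4 graph nodes, 6 tokens, graph dim 8, model dim 16.
num_nodes, seq_len, d_graph, d_model = 4, 6, 8, 16
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 0],
                [0, 1, 0, 0]], dtype=float)   # undirected toy AST edges
node_feats = rng.normal(size=(num_nodes, d_graph))
w = rng.normal(size=(d_graph, d_graph))
proj = rng.normal(size=(d_graph, d_model))
hidden = rng.normal(size=(seq_len, d_model))  # stand-in for one PLM layer's output

graph_feats = gnn_layer(node_feats, adj, w)
node_to_token = {0: 1, 1: 2, 2: 4, 3: 5}      # hypothetical node-token alignment
fused = fuse_graph_into_hidden(hidden, graph_feats, node_to_token, proj)
```

In a real transformer this fusion would sit inside the forward pass of a chosen intermediate layer, so later layers attend over structure-enriched hidden states; tokens with no aligned graph node pass through unchanged.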