FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

April 24, 2026 | arXiv:2604.22534

Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu

cs.LG, cs.AI

TLDR

FeatEHR-LLM uses LLMs to automate clinically meaningful feature engineering from irregularly sampled EHR time series, achieving the highest mean AUROC on 7 of 8 clinical prediction tasks.

Key contributions

Leverages LLMs for automated, clinically meaningful feature engineering from irregular EHR data.
The LLM operates on dataset schemas and task descriptions, not raw patient records, which limits patient privacy exposure.
Uses tool-augmented generation to handle irregular temporal data and informative sparsity.
Achieves the highest mean AUROC on 7 of 8 clinical prediction tasks across four ICU datasets, improving by up to 6 percentage points over strong baselines.
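The schema-only idea in the second contribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Variable` class, the `build_prompt` function, the prompt wording, and the example variables are all hypothetical. The point it shows is that the prompt is assembled from variable names, units, and a task description, and never reads patient rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """One column of the dataset schema (hypothetical structure)."""
    name: str
    unit: str
    description: str


def build_prompt(task: str, variables: list[Variable]) -> str:
    """Assemble an LLM prompt from schema metadata only.

    No patient-level values are passed in, so none can leak into the prompt.
    """
    lines = [
        f"Task: {task}",
        "Available variables (schema only, no patient rows):",
    ]
    for v in variables:
        lines.append(f"- {v.name} [{v.unit}]: {v.description}")
    lines.append("Propose one tabular feature as a Python function.")
    return "\n".join(lines)


# Illustrative schema; the variable names are assumptions, not from the paper.
schema = [
    Variable("heart_rate", "bpm", "bedside monitor heart rate"),
    Variable("lactate", "mmol/L", "serum lactate, measured on clinician order"),
]
prompt = build_prompt(
    "Predict in-hospital mortality from the first 48 ICU hours", schema
)
print(prompt)
```

In a full pipeline, the returned string would be sent to an LLM and the generated feature code executed locally against the patient data, so the raw records stay outside the model's context.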

Why it matters

EHR feature engineering is hard because clinical time series are irregularly sampled and structurally sparse. FeatEHR-LLM automates it with LLMs that see only dataset schemas and task descriptions, generating clinically meaningful features while limiting privacy exposure. Across eight ICU prediction tasks, it reports AUROC gains of up to 6 percentage points over strong baselines.

Original Abstract

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
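To make "uneven observation patterns and informative sparsity" concrete, here is a minimal sketch of the kind of univariate feature code such a pipeline might produce. It is an assumption for illustration, not taken from the FeatEHR-LLM repository: the function name `irregular_features` and the feature names are hypothetical. The measurement count and time since the last measurement treat sparsity as a signal, and the time-weighted mean avoids over-weighting bursts of closely spaced readings.

```python
import math


def irregular_features(times, values, window_end):
    """Summarize one irregularly sampled variable for one patient.

    times:      observation times in hours since admission, ascending
    values:     measurements, same length as times
    window_end: end of the observation window in hours
    """
    n = len(times)
    if n == 0:
        # The variable was never measured; the zero count is itself informative.
        return {
            "count": 0,
            "hours_since_last": math.nan,
            "last_value": math.nan,
            "time_weighted_mean": math.nan,
        }

    # Hold each value until the next observation (or the window end),
    # so each reading is weighted by how long it was the latest known value.
    total = 0.0
    duration = 0.0
    for i in range(n):
        end = times[i + 1] if i + 1 < n else window_end
        dt = end - times[i]
        total += values[i] * dt
        duration += dt
    twm = total / duration if duration > 0 else values[-1]

    return {
        "count": n,
        "hours_since_last": window_end - times[-1],
        "last_value": values[-1],
        "time_weighted_mean": twm,
    }


# Three lactate readings at hours 0, 1, and 5 in a 10-hour window.
feats = irregular_features([0.0, 1.0, 5.0], [2.0, 4.0, 3.0], window_end=10.0)
empty = irregular_features([], [], window_end=10.0)
print(feats)
```

In this example the plain mean of the three readings is 3.0, while the time-weighted mean is 3.3, because the reading of 4.0 was the latest known value for four hours and the reading of 2.0 for only one.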
