Quantifying the human visual exposome with vision language models
Christian Rominger, Andreas R. Schwerdtfeger, Malay Gaherwar Singh, Dimitri Khudyakow, Elizabeth A. M. Michels + 3 more
TLDR
This paper introduces a scalable method using VLMs and LLMs to objectively quantify the human visual exposome and its link to mental health.
Key contributions
- Quantified human visual experience by applying Vision Language Models (VLMs) to participant-generated photos (see the sketch after this list).
- VLM-derived "greenness" robustly predicted momentary affect and chronic stress, consistent with established benchmarks.
- Developed an LLM pipeline to extract ~1000 mental health-linked environmental features from 7M+ publications.
- Demonstrated that up to 33% of VLM-extracted context ratings significantly correlated with affect and stress in real-world imagery.
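The summary does not specify which VLM, prompt, or rating scale the authors used, but the core idea of the first contribution can be sketched as asking a vision language model to score the visible greenery in each participant-generated photo. In the sketch below, the OpenAI-style chat API, the `gpt-4o` model name, the prompt wording, and the 0-100 scale are all illustrative assumptions, not the authors' actual setup.

```python
# Illustrative sketch only: the paper's VLM, prompt, and scale are not named here.
# Assumes an OpenAI-compatible chat API and a hypothetical 0-100 "greenness" score.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def rate_greenness(image_path: str) -> float:
    """Ask a VLM to rate visible vegetation ("greenness") in a first-person photo."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model, not necessarily the one used in the study
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Rate the amount of visible greenery in this photo on a scale "
                         "from 0 (none) to 100 (entirely green). Answer with a single number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return float(response.choices[0].message.content.strip())
```

In the study, scores of this kind were then related to momentary affect and chronic stress measures collected through ecological momentary assessment.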
Why it matters
This research establishes a novel, objective, and scalable paradigm for visual exposomics. It moves beyond subjective reports and coarse geospatial proxies, enabling high-throughput decoding of how the visible environment is associated with mental health, which could support a better understanding of mental well-being and more targeted interventions.
Original Abstract
The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self-reports, failing to capture the first-person visual context of daily life. We addressed this gap by coupling ecological momentary assessment with vision language models (VLMs) to quantify the semantic richness of human visual experience. Across 2674 participant-generated photographs, VLM-derived estimates of greenness robustly predicted momentary affect and chronic stress, consistent with established benchmarks. We then developed a semi-autonomous large language model (LLM)-based pipeline that mined over seven million scientific publications to extract nearly 1000 environmental features empirically linked to mental health. When applied to real-world imagery, up to 33 percent of VLM-extracted context ratings significantly correlated with affect and stress. These findings establish a scalable, objective paradigm for visual exposomics, enabling high-throughput decoding of how the visible world is associated with mental health.
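As a rough illustration of the last step described in the abstract (testing which of the nearly 1000 VLM-extracted context ratings relate to affect and stress), the sketch below computes per-feature correlations with a momentary affect score and counts how many survive a false-discovery-rate correction. The file name, column names, and the choice of Spearman correlation with Benjamini-Hochberg correction are assumptions made for illustration; the paper's actual statistical models (for example, how repeated measures within participants are handled) may differ.

```python
# Hypothetical analysis sketch: file, column names, and data are illustrative, not from the study.
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

# One row per photo: VLM-derived context ratings (feat_*) plus a momentary affect score.
df = pd.read_csv("photo_ratings_with_affect.csv")                 # hypothetical file
feature_cols = [c for c in df.columns if c.startswith("feat_")]   # assumed naming scheme

pvals = []
for col in feature_cols:
    rho, p = spearmanr(df[col], df["affect"], nan_policy="omit")
    pvals.append(p)

# Benjamini-Hochberg FDR correction across all environmental features.
significant, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{significant.sum()} of {len(feature_cols)} features "
      f"({100 * significant.mean():.0f}%) significantly correlate with affect")
```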