Discrete Preference Learning for Personalized Multimodal Generation
Yuting Zhang, Ying Sun, Dazhong Shen, Ziwei Xie, Feng Liu, et al.
TLDR
DPPMG learns discrete modal-specific preferences to generate personalized and consistent multimodal content from user interactions.
Key contributions
- Introduces personalized multimodal generation from user interactions.
- Presents DPPMG, a two-stage framework for discrete preference learning.
- Uses a modal-specific GNN to learn and quantize user preferences into discrete tokens (see the sketch after this list).
- Employs a cross-modal consistent and personalized reward for fine-tuning.
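The quantization step in the third contribution can be pictured as a VQ-style nearest-neighbour lookup against a learnable codebook. The PyTorch sketch below is only illustrative: the codebook size, embedding dimension, and class names are assumptions of ours, not the authors' reported implementation.

```python
import torch
import torch.nn as nn

class PreferenceQuantizer(nn.Module):
    """Illustrative sketch: map a continuous modal-specific preference
    embedding (e.g. from the paper's GNN) to a discrete token via
    nearest-neighbour lookup in a learnable codebook (VQ-style)."""

    def __init__(self, num_codes: int = 512, dim: int = 256):  # assumed sizes
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, pref: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # pref: (batch, dim) continuous preference vectors
        dists = torch.cdist(pref, self.codebook.weight)   # (batch, num_codes)
        token_ids = dists.argmin(dim=-1)                   # discrete preference tokens
        quantized = self.codebook(token_ids)               # re-embedded tokens
        # straight-through estimator so gradients still reach the encoder
        quantized = pref + (quantized - pref).detach()
        return token_ids, quantized

# usage: discrete token ids would be injected into downstream generators
quantizer = PreferenceQuantizer()
user_pref = torch.randn(4, 256)      # hypothetical GNN output for 4 users
tokens, quantized_pref = quantizer(user_pref)
```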
Why it matters
Existing personalized generative models lack a dedicated paradigm for accurate preference modeling and produce only unimodal content, even though real-world user interactions are multimodal. This paper addresses both gaps by generating personalized and cross-modally consistent multimodal content, offering a novel approach for tailoring generative AI to complex user needs.
Original Abstract
The emergence of generative models enables the creation of texts and images tailored to users' preferences. Existing personalized generative models have two critical limitations: lacking a dedicated paradigm for accurate preference modeling, and generating unimodal content despite real-world multimodal-driven user interactions. Therefore, we propose personalized multimodal generation, which captures modal-specific preferences via a dedicated preference model from multimodal interactions, and then feeds them into downstream generators for personalized multimodal content. However, this task presents two challenges: (1) Gap between continuous preferences from dedicated modeling and discrete token inputs intrinsic to generator architectures; (2) Potential inconsistency between generated images and texts. To tackle these, we present a two-stage framework called Discrete Preference learning for Personalized Multimodal Generation (DPPMG). In the first stage, to accurately learn discrete modal-specific preferences, we introduce a modal-specific graph neural network (a dedicated preference model) to learn users' modal-specific preferences, which are then quantized into discrete preference tokens. In the second stage, the discrete modal-specific preference tokens are injected into downstream text and image generators. To further enhance cross-modal consistency while preserving personalization, we design a cross-modal consistent and personalized reward to fine-tune token-associated parameters. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model in generating personalized and consistent multimodal content.
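The cross-modal consistent and personalized reward can be thought of as combining an image-text agreement term with a preference-alignment term. The following is a minimal sketch assuming cosine-similarity scoring and a hypothetical trade-off weight `alpha`; the paper's actual reward formulation and weighting may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_personalized_reward(
    img_emb: torch.Tensor,    # embedding of the generated image
    txt_emb: torch.Tensor,    # embedding of the generated text
    pref_emb: torch.Tensor,   # user's (quantized) preference embedding
    alpha: float = 0.5,       # hypothetical consistency/personalization trade-off
) -> torch.Tensor:
    """Illustrative reward: cross-modal consistency (image-text agreement)
    plus personalization (agreement of both modalities with the user's
    preference). Scoring functions here are cosine similarities; this is
    an assumed stand-in, not DPPMG's exact reward."""
    img_emb, txt_emb, pref_emb = (
        F.normalize(e, dim=-1) for e in (img_emb, txt_emb, pref_emb)
    )
    consistency = (img_emb * txt_emb).sum(-1)
    personalization = 0.5 * (
        (img_emb * pref_emb).sum(-1) + (txt_emb * pref_emb).sum(-1)
    )
    return alpha * consistency + (1 - alpha) * personalization
```

Per the abstract, a reward of this kind is used in the second stage to fine-tune the token-associated parameters of the text and image generators.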