LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Xinfeng Zhang2,
Zeng Zhao2,
Bai Liu2,
Changjie Fan2,
Zhipeng Hu2
1 Zhejiang University, 2 Fuxi AI Lab, NetEase Inc.
Teaser: Image generation comparison using short and dense prompts across SDXL, Playground v2, PixArt-alpha, and our proposed LLM4GEN_SDXL. Colored text denotes critical entities or attributes.

Abstract

Diffusion models have achieved substantial success in text-to-image generation. However, they often struggle with complex and dense prompts that involve multiple objects, attribute binding, and long descriptions. This paper proposes a framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models by leveraging the semantic representations of Large Language Models (LLMs). Through a specially designed Cross-Adapter Module (CAM) that combines the original text features of text-to-image models with LLM features, LLM4GEN can be incorporated into various diffusion models as a plug-and-play component that enhances text-to-image generation. Additionally, to facilitate semantic understanding of complex and dense prompts, we develop a LAION-refined dataset consisting of 1 million text-image pairs with improved image descriptions. We also introduce DensePrompts, which contains 7,000 dense prompts, to provide a comprehensive evaluation for the text-to-image generation task. With just 10% of the training data required by the recent ELLA, LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, increasing the color score on T2I-CompBench by 7.69% and 9.60%, respectively. Extensive experiments on DensePrompts also demonstrate that LLM4GEN surpasses existing state-of-the-art models in terms of sample quality, image-text alignment, and human evaluation. The model and dataset will be released to the community later.

Method

The proposed LLM4GEN, which consists of a Cross-Adapter Module (CAM) and the UNet, is illustrated in Fig. 1(a). In this paper, we use Stable Diffusion as the base text-to-image diffusion model, with the vanilla text encoder taken from CLIP. LLM4GEN leverages the strong capability of LLMs to assist text-to-image generation. The CAM extracts the representation of a given prompt by combining the outputs of the LLM and the CLIP text encoder, so that the fused text embedding is enriched with the pre-trained knowledge of the LLM through a simple yet effective module. Conditioned on the fused text embedding, LLM4GEN iteratively denoises the latent vectors with the UNet and decodes the final latent into an image with the VAE.
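The fusion step described above can be illustrated with a small, dependency-free sketch. The text here does not specify how the CAM combines the two feature streams, so the code below makes an assumption: CLIP token features act as queries in a single-head cross-attention over LLM token features, and the attended result is added back to the CLIP features as a residual. The function names (`cross_adapter`, `random_matrix`), the projection matrices, and all dimensions are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper: a rows x cols matrix of small random values."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def cross_adapter(clip_feats, llm_feats, w_q, w_k, w_v):
    """Fuse CLIP and LLM token features (assumed design, not the paper's exact CAM).

    clip_feats: [n_clip][d_clip]   queries come from the CLIP text encoder
    llm_feats:  [n_llm][d_llm]     keys and values come from the LLM
    w_q: [d_clip][d_attn], w_k: [d_llm][d_attn], w_v: [d_llm][d_clip]
    Returns (fused, weights): fused has the same shape as clip_feats.
    """
    q = matmul(clip_feats, w_q)
    k = matmul(llm_feats, w_k)
    v = matmul(llm_feats, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys to [d_attn][n_llm]
    scores = matmul(q, k_t)               # [n_clip][n_llm]
    weights = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(weights, v)         # [n_clip][d_clip]
    # Residual connection: the CLIP embedding is kept and enriched, not replaced.
    fused = [[c + a for c, a in zip(c_row, a_row)]
             for c_row, a_row in zip(clip_feats, attended)]
    return fused, weights


# Toy dimensions (assumptions): 4 CLIP tokens of width 8, 6 LLM tokens of width 16.
rng = random.Random(0)
n_clip, d_clip, n_llm, d_llm, d_attn = 4, 8, 6, 16, 8
clip_feats = random_matrix(n_clip, d_clip, rng, scale=1.0)
llm_feats = random_matrix(n_llm, d_llm, rng, scale=1.0)
w_q = random_matrix(d_clip, d_attn, rng)
w_k = random_matrix(d_llm, d_attn, rng)
w_v = random_matrix(d_llm, d_clip, rng)

fused, weights = cross_adapter(clip_feats, llm_feats, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # 4 8: same shape as the CLIP features
```

Because the fused embedding keeps the shape of the CLIP embedding in this sketch, it could be passed to a UNet's cross-attention layers without changing their input width, which is one plausible reading of the plug-and-play claim. With a zero value projection the output reduces to the original CLIP features, so the base model's behavior is a special case.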


Dataset Construction


Experimental Results on the MS-COCO Benchmark and User Study


More Visualization Results
