RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment

Zutao Jiang*, Guian Fang*, Jianhua Han, Guansong Lu,
Hang Xu, Shengcai Liao, Xiaojun Chang, Xiaodan Liang
* These two authors contribute equally to this work.  Xiaodan Liang is the corresponding author. Zutao Jiang is with Pengcheng Laboratory, Shenzhen, China. (E-mail:taozujiang@gmail.com). Guian Fang and Xiaodan Liang are with Sun Yat-sen University. (E-mail:{ fanggan@mail2.sysu.edu.cn, xdliang328@gmail.com}). Jianhua Han, Guansong Lu and Hang Xu are with Huawei Noah’s Ark Lab. (E-mail:{hanjianhua4@huawei.com , luguansong@huawei.com, xu.hang@huawei.com}). Xiaojun Chang is with the Australian Artificial Intelligence Institute, University of Technology Sydney. (E-mail:xiaojun.chang@uts.edu.au) Shengcai Liao is with the College of Information Technology (CIT), United Arab Emirates University (UAEU), the United Arab Emirates.
(E-mail:scliao@ieee.org).
Abstract

Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from textual descriptions. However, these approaches have faced challenges in precisely aligning the generated visual content with the textual concepts described in the prompts. In this paper, we propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff, aimed at improving the alignment between text and images in text-to-image diffusion models. In the coarse semantic re-alignment phase, a novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt. Subsequently, the fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view. Experimental results on the MS-COCO and ViLG-300 datasets demonstrate that the proposed two-stage coarse-to-fine semantic re-alignment method outperforms other baseline re-alignment techniques by a substantial margin in both visual quality and semantic similarity with the input prompt.

Index Terms:
Text-to-Image Generation, Diffusion Model, Fine Semantic Re-alignment

I Introduction

Text-to-image diffusion models [1, 2, 3, 4] have witnessed significant advancements in recent years. These models can generate high-quality and diverse images based on the given input texts. The ability to convert textual descriptions into realistic images has enormous potential in various applications such as graphic design, computer vision, and creative writing. Despite several text-to-image diffusion models have been deployed in real-world applications such as Imagen [1], DALL-E 2 [2], Stable Diffusion [3], Midjourney111https://www.midjourney.com/, and Versatile Diffusion [4], the generated images from these models are not perfect [5] as displayed in Figure 1. The main challenge faced by existing text-to-image diffusion models is achieving precise alignment between the generated image and the input caption. Specifically, these models often encounter difficulties in accurately capturing the attributes and relationships of the objects described in the input text.

Refer to caption
Figure 1: Visual comparison of generated images from various text-to-image diffusion models (ImageReward [5], Structure Diffusion [6], Stable Diffusion XL [3], PixArt-α𝛼\alphaitalic_α [7] and MidJourney1). The motivation behind our proposed RealignDiff is to address the misalignments and semantic discrepancies observed in prior methods. From top to bottom: Missing Main Objects (e.g., the green traffic light in the first row is absent); Attribute Misalignment (e.g., the second row fails to paint red on the top of the yellow vase); Attribute Interchange (e.g., the third row, intended to be black and white, is not in monochrome, with the notable absence of the black toilet seat as evidence of the mix-up). RealignDiff endeavors to fix these inconsistencies, ensuring images that are more aligned with the provided textual prompts.

To tackle the issue of semantic misalignment in text-to-image diffusion models, some researchers have introduced pre-trained image-text models, such as CLIP [8] and BLIP [9], to calculate the semantic guidance [10, 11]. Recently, ImageReward [5] has also been proposed to solve both the text-image alignment and human aesthetic problems, which is trained using a comprehensive collection and annotation pipeline that leverages expert preference data. However, as shown in Figure 1, while ImageReward is capable of aligning images and text at a coarse level, it tends to overlook fine-grained alignment which is crucial for text-matched image generation, such as precisely aligning attributes, quantities, and relationships between objects described in the given text. To solve the fine-grained semantic alignment problem, Structure Diffusion [6] incorporates language structures into the cross-attention layers. While Structure Diffusion can address the attribute binding problem, it tends to miss the main objects described in the input prompt.

In this paper, we present a novel two-stage coarse-to-fine semantic re-alignment method, called RealignDiff, to generate images that more accurately align with user-provided textual descriptions within text-to-image diffusion models. During the coarse semantic re-alignment stage, we propose a novel caption reward to optimize the text-to-image diffusion model from a global semantic view. Specifically, the caption reward generates a corresponding detailed caption that depicts all crucial contents in the synthetic image via a BLIP-2 model and then calculates the reward score by measuring the similarity between the generated caption and the given prompt. The elaborated caption can give more guidance about whether the surrounding concepts and context in the image are reasonable given the input text prompt. It is noteworthy that only the coarse semantic re-alignment stage may not be sufficient to capture all the desired characteristics of the generated images, especially in cases where the input texts describe complex and diverse scenes. To sense the correctness of local semantic parts, we further propose the fine semantic re-alignment. In the fine semantic re-alignment stage, we present a local dense caption generation module and a re-weighting attention modulation module from the local semantic view to refine the previously generated images. The local dense caption generation module generates the mask, detailed caption, and the corresponding likelihood score of each object appearing in the generated images. Armed with the generated detailed captions and the corresponding scores, the re-weighting attention modulation module can re-align the generated captions and the segmented parts of the generated images.

Experimental results on the MS-COCO and ViLG-300 datasets demonstrate that the proposed two-stage coarse-to-fine semantic re-alignment method outperforms other baseline re-alignment techniques by a substantial margin in both visual quality and semantic similarity with the input prompt. Our approach opens up new avenues for research in this exciting field by providing a more accurate and precise alignment mechanism that can better capture the semantic meaning of the input text and generate high-quality images. Our main contributions are summarized as follows:

  • We propose a two-stage coarse-to-fine semantic re-alignment method for text-to-image diffusion models. The coarse semantic re-alignment stage ensures that the objects described in the given text appear in the generated images. The fine semantic re-alignment stage accurately captures the attributes and relationships of the objects in the input text.

  • We propose a novel caption reward and a novel local dense caption generation module. The caption reward measures the similarity between the generated caption and the given text prompt. The local dense caption generation module provides guidance regarding the attributes and spatial arrangements of objects.

  • Experimental results on the MS-COCO [12] and ViLG-300 [13] datasets demonstrate that RealignDiff can better align the semantics of the generated image from the text-to-image diffusion model with the given text prompt, achieving the best performance compared to other baseline methods.

II Related Work

II-A Text-to-Image Generation.

Text-to-image generation aims to generate images given input text descriptions. Along with the progress on generative models, including generative adversarial networks (GANs  [14]), auto-regressive model  [15] and diffusion model [16], there are numbers of works for text-to-image generation. Among them, GANs are first adopted for text-to-image generation [17] and later many GAN-based models are proposed for better visual fidelity and caption similarity [18, 19, 20, 21, 22, 23, 24, 25, 26]. However, GANs suffer from the well-known problem of mode-collapse and unstable training processes. To solve these problems, another line of works explore applying Transformer-based auto-regressive model for text-to-image generation [27, 28, 29, 30, 31, 32, 33] with a discrete VAE [34, 35, 36] model for tokenizing the input images and a Transformer [15] model for fitting the joint distribution of text tokens and image tokens. Recent works adopt diffusion model for text-to-image generation [37, 38, 2, 1, 3, 4], which learns to predict the added noise of noised images and generates images from pure noise by iteratively predict added noise and remove it. Among them, in order to reduce the computational overhead of large-scale text-to-image generation models, Stable Diffusion  [3] proposed to first encode the input images as low-dimension latent codes and then adopt a diffusion model to generate these latent codes conditioned on the input texts. Although significant progress in high-quality text-to-image generation has been achieved, problems including misalignment with human preference and misalignment with input texts still remain to be solved.

II-B Alignment of Text-to-Image Generation Models.

Some works [39, 40, 41, 5, 42] are proposed to align a text-to-image generation model with human preference and aesthetic quality. [40] first learn a reward model with the human feedback assessing model outputs and then finetune a text-to-image model by maximizing reward-weighted likelihood to improve image-text alignment. Similarly, [41] took the human aesthetic preference into account and proposed to learn a human preference reward model. ImageReward [5] proposed a general-purpose text-to-image human preference reward model, covering text-image alignment, body problems, human aesthetics, toxicity, and biases. Promptist [39] proposed prompt adaptation, i.e., training a language model to generate a better prompt given the origin prompt. They utilize the CLIP model and aesthetic predictor model as the reward model and perform supervised fine-tuning under the reinforcement learning paradigm. On the other hand, to circumvent the problems of inefficiencies and instabilities of Reinforcement Learning from Human Feedback (RLHF [43]), [42] introduce Reward ranked Fine-Tuning (RAFT) to align generative models more effectively. However, RAFT is prone to overfitting as the number of iterations increases. More recently, Xu et al. [5] have developed reward feedback learning (ReFL) to optimize text-to-image diffusion models against a reward function, which has demonstrated its effectiveness in achieving better alignment. However, the existing reward models do not take both coarse-grained and fine-grained image-text semantic alignment into account. In this paper, we propose RealignDiff to improve the alignment between text and images in text-to-image diffusion models from the global and local views.

III Method

Refer to caption
Figure 2: The framework of our RealignDiff approach. (a) Coarse Semantic Re-alignment enables the objects described in the given text to appear in the generated images. (b) Fine Semantic Re-alignment accurately captures the attributes and relationships of the objects. (c) Caption Reward measures the similarity between the generated caption and the given prompt. (d) The local dense caption generation module provides guidance regarding the attributes and spatial arrangements of objects within the fine semantic re-alignment stage.

Figure 2 shows the pipeline of our RealignDiff approach for boosting the text-to-image diffusion models. In this section, we first introduce the preliminary knowledge of the text-to-image diffusion model. Then we present the coarse semantic re-alignment method, including the caption reward and reward feedback learning framework. Finally, we introduce the fine semantic re-alignment method, including the local dense caption generation and the re-weighting attention modulation modules.

III-A Preliminary: Text-to-image diffusion model

Given an image sampled from the real image distribution x0q(x0)similar-tosubscript𝑥0𝑞subscript𝑥0x_{0}\sim q(x_{0})italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_q ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ), diffusion models first produce a Markov chain of latent variables x1,,xTsubscript𝑥1subscript𝑥𝑇x_{1},...,x_{T}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT by progressively adding Gaussian noise to the image according to some variance schedule given by βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as follows:

q(xtxt1)=𝒩(1βtxt1,βtI),𝑞conditionalsubscript𝑥𝑡subscript𝑥𝑡1𝒩1subscript𝛽𝑡subscript𝑥𝑡1subscript𝛽𝑡𝐼q\left(x_{t}\mid x_{t-1}\right)=\mathcal{N}\left(\sqrt{1-\beta_{t}}x_{t-1},% \beta_{t}I\right),italic_q ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) = caligraphic_N ( square-root start_ARG 1 - italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_I ) , (1)

and then learn a model to approximate the true posterior:

pθ(xt1xt)=𝒩(μθ(xt,t),Σθ(xt,t)),subscript𝑝𝜃conditionalsubscript𝑥𝑡1subscript𝑥𝑡𝒩subscript𝜇𝜃subscript𝑥𝑡𝑡subscriptΣ𝜃subscript𝑥𝑡𝑡p_{\theta}\left(x_{t-1}\mid x_{t}\right)=\mathcal{N}\left(\mu_{\theta}\left(x_% {t},t\right),\Sigma_{\theta}\left(x_{t},t\right)\right),italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ∣ italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = caligraphic_N ( italic_μ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) , roman_Σ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ) , (2)

to perform the reverse denoising process for image generation: starting from a random noise xT𝒩(0,I)similar-tosubscript𝑥𝑇𝒩0𝐼x_{T}\sim\mathcal{N}(0,I)italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_I ) and gradually reducing the noise to finally get a real image x0subscript𝑥0x_{0}italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. While a tractable variational lower-bound VLBsubscript𝑉𝐿𝐵\mathcal{L}_{VLB}caligraphic_L start_POSTSUBSCRIPT italic_V italic_L italic_B end_POSTSUBSCRIPT on logpθ(x0)subscript𝑝𝜃subscript𝑥0\log p_{\theta}(x_{0})roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) can be used to optimize μθsubscript𝜇𝜃\mu_{\theta}italic_μ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT and ΣθsubscriptΣ𝜃\Sigma_{\theta}roman_Σ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT, to achieve better results, [16] instead adopt a denoising network ϵθ(xt,t)subscriptitalic-ϵ𝜃subscript𝑥𝑡𝑡\epsilon_{\theta}(x_{t},t)italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) which predicts the added noise of a noisy image xtq(xt|x0)similar-tosubscript𝑥𝑡𝑞conditionalsubscript𝑥𝑡subscript𝑥0x_{t}\sim q(x_{t}|x_{0})italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ italic_q ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and adopts the following training objective:

=𝔼x0q(x0),ϵ𝒩(0,I),t[1,T]ϵϵθ(xt,t)2,subscript𝔼formulae-sequencesimilar-tosubscript𝑥0𝑞subscript𝑥0formulae-sequencesimilar-toitalic-ϵ𝒩0𝐼similar-to𝑡1𝑇superscriptnormitalic-ϵsubscriptitalic-ϵ𝜃subscript𝑥𝑡𝑡2\mathcal{L}=\mathbb{E}_{x_{0}\sim q\left(x_{0}\right),\epsilon\sim\mathcal{N}(% 0,I),t\sim[1,T]}\left\|\epsilon-\epsilon_{\theta}\left(x_{t},t\right)\right\|^% {2},caligraphic_L = blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_q ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) , italic_ϵ ∼ caligraphic_N ( 0 , italic_I ) , italic_t ∼ [ 1 , italic_T ] end_POSTSUBSCRIPT ∥ italic_ϵ - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (3)

where t𝑡titalic_t is uniformly sampled from {1,,T}1𝑇\{1,...,T\}{ 1 , … , italic_T }. For a text-to-image generation, the denoising network receives the input text tpsubscript𝑡𝑝t_{p}italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT as extra conditional input and is denoted as ϵθ(xt,tp,t)subscriptitalic-ϵ𝜃subscript𝑥𝑡subscript𝑡𝑝𝑡\epsilon_{\theta}(x_{t},t_{p},t)italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_t ).

We adopt Stable Diffusion [3] as our baseline text-to-image generation model. In this model, a real image is first down-sampled 8 times as a lower-dimension latent code x0subscript𝑥0x_{0}italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT with an autoencoder model and the denoising network ϵθ(xt,tp,t)subscriptitalic-ϵ𝜃subscript𝑥𝑡subscript𝑡𝑝𝑡\epsilon_{\theta}(x_{t},t_{p},t)italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_t ) is parameterized as a Unet [44] network, where embedding of time step t𝑡titalic_t is injected with adaptive normalization layers and embedding of input text tpsubscript𝑡𝑝t_{p}italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is injected with cross-attention layers. However, the Stable Diffusion model fails to perform precise alignment between the text concept and generated images since it’s trained only with the global alignment between the text and images.

III-B Coarse-to-fine Semantic Re-alignment 

In this subsection, we first introduce the coarse semantic re-alignment method and then present the fine semantic re-alignment method.

III-B1 Coarse Semantic Re-alignment.

To ensure the objects described in the given text appear in the generated image, we propose the coarse semantic re-alignment method, including the caption reward and the reward feedback learning framework.

Caption Reward. The caption reward is proposed to improve the consistency between the synthetic caption of the generated image and the given text prompt. Specifically, given an image generated by a text-to-image diffusion model, we first obtain the corresponding caption tgsubscript𝑡𝑔t_{g}italic_t start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT using the pre-trained Blip-2 model. Then we compute the similarity between the embeddings of generated caption tgsubscript𝑡𝑔t_{g}italic_t start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT and the corresponding text prompt tpsubscript𝑡𝑝t_{p}italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT as our caption reward score. Note that we utilize a pre-trained BLIP-2 [45] text encoder fenc()subscriptf𝑒𝑛𝑐{{\mathop{\rm f}\nolimits}_{enc}}(\cdot)roman_f start_POSTSUBSCRIPT italic_e italic_n italic_c end_POSTSUBSCRIPT ( ⋅ ) to convert the captions into the text embeddings. Formally, the caption reward score 𝐑capsubscript𝐑𝑐𝑎𝑝{\bf{R}}_{cap}bold_R start_POSTSUBSCRIPT italic_c italic_a italic_p end_POSTSUBSCRIPT can be calculated as follows:

𝐑cap=fenc(tg)fenc(tp)fenc(tg)fenc(tp).subscript𝐑𝑐𝑎𝑝subscriptf𝑒𝑛𝑐subscript𝑡𝑔subscriptf𝑒𝑛𝑐subscript𝑡𝑝normsubscriptf𝑒𝑛𝑐subscript𝑡𝑔normsubscriptf𝑒𝑛𝑐subscript𝑡𝑝{{\bf{R}}_{cap}}=\frac{{{{\mathop{\rm f}\nolimits}_{enc}}({t_{g}})\cdot{{% \mathop{\rm f}\nolimits}_{enc}}({t_{p}})}}{{\left\|{{{\mathop{\rm f}\nolimits}% _{enc}}({t_{g}})}\right\|\left\|{{{\mathop{\rm f}\nolimits}_{enc}}({t_{p}})}% \right\|}}.bold_R start_POSTSUBSCRIPT italic_c italic_a italic_p end_POSTSUBSCRIPT = divide start_ARG roman_f start_POSTSUBSCRIPT italic_e italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) ⋅ roman_f start_POSTSUBSCRIPT italic_e italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) end_ARG start_ARG ∥ roman_f start_POSTSUBSCRIPT italic_e italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) ∥ ∥ roman_f start_POSTSUBSCRIPT italic_e italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) ∥ end_ARG . (4)

Note that while the caption reward can effectively promote consistency between the generated captions and input prompts, it may not capture all the desired characteristics of the generated images, especially in cases where the input texts describe complex and diverse scenes.

Reward Feedback Learning. Reward Feedback Learning (ReFL) is designed to optimize text-to-image diffusion models by leveraging a reward function. Within this framework, a caption reward is incorporated to enable coarse semantic re-alignment. ReFL facilitates the direct optimization of text-to-image diffusion models by back-propagating gradients to a randomly selected intermediate step t𝑡titalic_t during the denoising process. The rationale behind this random selection of t𝑡titalic_t is significant: solely retaining the gradient information from the last denoising step leads to pronounced training instability and suboptimal results. Instead of progressively reducing noise to generate an image x0subscript𝑥0{x_{0}}italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT from an intermediate state xtsubscript𝑥𝑡{x_{t}}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT via a sequential transformation process xtxt1x0subscript𝑥𝑡subscript𝑥𝑡1subscript𝑥0{x_{t}}\to{x_{t-1}}\to\ldots\to{x_{0}}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT → italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT → … → italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, ReFL employs an alternative approach. It directly predicts x0superscriptsubscript𝑥0{x_{0}}^{\prime}italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT from xtsubscript𝑥𝑡{x_{t}}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT using the transformation xtx0subscript𝑥𝑡superscriptsubscript𝑥0{x_{t}}\to{x_{0}}^{\prime}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT → italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT during the fine-tuning of text-to-image diffusion models. This method is grounded in the insightful observation that the caption reward scores for generations x0superscriptsubscript𝑥0{x_{0}}^{\prime}italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT after a sufficient number of denoising steps (typically, t30𝑡30t\geq 30italic_t ≥ 30), provide effective feedback for improving model performance.

To address the challenges of rapid overfitting and to enhance stability during fine-tuning, a re-weighting scheme is applied to the ReFL loss, along with regularization using the pre-training loss. The overall loss function is defined as:

total=λreward+pre=λϕ(𝐑𝐜𝐚𝐩(tp,gθ(tp)))+𝔼x0q(x0),ϵ𝒩(0,I),t[1,T]ϵϵθ(xt,tp,t)2,subscripttotal𝜆subscriptrewardsubscriptpreabsentlimit-from𝜆italic-ϕsubscript𝐑𝐜𝐚𝐩subscript𝑡𝑝subscript𝑔𝜃subscript𝑡𝑝subscript𝔼formulae-sequencesimilar-tosubscript𝑥0𝑞subscript𝑥0formulae-sequencesimilar-toitalic-ϵ𝒩0𝐼similar-to𝑡1𝑇superscriptnormitalic-ϵsubscriptitalic-ϵ𝜃subscript𝑥𝑡subscript𝑡𝑝𝑡2\begin{array}[]{l}\mathcal{L}_{\text{total}}=\lambda\mathcal{L}_{\text{reward}% }+\mathcal{L}_{\text{pre}}\\ =\lambda\phi({\bf{R}_{cap}}({t_{p}},{g_{\theta}}({t_{p}})))+\\ \mathbb{E}_{x_{0}\sim q\left(x_{0}\right),\epsilon\sim\mathcal{N}(0,I),t\sim[1% ,T]}\left\|\epsilon-\epsilon_{\theta}\left(x_{t},t_{p},t\right)\right\|^{2},% \end{array}start_ARRAY start_ROW start_CELL caligraphic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT = italic_λ caligraphic_L start_POSTSUBSCRIPT reward end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT pre end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL = italic_λ italic_ϕ ( bold_R start_POSTSUBSCRIPT bold_cap end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) ) ) + end_CELL end_ROW start_ROW start_CELL blackboard_E start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_q ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) , italic_ϵ ∼ caligraphic_N ( 0 , italic_I ) , italic_t ∼ [ 1 , italic_T ] end_POSTSUBSCRIPT ∥ italic_ϵ - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_t ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , end_CELL end_ROW end_ARRAY (5)

where λ𝜆\lambdaitalic_λ is a weighting factor, ϕitalic-ϕ\phiitalic_ϕ is the ReLU operation, 𝐑𝐜𝐚𝐩subscript𝐑𝐜𝐚𝐩\bf{R}_{cap}bold_R start_POSTSUBSCRIPT bold_cap end_POSTSUBSCRIPT denotes the caption reward score, θ𝜃\thetaitalic_θ represents the parameters of the text-to-image diffusion models, while gθ(tp)subscript𝑔𝜃subscript𝑡𝑝{g_{\theta}}({t_{p}})italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) denotes the generated image produced by the text-to-image diffusion models with parameters θ𝜃\thetaitalic_θ, corresponding to the text prompt tpsubscript𝑡𝑝t_{p}italic_t start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT. This formulation underscores the essential role of the ReFL loss in optimizing the model’s performance with respect to semantic alignment.

Refer to caption
Figure 3: Input Image is aligned at a coarse level, focusing on objects. Output Image 1 illustrates the RAM process, also generating phases for text input. Text outputs 1 and 2 provide essential parameters (aligned attributes, weighted granularity) for the final generation, leading to output image 2 through fine-grained alignment.

III-B2 Fine Semantic Re-alignment

In this subsection, we present a training-free method for achieving fine-grained semantic re-alignment. Our objective is to accurately capture the attributes and relationships of the objects described in the input text. This method encompasses two key components: the local dense caption and re-weighting attention modulation.

Local Dense Caption Generation. The local dense caption generation module is designed to concentrate on the specific details within the generated images and assess their alignment with the provided text descriptions from a local perspective. This method fundamentally tackles two crucial objectives: 1) Ascertaining whether the objects depicted in the generated image are consistent with the textual descriptions. 2) providing comprehensive and precise captions for the objects depicted within the generated image.

Specifically, the local dense caption generation module first recognizes the objects in the previously generated images using an off-the-shelf image tagging model, i.e., Recognize Anything Model (RAM) [46]. Subsequently, given the provided prompt and the image tags, the large language model, i.e., GPT-4 [47], is utilized to assess the likelihood score {si}i=1nsuperscriptsubscriptsubscript𝑠𝑖𝑖1𝑛\{{s_{i}}\}_{i=1}^{n}{ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and provide the local detailed descriptions {li}i=1nsuperscriptsubscriptsubscript𝑙𝑖𝑖1𝑛\{{l_{i}}\}_{i=1}^{n}{ italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT for each recognized object. The score of each object is assigned based on the likelihood of each category label of the object appearing in the scene, which can be summarized as follows:

si={2,c is certain to appear in the scene.0.5,c may appear in the scene.0,c is unlikely to appear in the scene.subscript𝑠𝑖cases2,missing-subexpressionc is certain to appear in the scene.0.5,missing-subexpressionc may appear in the scene.0,missing-subexpressionc is unlikely to appear in the scene.s_{i}=\left\{\begin{array}[]{rcl}2\text{,}&&\text{$c$ is certain to appear in % the scene.}\\ 0.5\text{,}&&\text{$c$ may appear in the scene.}\\ 0\text{,}&&\text{$c$ is unlikely to appear in the scene.}\\ \end{array}\right.italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { start_ARRAY start_ROW start_CELL 2 , end_CELL start_CELL end_CELL start_CELL italic_c is certain to appear in the scene. end_CELL end_ROW start_ROW start_CELL 0.5 , end_CELL start_CELL end_CELL start_CELL italic_c may appear in the scene. end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL end_CELL start_CELL italic_c is unlikely to appear in the scene. end_CELL end_ROW end_ARRAY (6)

where c𝑐citalic_c denotes the category label of the object.

Consider the text prompt ‘a red book and a yellow pen’ as an example. We first utilize a fine-tuned text-to-image diffusion model to generate the image. Subsequently, RAM is employed to identify the objects within this image. If the object tag is ‘book’ or ‘pen’, GPT-4 determines that these objects are certain to appear in the scene, assigning a score of 2. Conversely, if the object tag is ‘sign’ or ‘banana’, GPT-4 deems these objects unlikely to appear, thus assigning a score of 0. For an object tag like ‘desk’, which GPT-4 considers as possibly appearing in the scene, a score of 0.5 is assigned. The example of using GPT-4 is shown in Figure 3. Our extensive experiments have demonstrated GPT-4’s proficiency in accurately performing such tasks.

By assigning scores in this way, we can obtain a likelihood score {si}i=1nsuperscriptsubscriptsubscript𝑠𝑖𝑖1𝑛\{{s_{i}}\}_{i=1}^{n}{ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT for each recognized object. After obtaining the local dense caption and the corresponding likelihood score, we use the off-the-shelf segmentation model, i.e., Grounded Semantic Segmentation anything (Grounded-SAM)[48] to obtain the object masks {mi}i=1nsuperscriptsubscriptsubscript𝑚𝑖𝑖1𝑛\{{m_{i}}\}_{i=1}^{n}{ italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. The local dense caption and the object mask can provide guidance regarding the attributes and layout of objects with the re-weighting attention modulation method.

Re-weighting Attention Modulation. The re-weighting attention strategy plays a crucial role in controlling how specific tokens influence the resulting image. By adjusting the influence based on reward scores, more relevant semantic cues can dominate the image generation process, improving the semantic alignment between the text and the generated image. Building on this concept, we propose a re-weighting attention modulation module. Given a set of detailed local caption {li}i=1nsuperscriptsubscriptsubscript𝑙𝑖𝑖1𝑛\{{l_{i}}\}_{i=1}^{n}{ italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and corresponding object masks {mi}i=1nsuperscriptsubscriptsubscript𝑚𝑖𝑖1𝑛\{{m_{i}}\}_{i=1}^{n}{ italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, the module ensures that objects appear in the correct regions based on their likelihood scores {si}i=1nsuperscriptsubscriptsubscript𝑠𝑖𝑖1𝑛\{{s_{i}}\}_{i=1}^{n}{ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT.

Specifically, the original attention maps A|queries|×|keys|A\in\mathbb{R}^{{\lvert queries\lvert}\times{\lvert keys\lvert}}italic_A ∈ blackboard_R start_POSTSUPERSCRIPT | italic_q italic_u italic_e italic_r italic_i italic_e italic_s | × | italic_k italic_e italic_y italic_s | end_POSTSUPERSCRIPT is defined as below:

A=softmax(QKTd),𝐴softmax𝑄superscript𝐾𝑇𝑑A=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right),italic_A = softmax ( divide start_ARG italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) , (7)

where Q𝑄Qitalic_Q represents the query value, which is mapped from image features, while K𝐾Kitalic_K denotes the key value, derived from text features. The term d𝑑ditalic_d is the length of the key and query features. Based on the attention map A𝐴Aitalic_A, the image features can be updated, referencing the text features.

Our re-weighting attention modulation method modulates the original attention maps A𝐴Aitalic_A as follows:

A=softmax(QKT+SMd),M=λtRMpos(1B)λt(1R)Mneg(1B),superscript𝐴softmax𝑄superscript𝐾𝑇direct-product𝑆𝑀𝑑𝑀direct-productsubscript𝜆𝑡𝑅subscript𝑀pos1𝐵direct-productsubscript𝜆𝑡1𝑅subscript𝑀neg1𝐵\begin{array}[]{l}A^{\prime}=\text{softmax}\left(\frac{QK^{T}+S\odot M}{\sqrt{% d}}\right),\\ M=\lambda_{t}\cdot R\odot M_{\text{pos}}\odot(1-B)\\ -\lambda_{t}\cdot(1-R)\odot M_{\text{neg}}\odot(1-B),\end{array}start_ARRAY start_ROW start_CELL italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = softmax ( divide start_ARG italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_S ⊙ italic_M end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) , end_CELL end_ROW start_ROW start_CELL italic_M = italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⋅ italic_R ⊙ italic_M start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT ⊙ ( 1 - italic_B ) end_CELL end_ROW start_ROW start_CELL - italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⋅ ( 1 - italic_R ) ⊙ italic_M start_POSTSUBSCRIPT neg end_POSTSUBSCRIPT ⊙ ( 1 - italic_B ) , end_CELL end_ROW end_ARRAY (8)

where direct-product\odot denotes the Hadamard product. λtsubscript𝜆𝑡\lambda_{t}italic_λ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is a scalar, proportional to the timestep t𝑡titalic_t, to adjust the degree of modulation. S𝑆Sitalic_S represents the re-weighting score matrix, which can be obtained through the likelihood score {si}i=1nsuperscriptsubscriptsubscript𝑠𝑖𝑖1𝑛\{{s_{i}}\}_{i=1}^{n}{ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. The score matrix Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for each i𝑖iitalic_i is constructed based on the elements of object masks {mi}i=1nsuperscriptsubscriptsubscript𝑚𝑖𝑖1𝑛\{{m_{i}}\}_{i=1}^{n}{ italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and corresponding likelihood scores {si}i=1nsuperscriptsubscriptsubscript𝑠𝑖𝑖1𝑛\{{s_{i}}\}_{i=1}^{n}{ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. The rule for constructing Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is as follows:

Si=[Sijk]whereSijk={siif mijk>0,1if mijk0,formulae-sequencesubscript𝑆𝑖delimited-[]subscript𝑆𝑖𝑗𝑘wheresubscript𝑆𝑖𝑗𝑘casessubscript𝑠𝑖if subscript𝑚𝑖𝑗𝑘01if subscript𝑚𝑖𝑗𝑘0S_{i}=\left[S_{ijk}\right]\quad\text{where}\quad S_{ijk}=\begin{cases}s_{i}&% \text{if }m_{ijk}>0,\\ 1&\text{if }m_{ijk}\leq 0,\end{cases}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = [ italic_S start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT ] where italic_S start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT = { start_ROW start_CELL italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_CELL start_CELL if italic_m start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT > 0 , end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL if italic_m start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT ≤ 0 , end_CELL end_ROW (9)

where j𝑗jitalic_j and k𝑘kitalic_k represent the row and column indices in Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT respectively. This formulation applies to each matrix Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in the set {Si}i=1nsuperscriptsubscriptsubscript𝑆𝑖𝑖1𝑛\{S_{i}\}_{i=1}^{n}{ italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. We use Grounded-SAM to obtain the object masks {mi}i=1nsuperscriptsubscriptsubscript𝑚𝑖𝑖1𝑛\{{m_{i}}\}_{i=1}^{n}{ italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. R𝑅Ritalic_R is a boolean mask vector where each element corresponds to a token in the text features. Ri>0subscript𝑅𝑖0R_{i}>0italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > 0 indicates that the text token is activated at position i𝑖iitalic_i. B=QKT𝐵𝑄superscript𝐾𝑇B=QK^{T}italic_B = italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT denotes the similarity score between the query and key. Mpossubscript𝑀posM_{\text{pos}}italic_M start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT and Mnegsubscript𝑀negM_{\text{neg}}italic_M start_POSTSUBSCRIPT neg end_POSTSUBSCRIPT can be calculated as:

Mpos=max(QKT)QKT,Mneg=QKTmin(QKT),subscript𝑀pos𝑄superscript𝐾𝑇𝑄superscript𝐾𝑇subscript𝑀neg𝑄superscript𝐾𝑇𝑄superscript𝐾𝑇\begin{array}[]{l}M_{\text{pos}}=\max(QK^{T})-QK^{T},\\ M_{\text{neg}}=QK^{T}-\min(QK^{T}),\end{array}start_ARRAY start_ROW start_CELL italic_M start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT = roman_max ( italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) - italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , end_CELL end_ROW start_ROW start_CELL italic_M start_POSTSUBSCRIPT neg end_POSTSUBSCRIPT = italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - roman_min ( italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) , end_CELL end_ROW end_ARRAY (10)

where Mpossubscript𝑀posM_{\text{pos}}italic_M start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT denotes the maximum and Mnegsubscript𝑀negM_{\text{neg}}italic_M start_POSTSUBSCRIPT neg end_POSTSUBSCRIPT represents minimum values. With our re-weighting attention modulation, the model is guided to focus more attention on the important tokens, corresponding to the generated local dense captions. Therefore, our method can refine the previously generated images for better semantic alignment with the text prompts.

IV Experiments

TABLE I: Quantitative comparison of different methods on the MS-COCO [12], ABC-6K [6], CC-500 [6], and ViLG-300 [49] datasets.
Dataset Method FID\downarrow CLIP\uparrow TIFA \uparrow Human Study
Alignment \uparrow Fidelity \uparrow
MS-COCO [12] SD-v1.5 [3] 13.7599 0.1626 0.78 8.9% 2.1%
SD-XL [3] 7.0864 0.3578 0.84 14.9% 19.8%
DeepFloyd-IF [50] 7.5431 0.3433 0.87 16.1% 16.7%
Imagereward [5] 12.7248 0.1587 0.74 7.5% 5.8%
DenseDiffusion [51] 8.3359 0.1585 0.80 8.7% 11.9%
Promptist [39] 8.0351 0.1627 0.79 11.2% 9.1%
StructureDiffusion [6] 8.7603 0.3279 0.85 13.6% 14.3%
PixArt-α𝛼\alphaitalic_α [7] 7.9378 0.3449 0.84 -- --
RealignDiff (Ours) 6.9617 0.3767 0.89 19.1% 20.3%
ABC-6K [6] SD-v1.5 [3] 13.4539 0.1620 0.75 9.2% 8.7%
SD-XL [3] 6.7145 0.3531 0.84 14.4% 18.4%
DeepFloyd-IF [50] 7.3319 0.3399 0.86 15.7% 15.0%
Imagereward [5] 12.4287 0.1592 0.71 9.8% 6.8%
DenseDiffusion [51] 8.1301 0.1604 0.79 7.1% 11.1%
Promptist [39] 8.0352 0.1636 0.77 11.6% 9.3%
StructureDiffusion [6] 8.6598 0.3301 0.83 12.8% 12.2%
PixArt-α𝛼\alphaitalic_α [7] 7.7364 0.3391 0.83 -- --
RealignDiff (Ours) 6.5623 0.3782 0.88 19.4% 18.5%
CC-500 [6] SD-v1.5 [3] 13.9598 0.1615 0.77 6.8% 4.0%
SD-XL [3] 7.2364 0.3498 0.83 16.5% 15.4%
DeepFloyd-IF [50] 7.5494 0.3614 0.85 20.1% 9.6%
Imagereward [5] 12.8246 0.1582 0.74 5.3% 1.9%
DenseDiffusion [51] 8.1353 0.1635 0.82 10.1% 18.0%
Promptist [39] 8.4352 0.1593 0.77 7.3% 14.4%
StructureDiffusion [6] 8.8604 0.3281 0.85 13.1% 12.3%
PixArt-α𝛼\alphaitalic_α [7] 8.5431 0.3211 0.82 -- --
RealignDiff (Ours) 7.1641 0.3761 0.90 20.8% 24.4%
ViLG-300 [49] SD-v1.5 [3] 15.4943 0.1957 0.76 10.1% 2.4%
SD-XL [3] 7.9753 0.4213 0.83 14.9% 20.0%
DeepFloyd-IF [50] 8.1541 0.4349 0.86 16.1% 16.7%
Imagereward [5] 13.6488 0.1906 0.75 7.5% 4.8%
DenseDiffusion [51] 8.6412 0.1907 0.81 8.7% 11.9%
Promptist [39] 9.2419 0.1962 0.77 11.2% 9.5%
StructureDiffusion [6] 9.1514 0.3936 0.86 13.6% 14.3%
PixArt-α𝛼\alphaitalic_α [7] 8.4496 0.4015 0.84 -- --
RealignDiff (Ours) 7.2311 0.4527 0.90 17.9% 20.4%

IV-A Experimental Setup

Datasets. Our approach is trained on the MS-COCO [12] and ViLG-300 [13] datasets. The MS-COCO dataset comprises 82,783 training and 40,504 validation text-image pairs. We split the ViLG300 dataset into 80% for the training set and 20% for the test set. It is noteworthy that only the image captions from the training subset of the MS-COCO dataset and the ViLG-300 dataset are utilized for fine-tuning the model. For the evaluation, we have randomly selected 5,000 image captions from the validation subset of the MS-COCO dataset. Furthermore, our approach is also evaluated on the ABC-6K [6] and CC-500 [6]. The ABC-6K, derived from natural prompts within MS-COCO, each contains a minimum of two color descriptors modifying distinct objects. In contrast, the CC-500 consists of natural compositional prompts but primarily features simpler prompts that combine two concepts. These prompts follow sentence structures like “a red car and a pink elephant”, pairing objects with their respective attribute descriptors.

Evaluation Metrics. We adopt two metrics to measure the semantic consistency between the generated images and the input text prompts: CLIP [8, 52]and TIFA [53] scores. The higher the CLIP and TIFA scores, the better the semantic consistency. TIFA score uses a VQA method to evaluate alignment. Furthermore, the quality of generated images was assessed using the Frechet Inception Distance (FID)222https://github.com/mseitzer/pytorch-fid [54], where a lower FID score indicates better image quality. We also conducted a human study to gauge the semantic alignment of the generated images with their corresponding textual prompts. Participants in the study were presented with sets of images synthesized by the different text-to-image diffusion models alongside the input prompts that guided their generation. They were instructed to choose the order of different results in terms of the alignment and fidelity metrics. The alignment score measures the semantic consistency between the generated images and the input prompts. The fidelity score measures the quality of the generated images. We use the average rank from different participants as the final scores. This study collected a total of 100100100100 human evaluation results.

Implementation Details. Our algorithm is implemented in PyTorch. All experiments are conducted on servers equipped with eight Nvidia A100 GPUs, each with 40 GB of memory, and an AMD EPYC 7742 CPU running at 2.30 GHz. We adopt the Stable Diffusion v1.5 [3] as the foundational generative model and proceed to fine-tune it. We set a learning rate at 1e-5 and utilize a cumulative batch size of 128. All training and evaluations are conducted at a resolution of 512x512. We chose our fine-tuned checkpoint based on early stopping criteria to avoid overfitting and ‘reward hacking’, leading to performance degradation. The training was stopped after approximately 947 iterations when further improvements on validation metrics ceased. For each generation task, we set the random seed to 42 and generate images with a resolution of 512x512 pixels. The model is fine-tuned using half-precision floating-point numbers. For the ReFL algorithm, we configure the settings with λ=1e3𝜆1𝑒3\lambda=1e-3italic_λ = 1 italic_e - 3, and T=50𝑇50T=50italic_T = 50.

IV-B Comparison Against Baselines

In this section, we conduct a comparative assessment of the proposed RealignDiff model against eight state-of-the-art text-to-image diffusion models. These include SD-v1.5 [3], SD-XL [3], DeepFloyd-IF [50], PixArt-α𝛼\alphaitalic_α [7], DenseDiffusion [51], Imagereward [5], Promptist [39], and StructureDiffusion [6]. Table I displays the quantitative comparison results of different methods on the MS-COCO, ABC-6K, CC-500 and ViLG-300 datasets.

Refer to caption
Figure 4: Qualitative comparison of different methods. Our method achieves the best performance regarding the quantity of objects, leakage of attributes, and the binding of attributes. More cases are provided in the Appendix.
Refer to caption
(a) A cat laying next to a stainless steel bowl.
Refer to caption
(b) A bench near a grassy area near a parked car.
Refer to caption
(c) A dog watching a little boy on TV.
Refer to caption
(d) A photo of a stop sign in front of a graffitied truck
Figure 5: From left to right, respectively: RealignDiff(ours), SD-v1.5 [3], DenseDiffusion [51], Imagereward [5], Promptist [39], StructureDiffusion [6], SD-XL [3] and PixArt-α𝛼\alphaitalic_α [7]

As shown in Table I, RealignDiff demonstrates superior performance over the other state-of-the-art (SOTA) methods across all evaluated metrics. It is remarkable that our method’s performance slightly surpasses SDXL, even though we used SD v1.5 as the foundational generative model. Specifically, it achieved an FID score of 6.9617 on the MS-COCO dataset, which is significantly lower than those of competing methods, indicating its effectiveness. Additionally, in terms of CLIP and TIFA scores, our method reached 0.3767 and 0.89 on the MS-COCO dataset, respectively. These scores further underscore the ability of RealignDiff to generate semantically coherent and visually compelling images from textual descriptions.

Figures 1, 4 and 5 illustrate the qualitative comparison among different methods. It is evident that our RealignDiff achieves the best results in terms of both image quality and semantic consistency between the generated images and the text prompts. ImageReward, Promptist, and StructureDiffusion fail to depict all main objects; SD-v1.5, Midjourney, and DenseDiffusion exhibit misaligned attributes of the objects, such as color.

Figure 6 presents the generation comparison results for complex prompts. Compared to SDXL, our RealignDiff can accurately capture the attributes of objects, such as ”black and white cat”, ”golden retriever”, ”sunlight”, ”window”, and ”wooden”. Additionally, our method can correctly handle the relative positions of multiple objects, such as ”to the right”.

Refer to caption
Refer to caption
Figure 6: Generated Image for the complex prompt. Left: SDXL, Right: RealignDiff (Ours)

IV-C Ablation Study

In this section, we first study the effectiveness of the proposed coarse semantic re-alignment and fine semantic re-alignment modules. We then discuss the advantages of our proposed caption reward. Following this, we present a comparison of different LLMs. Finally, we showcase some intermediate generative results and attention maps to illustrate the performance of our approach.

Refer to caption
Figure 7: Effectiveness of the coarse and fine semantic re-alignment modules. a) a black cat laying on top of the arm of a red chair. b) a girl in a yellow dress and a big white animal. c) a yellow bird is sitting on a park bench. d) a little girl in a khaki shirt and pink skirt playing tennis. e) a street light and a blue pole against a white background

Coarse & Fine Semantic Re-alignment. Table II presents the results of the ablation study for our coarse and fine semantic re-alignment modules on the MS-COCO dataset. The results demonstrate that, in terms of image quality, the coarse semantic re-alignment module reduces the FID from 13.7599 to 7.5349, and the fine semantic re-alignment module further reduces it to 6.9617. With respect to the semantic consistency between the generated images and the input texts, the coarse semantic re-alignment module improves the CLIP score from 0.1626 to 0.2548, and the fine semantic re-alignment module further improves it to 0.3767. These findings suggest that both the coarse and fine semantic re-alignment modules significantly enhance the performance of text-to-image diffusion models.

TABLE II: Ablation study of our coarse and fine semantic re-alignment modules on the MS-COCO dataset.
Coarse Fine FID\downarrow CLIP\uparrow TIFA \uparrow
13.7599 0.1626 0.75
7.5349 0.2548 0.85
9.3622 0.1371 0.78
6.9617 0.3767 0.89

Figure 7 further underscores the effectiveness of the coarse and fine semantic re-alignment modules. The figure reveals that without the coarse semantic re-alignment, the text-to-image diffusion model often fails to capture the main objects mentioned in the text prompts. Without the fine semantic re-alignment, the model struggles to accurately represent the attributes and relationships of the objects described in the input text. However, when both the coarse and fine semantic re-alignment modules are applied, the text-to-image diffusion model is capable of generating high-quality images that are semantically aligned with the input texts.

Caption Reward. The reward function is a pivotal component in the coarse semantic re-alignment stage. In this subsection, we evaluate our proposed CaptionReward against other reward functions such as CLIP reward, BLIP reward, and ImageReward. Table III provides the comparative results among these reward functions on the MS-COCO dataset

TABLE III: Effectiveness of caption reward.
Reward Function FID\downarrow CLIP\uparrow TIFA \uparrow
Clip Reward [52] 14.3091 0.1401 0.77
Blip Reward [9] 13.2098 0.1400 0.76
Image Reward [5] 12.7248 0.1587 0.78
Caption Reward (Ours) 6.9617 0.3767 0.89
Refer to caption
Figure 8: Qualitative Comparison among Reward functions. a) a black cat laying on top of the arm of a red chair. b) a girl in a yellow dress and a big white animal. c) a yellow bird is sitting on a park bench. d) a little girl in a khaki shirt and pink skirt playing tennis. e) a street light and a blue pole against a white background

As shown in Table III, our novel Caption Reward outperforms all other reward functions in all metrics, which include CLIP Reward, BLIP Reward, and Image Reward. This superiority can be attributed to CaptionReward’s methodology of calculating the reward score by measuring the similarity between the generated caption and the input prompt, rather than measuring the similarity between the generated image and the input prompt, which is the approach taken by the other rewards. The detailed caption provides more nuanced guidance on the appropriateness of the surrounding concepts and context within the image with respect to the given text prompt. Additionally, Figure 8 further demonstrates the effectiveness of the proposed Caption Reward. As depicted in Figure 8, the text-to-image diffusion model, when re-aligned using the Caption Reward, is capable of generating images that are not only of higher quality but also more semantically aligned with the input text than those produced using other reward functions.

Comparison of different LLMs. The efficacy of the fine semantic re-alignment module is intrinsically linked to its ability to tag images accurately and modulate attributes effectively through large language models (LLMs). This section aims to delve into an ablation study that analyzes the success rates of various image tagging models and LLMs on the ViLG-300 dataset, including ChatGPT, GPT-4 [55], Vicuna-7b [56], and Llama2-7b [57].

TABLE IV: The success rates of different image tagging and large language models in the local dense caption generation module on the ViLG-300 dataset.
Model Llama2-7b Vicuna-7b ChatGPT GPT-4
RAM [46] 63% 81% 92% 99%
Tag2Text [58] 51% 76% 85% 91%

Table IV showcases the success rates of both RAM and Tag2Text when paired with the aforementioned LLMs. It is noteworthy that the success rate of RAM+GPT-4 can achieve 99%. The findings indicate: 1) ChatGPT, while slightly lagging behind GPT-4, presents promising outcomes, especially with RAM. This underscores the versatility and robustness of the ChatGPT model; 2) On the other end of the spectrum, Llama2-7b exhibits the lowest success rates with both tagging models. This could hint at possible areas of refinement or potential incompatibilities between the tag model and LLM. In the future, we can improve performance and save costs by specifying fine-tuning of the task.

Intermediate generative results and attention maps. Figure 9 presents the intermediate generative results of RealignDiff, displaying the progression from left to right: the initial coarse images following coarse semantic re-alignment, the segmented maps derived from the coarse images, and the final high-quality, semantically-aligned images produced after fine semantic re-alignment, which achieve fine-grained attribute binding. In Figure 9(a), the vase undergoes a reassignment of its attribute to yellow. In Figure 9(b), the quantity of cats is reassigned to one, their location is redefined as on the bowl, and their color is reassigned to a black-and-white pattern.

Refer to caption
Figure 9: Intermediate generative results of RealignDiff.

We use Diffusion Attentive Attribution Maps (DAAM) [59] to visualize the intermediate attention maps of key attribute tokens before and after fine semantic re-alignment. As displayed in Figure 10, the color attributes black and blue get more attention in the designated regions after fine semantic re-alignment, leading to better alignment and generative performance.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Figure 10: Intermediate attention maps of ”A black dog on the street next to a blue car.” (a) ’black’ before re-align, (b) ’black’ after re-align, (c) ’blue’ before re-align, (d) ’blue’ after re-align

V Conclusion

In this paper, we propose a novel two-stage coarse-to-fine semantic re-alignment method, RealignDiff, to enhance the alignment between descriptions and corresponding images within the text-to-image diffusion models. The initial coarse semantic re-alignment stage entails fine-tuning the text-to-image model from a global semantic perspective. This stage is crucial for ensuring that the generated images faithfully depict the objects and entities described within the given textual input. The fine semantic re-alignment stage occurs without the need for additional training data, allowing for the accurate capture of object attributes and relationships. Experimental results on MS-COCO and ViLG-300 datasets demonstrate that RealignDiff outperforms other baselines in terms of both visual quality and semantic similarity with input prompt.

Limitations and future works In the fine semantic re-alignment stage, if the large language model fails to provide accurate intermediate results, it may hinder the refinement of previously generated images. Future work will focus on overcoming this limitation. Additionally, we aim to explore dynamic learning from multiple reward functions, such as semantic and aesthetic, within the diffusion model.

References

  • [1] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 479–36 494, 2022.
  • [2] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
  • [3] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2022, pp. 10 684–10 695.
  • [4] X. Xu, Z. Wang, E. Zhang, K. Wang, and H. Shi, “Versatile diffusion: Text, images and variations all in one diffusion model,” arXiv preprint arXiv:2211.08332, 2022.
  • [5] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong, “Imagereward: Learning and evaluating human preferences for text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [6] W. Feng, X. He, T.-J. Fu, V. Jampani, A. R. Akula, P. Narayana, S. Basu, X. E. Wang, and W. Y. Wang, “Training-free structured diffusion guidance for compositional text-to-image synthesis,” in The Eleventh International Conference on Learning Representations, 2023.
  • [7] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, and Z. Li, “Pixart-α𝛼\alphaitalic_α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” in The Twelve International Conference on Learning Representations, 2024.
  • [8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [9] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning.   PMLR, 2022, pp. 12 888–12 900.
  • [10] G. Kim, T. Kwon, and J. C. Ye, “Diffusionclip: Text-guided diffusion models for robust image manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2426–2435.
  • [11] B. Li, X. Wang, X. Xu, Y. Hou, Y. Feng, F. Wang, and W. Che, “Semantic-guided image augmentation with pre-trained models,” arXiv preprint arXiv:2302.02070, 2023.
  • [12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13.   Springer, 2014, pp. 740–755.
  • [13] Z. Feng, Z. Zhang, X. Yu, Y. Fang, L. Li, X. Chen, Y. Lu, J. Liu, W. Yin, S. Feng et al., “Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 135–10 145.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Neural Information Processing Systems, 2017.
  • [16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  • [17] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning.   PMLR, 2016, pp. 1060–1069.
  • [18] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan++: Realistic image synthesis with stacked generative adversarial networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 8, pp. 1947–1962, 2018.
  • [19] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1316–1324.
  • [20] B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, “Controllable text-to-image generation,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 2065–2075.
  • [21] H. Dong, S. Yu, C. Wu, and Y. Guo, “Semantic image synthesis via adversarial learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5706–5714.
  • [22] M. Zhu, P. Pan, W. Chen, and Y. Yang, “Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5802–5810.
  • [23] M. Tao, H. Tang, S. Wu, N. Sebe, X.-Y. Jing, F. Wu, and B. Bao, “Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis,” arXiv preprint arXiv:2008.05865, 2020.
  • [24] H. Ye, X. Yang, M. Takac, R. Sunderraman, and S. Ji, “Improving text-to-image synthesis using contrastive learning,” arXiv preprint arXiv:2107.02423, 2021.
  • [25] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park, “Scaling up gans for text-to-image synthesis,” arXiv preprint arXiv:2303.05511, 2023.
  • [26] A. Sauer, T. Karras, S. Laine, A. Geiger, and T. Aila, “Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis,” arXiv preprint arXiv:2301.09515, 2023.
  • [27] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning.   PMLR, 2021, pp. 8821–8831.
  • [28] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang et al., “Cogview: Mastering text-to-image generation via transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 19 822–19 835, 2021.
  • [29] P. Esser, R. Rombach, A. Blattmann, and B. Ommer, “Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 3518–3532, 2021.
  • [30] M. Ding, W. Zheng, W. Hong, and J. Tang, “Cogview2: Faster and better text-to-image generation via hierarchical transformers,” arXiv preprint arXiv:2204.14217, 2022.
  • [31] Z. Zhang, J. Ma, C. Zhou, R. Men, Z. Li, M. Ding, J. Tang, J. Zhou, and H. Yang, “M6-ufc: Unifying multi-modal controls for conditional image synthesis,” arXiv preprint arXiv:2105.14211, 2021.
  • [32] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image generation using residual quantization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 523–11 532.
  • [33] H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinstein et al., “Muse: Text-to-image generation via masked generative transformers,” arXiv preprint arXiv:2301.00704, 2023.
  • [34] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.
  • [35] A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with vq-vae-2,” in Advances in neural information processing systems, 2019, pp. 14 866–14 876.
  • [36] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 873–12 883.
  • [37] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
  • [38] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation.” J. Mach. Learn. Res., vol. 23, pp. 47–1, 2022.
  • [39] Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” arXiv preprint arXiv:2212.09611, 2022.
  • [40] K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu, “Aligning text-to-image models using human feedback,” arXiv preprint arXiv:2302.12192, 2023.
  • [41] X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li, “Better aligning text-to-image models with human preference,” arXiv preprint arXiv:2303.14420, 2023.
  • [42] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum, and T. Zhang, “Raft: Reward ranked finetuning for generative foundation model alignment,” arXiv preprint arXiv:2304.06767, 2023.
  • [43] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022.
  • [44] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18.   Springer, 2015, pp. 234–241.
  • [45] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023.
  • [46] Y. Zhang, X. Huang, J. Ma, Z. Li, Z. Luo, Y. Xie, Y. Qin, T. Luo, Y. Li, S. Liu et al., “Recognize anything: A strong image tagging model,” arXiv preprint arXiv:2306.03514, 2023.
  • [47] OpenAI, “Gpt-4 technical report,” arXiv:2303.08774, 2023.
  • [48] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
  • [49] Z. Feng, Z. Zhang, X. Yu, Y. Fang, L. Li, X. Chen, Y. Lu, J. Liu, W. Yin, S. Feng, Y. Sun, L. Chen, H. Tian, H. Wu, and H. Wang, “Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts,” 2023.
  • [50] A. Shonenkov, M. Konstantinov, D. Bakshandaeva, C. Schuhmann, K. Ivanova, and N. Klokova, “Deepfloyd if: A powerful text-to-image model that can smartly integrate text into images,” 2023, online; accessed 16-November-2023. [Online]. Available: https://www.deepfloyd.ai/deepfloyd-if
  • [51] Y. Kim, J. Lee, J.-H. Kim, J.-W. Ha, and J.-Y. Zhu, “Dense text-to-image generation with attention modulation,” arXiv preprint arXiv:2308.12964, 2023.
  • [52] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi, “Clipscore: A reference-free evaluation metric for image captioning,” 2022.
  • [53] Y. Hu, B. Liu, J. Kasai, Y. Wang, M. Ostendorf, R. Krishna, and N. A. Smith, “Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering,” arXiv preprint arXiv:2303.11897, 2023.
  • [54] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” 2018.
  • [55] OpenAI, “Gpt-4 technical report,” 2024.
  • [56] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” March 2023.
  • [57] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [58] X. Huang, Y. Zhang, J. Ma, W. Tian, R. Feng, Y. Zhang, Y. Li, Y. Guo, and L. Zhang, “Tag2text: Guiding vision-language model via image tagging,” 2023.
  • [59] R. Tang, L. Liu, A. Pandey, Z. Jiang, G. Yang, K. Kumar, P. Stenetorp, J. Lin, and F. Ture, “What the daam: Interpreting stable diffusion using cross attention,” arXiv preprint arXiv:2210.04885, 2022.