RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment
Abstract
Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from textual descriptions. However, these approaches have faced challenges in precisely aligning the generated visual content with the textual concepts described in the prompts. In this paper, we propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff, aimed at improving the alignment between text and images in text-to-image diffusion models. In the coarse semantic re-alignment phase, a novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt. Subsequently, the fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view. Experimental results on the MS-COCO and ViLG-300 datasets demonstrate that the proposed two-stage coarse-to-fine semantic re-alignment method outperforms other baseline re-alignment techniques by a substantial margin in both visual quality and semantic similarity with the input prompt.
Index Terms:
Text-to-Image Generation, Diffusion Model, Fine Semantic Re-alignmentI Introduction
Text-to-image diffusion models [1, 2, 3, 4] have witnessed significant advancements in recent years. These models can generate high-quality and diverse images based on the given input texts. The ability to convert textual descriptions into realistic images has enormous potential in various applications such as graphic design, computer vision, and creative writing. Despite several text-to-image diffusion models have been deployed in real-world applications such as Imagen [1], DALL-E 2 [2], Stable Diffusion [3], Midjourney111https://www.midjourney.com/, and Versatile Diffusion [4], the generated images from these models are not perfect [5] as displayed in Figure 1. The main challenge faced by existing text-to-image diffusion models is achieving precise alignment between the generated image and the input caption. Specifically, these models often encounter difficulties in accurately capturing the attributes and relationships of the objects described in the input text.
To tackle the issue of semantic misalignment in text-to-image diffusion models, some researchers have introduced pre-trained image-text models, such as CLIP [8] and BLIP [9], to calculate the semantic guidance [10, 11]. Recently, ImageReward [5] has also been proposed to solve both the text-image alignment and human aesthetic problems, which is trained using a comprehensive collection and annotation pipeline that leverages expert preference data. However, as shown in Figure 1, while ImageReward is capable of aligning images and text at a coarse level, it tends to overlook fine-grained alignment which is crucial for text-matched image generation, such as precisely aligning attributes, quantities, and relationships between objects described in the given text. To solve the fine-grained semantic alignment problem, Structure Diffusion [6] incorporates language structures into the cross-attention layers. While Structure Diffusion can address the attribute binding problem, it tends to miss the main objects described in the input prompt.
In this paper, we present a novel two-stage coarse-to-fine semantic re-alignment method, called RealignDiff, to generate images that more accurately align with user-provided textual descriptions within text-to-image diffusion models. During the coarse semantic re-alignment stage, we propose a novel caption reward to optimize the text-to-image diffusion model from a global semantic view. Specifically, the caption reward generates a corresponding detailed caption that depicts all crucial contents in the synthetic image via a BLIP-2 model and then calculates the reward score by measuring the similarity between the generated caption and the given prompt. The elaborated caption can give more guidance about whether the surrounding concepts and context in the image are reasonable given the input text prompt. It is noteworthy that only the coarse semantic re-alignment stage may not be sufficient to capture all the desired characteristics of the generated images, especially in cases where the input texts describe complex and diverse scenes. To sense the correctness of local semantic parts, we further propose the fine semantic re-alignment. In the fine semantic re-alignment stage, we present a local dense caption generation module and a re-weighting attention modulation module from the local semantic view to refine the previously generated images. The local dense caption generation module generates the mask, detailed caption, and the corresponding likelihood score of each object appearing in the generated images. Armed with the generated detailed captions and the corresponding scores, the re-weighting attention modulation module can re-align the generated captions and the segmented parts of the generated images.
Experimental results on the MS-COCO and ViLG-300 datasets demonstrate that the proposed two-stage coarse-to-fine semantic re-alignment method outperforms other baseline re-alignment techniques by a substantial margin in both visual quality and semantic similarity with the input prompt. Our approach opens up new avenues for research in this exciting field by providing a more accurate and precise alignment mechanism that can better capture the semantic meaning of the input text and generate high-quality images. Our main contributions are summarized as follows:
-
•
We propose a two-stage coarse-to-fine semantic re-alignment method for text-to-image diffusion models. The coarse semantic re-alignment stage ensures that the objects described in the given text appear in the generated images. The fine semantic re-alignment stage accurately captures the attributes and relationships of the objects in the input text.
-
•
We propose a novel caption reward and a novel local dense caption generation module. The caption reward measures the similarity between the generated caption and the given text prompt. The local dense caption generation module provides guidance regarding the attributes and spatial arrangements of objects.
- •
II Related Work
II-A Text-to-Image Generation.
Text-to-image generation aims to generate images given input text descriptions. Along with the progress on generative models, including generative adversarial networks (GANs [14]), auto-regressive model [15] and diffusion model [16], there are numbers of works for text-to-image generation. Among them, GANs are first adopted for text-to-image generation [17] and later many GAN-based models are proposed for better visual fidelity and caption similarity [18, 19, 20, 21, 22, 23, 24, 25, 26]. However, GANs suffer from the well-known problem of mode-collapse and unstable training processes. To solve these problems, another line of works explore applying Transformer-based auto-regressive model for text-to-image generation [27, 28, 29, 30, 31, 32, 33] with a discrete VAE [34, 35, 36] model for tokenizing the input images and a Transformer [15] model for fitting the joint distribution of text tokens and image tokens. Recent works adopt diffusion model for text-to-image generation [37, 38, 2, 1, 3, 4], which learns to predict the added noise of noised images and generates images from pure noise by iteratively predict added noise and remove it. Among them, in order to reduce the computational overhead of large-scale text-to-image generation models, Stable Diffusion [3] proposed to first encode the input images as low-dimension latent codes and then adopt a diffusion model to generate these latent codes conditioned on the input texts. Although significant progress in high-quality text-to-image generation has been achieved, problems including misalignment with human preference and misalignment with input texts still remain to be solved.
II-B Alignment of Text-to-Image Generation Models.
Some works [39, 40, 41, 5, 42] are proposed to align a text-to-image generation model with human preference and aesthetic quality. [40] first learn a reward model with the human feedback assessing model outputs and then finetune a text-to-image model by maximizing reward-weighted likelihood to improve image-text alignment. Similarly, [41] took the human aesthetic preference into account and proposed to learn a human preference reward model. ImageReward [5] proposed a general-purpose text-to-image human preference reward model, covering text-image alignment, body problems, human aesthetics, toxicity, and biases. Promptist [39] proposed prompt adaptation, i.e., training a language model to generate a better prompt given the origin prompt. They utilize the CLIP model and aesthetic predictor model as the reward model and perform supervised fine-tuning under the reinforcement learning paradigm. On the other hand, to circumvent the problems of inefficiencies and instabilities of Reinforcement Learning from Human Feedback (RLHF [43]), [42] introduce Reward ranked Fine-Tuning (RAFT) to align generative models more effectively. However, RAFT is prone to overfitting as the number of iterations increases. More recently, Xu et al. [5] have developed reward feedback learning (ReFL) to optimize text-to-image diffusion models against a reward function, which has demonstrated its effectiveness in achieving better alignment. However, the existing reward models do not take both coarse-grained and fine-grained image-text semantic alignment into account. In this paper, we propose RealignDiff to improve the alignment between text and images in text-to-image diffusion models from the global and local views.
III Method
Figure 2 shows the pipeline of our RealignDiff approach for boosting the text-to-image diffusion models. In this section, we first introduce the preliminary knowledge of the text-to-image diffusion model. Then we present the coarse semantic re-alignment method, including the caption reward and reward feedback learning framework. Finally, we introduce the fine semantic re-alignment method, including the local dense caption generation and the re-weighting attention modulation modules.
III-A Preliminary: Text-to-image diffusion model
Given an image sampled from the real image distribution , diffusion models first produce a Markov chain of latent variables by progressively adding Gaussian noise to the image according to some variance schedule given by as follows:
(1) |
and then learn a model to approximate the true posterior:
(2) |
to perform the reverse denoising process for image generation: starting from a random noise and gradually reducing the noise to finally get a real image . While a tractable variational lower-bound on can be used to optimize and , to achieve better results, [16] instead adopt a denoising network which predicts the added noise of a noisy image and adopts the following training objective:
(3) |
where is uniformly sampled from . For a text-to-image generation, the denoising network receives the input text as extra conditional input and is denoted as .
We adopt Stable Diffusion [3] as our baseline text-to-image generation model. In this model, a real image is first down-sampled 8 times as a lower-dimension latent code with an autoencoder model and the denoising network is parameterized as a Unet [44] network, where embedding of time step is injected with adaptive normalization layers and embedding of input text is injected with cross-attention layers. However, the Stable Diffusion model fails to perform precise alignment between the text concept and generated images since it’s trained only with the global alignment between the text and images.
III-B Coarse-to-fine Semantic Re-alignment
In this subsection, we first introduce the coarse semantic re-alignment method and then present the fine semantic re-alignment method.
III-B1 Coarse Semantic Re-alignment.
To ensure the objects described in the given text appear in the generated image, we propose the coarse semantic re-alignment method, including the caption reward and the reward feedback learning framework.
Caption Reward. The caption reward is proposed to improve the consistency between the synthetic caption of the generated image and the given text prompt. Specifically, given an image generated by a text-to-image diffusion model, we first obtain the corresponding caption using the pre-trained Blip-2 model. Then we compute the similarity between the embeddings of generated caption and the corresponding text prompt as our caption reward score. Note that we utilize a pre-trained BLIP-2 [45] text encoder to convert the captions into the text embeddings. Formally, the caption reward score can be calculated as follows:
(4) |
Note that while the caption reward can effectively promote consistency between the generated captions and input prompts, it may not capture all the desired characteristics of the generated images, especially in cases where the input texts describe complex and diverse scenes.
Reward Feedback Learning. Reward Feedback Learning (ReFL) is designed to optimize text-to-image diffusion models by leveraging a reward function. Within this framework, a caption reward is incorporated to enable coarse semantic re-alignment. ReFL facilitates the direct optimization of text-to-image diffusion models by back-propagating gradients to a randomly selected intermediate step during the denoising process. The rationale behind this random selection of is significant: solely retaining the gradient information from the last denoising step leads to pronounced training instability and suboptimal results. Instead of progressively reducing noise to generate an image from an intermediate state via a sequential transformation process , ReFL employs an alternative approach. It directly predicts from using the transformation during the fine-tuning of text-to-image diffusion models. This method is grounded in the insightful observation that the caption reward scores for generations after a sufficient number of denoising steps (typically, ), provide effective feedback for improving model performance.
To address the challenges of rapid overfitting and to enhance stability during fine-tuning, a re-weighting scheme is applied to the ReFL loss, along with regularization using the pre-training loss. The overall loss function is defined as:
(5) |
where is a weighting factor, is the ReLU operation, denotes the caption reward score, represents the parameters of the text-to-image diffusion models, while denotes the generated image produced by the text-to-image diffusion models with parameters , corresponding to the text prompt . This formulation underscores the essential role of the ReFL loss in optimizing the model’s performance with respect to semantic alignment.
III-B2 Fine Semantic Re-alignment
In this subsection, we present a training-free method for achieving fine-grained semantic re-alignment. Our objective is to accurately capture the attributes and relationships of the objects described in the input text. This method encompasses two key components: the local dense caption and re-weighting attention modulation.
Local Dense Caption Generation. The local dense caption generation module is designed to concentrate on the specific details within the generated images and assess their alignment with the provided text descriptions from a local perspective. This method fundamentally tackles two crucial objectives: 1) Ascertaining whether the objects depicted in the generated image are consistent with the textual descriptions. 2) providing comprehensive and precise captions for the objects depicted within the generated image.
Specifically, the local dense caption generation module first recognizes the objects in the previously generated images using an off-the-shelf image tagging model, i.e., Recognize Anything Model (RAM) [46]. Subsequently, given the provided prompt and the image tags, the large language model, i.e., GPT-4 [47], is utilized to assess the likelihood score and provide the local detailed descriptions for each recognized object. The score of each object is assigned based on the likelihood of each category label of the object appearing in the scene, which can be summarized as follows:
(6) |
where denotes the category label of the object.
Consider the text prompt ‘a red book and a yellow pen’ as an example. We first utilize a fine-tuned text-to-image diffusion model to generate the image. Subsequently, RAM is employed to identify the objects within this image. If the object tag is ‘book’ or ‘pen’, GPT-4 determines that these objects are certain to appear in the scene, assigning a score of 2. Conversely, if the object tag is ‘sign’ or ‘banana’, GPT-4 deems these objects unlikely to appear, thus assigning a score of 0. For an object tag like ‘desk’, which GPT-4 considers as possibly appearing in the scene, a score of 0.5 is assigned. The example of using GPT-4 is shown in Figure 3. Our extensive experiments have demonstrated GPT-4’s proficiency in accurately performing such tasks.
By assigning scores in this way, we can obtain a likelihood score for each recognized object. After obtaining the local dense caption and the corresponding likelihood score, we use the off-the-shelf segmentation model, i.e., Grounded Semantic Segmentation anything (Grounded-SAM)[48] to obtain the object masks . The local dense caption and the object mask can provide guidance regarding the attributes and layout of objects with the re-weighting attention modulation method.
Re-weighting Attention Modulation. The re-weighting attention strategy plays a crucial role in controlling how specific tokens influence the resulting image. By adjusting the influence based on reward scores, more relevant semantic cues can dominate the image generation process, improving the semantic alignment between the text and the generated image. Building on this concept, we propose a re-weighting attention modulation module. Given a set of detailed local caption and corresponding object masks , the module ensures that objects appear in the correct regions based on their likelihood scores .
Specifically, the original attention maps is defined as below:
(7) |
where represents the query value, which is mapped from image features, while denotes the key value, derived from text features. The term is the length of the key and query features. Based on the attention map , the image features can be updated, referencing the text features.
Our re-weighting attention modulation method modulates the original attention maps as follows:
(8) |
where denotes the Hadamard product. is a scalar, proportional to the timestep , to adjust the degree of modulation. represents the re-weighting score matrix, which can be obtained through the likelihood score . The score matrix for each is constructed based on the elements of object masks and corresponding likelihood scores . The rule for constructing is as follows:
(9) |
where and represent the row and column indices in and respectively. This formulation applies to each matrix in the set . We use Grounded-SAM to obtain the object masks . is a boolean mask vector where each element corresponds to a token in the text features. indicates that the text token is activated at position . denotes the similarity score between the query and key. and can be calculated as:
(10) |
where denotes the maximum and represents minimum values. With our re-weighting attention modulation, the model is guided to focus more attention on the important tokens, corresponding to the generated local dense captions. Therefore, our method can refine the previously generated images for better semantic alignment with the text prompts.
IV Experiments
Dataset | Method | FID | CLIP | TIFA | Human Study | |
---|---|---|---|---|---|---|
Alignment | Fidelity | |||||
MS-COCO [12] | SD-v1.5 [3] | 13.7599 | 0.1626 | 0.78 | 8.9% | 2.1% |
SD-XL [3] | 7.0864 | 0.3578 | 0.84 | 14.9% | 19.8% | |
DeepFloyd-IF [50] | 7.5431 | 0.3433 | 0.87 | 16.1% | 16.7% | |
Imagereward [5] | 12.7248 | 0.1587 | 0.74 | 7.5% | 5.8% | |
DenseDiffusion [51] | 8.3359 | 0.1585 | 0.80 | 8.7% | 11.9% | |
Promptist [39] | 8.0351 | 0.1627 | 0.79 | 11.2% | 9.1% | |
StructureDiffusion [6] | 8.7603 | 0.3279 | 0.85 | 13.6% | 14.3% | |
PixArt- [7] | 7.9378 | 0.3449 | 0.84 | |||
RealignDiff (Ours) | 6.9617 | 0.3767 | 0.89 | 19.1% | 20.3% | |
ABC-6K [6] | SD-v1.5 [3] | 13.4539 | 0.1620 | 0.75 | 9.2% | 8.7% |
SD-XL [3] | 6.7145 | 0.3531 | 0.84 | 14.4% | 18.4% | |
DeepFloyd-IF [50] | 7.3319 | 0.3399 | 0.86 | 15.7% | 15.0% | |
Imagereward [5] | 12.4287 | 0.1592 | 0.71 | 9.8% | 6.8% | |
DenseDiffusion [51] | 8.1301 | 0.1604 | 0.79 | 7.1% | 11.1% | |
Promptist [39] | 8.0352 | 0.1636 | 0.77 | 11.6% | 9.3% | |
StructureDiffusion [6] | 8.6598 | 0.3301 | 0.83 | 12.8% | 12.2% | |
PixArt- [7] | 7.7364 | 0.3391 | 0.83 | |||
RealignDiff (Ours) | 6.5623 | 0.3782 | 0.88 | 19.4% | 18.5% | |
CC-500 [6] | SD-v1.5 [3] | 13.9598 | 0.1615 | 0.77 | 6.8% | 4.0% |
SD-XL [3] | 7.2364 | 0.3498 | 0.83 | 16.5% | 15.4% | |
DeepFloyd-IF [50] | 7.5494 | 0.3614 | 0.85 | 20.1% | 9.6% | |
Imagereward [5] | 12.8246 | 0.1582 | 0.74 | 5.3% | 1.9% | |
DenseDiffusion [51] | 8.1353 | 0.1635 | 0.82 | 10.1% | 18.0% | |
Promptist [39] | 8.4352 | 0.1593 | 0.77 | 7.3% | 14.4% | |
StructureDiffusion [6] | 8.8604 | 0.3281 | 0.85 | 13.1% | 12.3% | |
PixArt- [7] | 8.5431 | 0.3211 | 0.82 | |||
RealignDiff (Ours) | 7.1641 | 0.3761 | 0.90 | 20.8% | 24.4% | |
ViLG-300 [49] | SD-v1.5 [3] | 15.4943 | 0.1957 | 0.76 | 10.1% | 2.4% |
SD-XL [3] | 7.9753 | 0.4213 | 0.83 | 14.9% | 20.0% | |
DeepFloyd-IF [50] | 8.1541 | 0.4349 | 0.86 | 16.1% | 16.7% | |
Imagereward [5] | 13.6488 | 0.1906 | 0.75 | 7.5% | 4.8% | |
DenseDiffusion [51] | 8.6412 | 0.1907 | 0.81 | 8.7% | 11.9% | |
Promptist [39] | 9.2419 | 0.1962 | 0.77 | 11.2% | 9.5% | |
StructureDiffusion [6] | 9.1514 | 0.3936 | 0.86 | 13.6% | 14.3% | |
PixArt- [7] | 8.4496 | 0.4015 | 0.84 | |||
RealignDiff (Ours) | 7.2311 | 0.4527 | 0.90 | 17.9% | 20.4% |
IV-A Experimental Setup
Datasets. Our approach is trained on the MS-COCO [12] and ViLG-300 [13] datasets. The MS-COCO dataset comprises 82,783 training and 40,504 validation text-image pairs. We split the ViLG300 dataset into 80% for the training set and 20% for the test set. It is noteworthy that only the image captions from the training subset of the MS-COCO dataset and the ViLG-300 dataset are utilized for fine-tuning the model. For the evaluation, we have randomly selected 5,000 image captions from the validation subset of the MS-COCO dataset. Furthermore, our approach is also evaluated on the ABC-6K [6] and CC-500 [6]. The ABC-6K, derived from natural prompts within MS-COCO, each contains a minimum of two color descriptors modifying distinct objects. In contrast, the CC-500 consists of natural compositional prompts but primarily features simpler prompts that combine two concepts. These prompts follow sentence structures like “a red car and a pink elephant”, pairing objects with their respective attribute descriptors.
Evaluation Metrics. We adopt two metrics to measure the semantic consistency between the generated images and the input text prompts: CLIP [8, 52]and TIFA [53] scores. The higher the CLIP and TIFA scores, the better the semantic consistency. TIFA score uses a VQA method to evaluate alignment. Furthermore, the quality of generated images was assessed using the Frechet Inception Distance (FID)222https://github.com/mseitzer/pytorch-fid [54], where a lower FID score indicates better image quality. We also conducted a human study to gauge the semantic alignment of the generated images with their corresponding textual prompts. Participants in the study were presented with sets of images synthesized by the different text-to-image diffusion models alongside the input prompts that guided their generation. They were instructed to choose the order of different results in terms of the alignment and fidelity metrics. The alignment score measures the semantic consistency between the generated images and the input prompts. The fidelity score measures the quality of the generated images. We use the average rank from different participants as the final scores. This study collected a total of human evaluation results.
Implementation Details. Our algorithm is implemented in PyTorch. All experiments are conducted on servers equipped with eight Nvidia A100 GPUs, each with 40 GB of memory, and an AMD EPYC 7742 CPU running at 2.30 GHz. We adopt the Stable Diffusion v1.5 [3] as the foundational generative model and proceed to fine-tune it. We set a learning rate at 1e-5 and utilize a cumulative batch size of 128. All training and evaluations are conducted at a resolution of 512x512. We chose our fine-tuned checkpoint based on early stopping criteria to avoid overfitting and ‘reward hacking’, leading to performance degradation. The training was stopped after approximately 947 iterations when further improvements on validation metrics ceased. For each generation task, we set the random seed to 42 and generate images with a resolution of 512x512 pixels. The model is fine-tuned using half-precision floating-point numbers. For the ReFL algorithm, we configure the settings with , and .
IV-B Comparison Against Baselines
In this section, we conduct a comparative assessment of the proposed RealignDiff model against eight state-of-the-art text-to-image diffusion models. These include SD-v1.5 [3], SD-XL [3], DeepFloyd-IF [50], PixArt- [7], DenseDiffusion [51], Imagereward [5], Promptist [39], and StructureDiffusion [6]. Table I displays the quantitative comparison results of different methods on the MS-COCO, ABC-6K, CC-500 and ViLG-300 datasets.
As shown in Table I, RealignDiff demonstrates superior performance over the other state-of-the-art (SOTA) methods across all evaluated metrics. It is remarkable that our method’s performance slightly surpasses SDXL, even though we used SD v1.5 as the foundational generative model. Specifically, it achieved an FID score of 6.9617 on the MS-COCO dataset, which is significantly lower than those of competing methods, indicating its effectiveness. Additionally, in terms of CLIP and TIFA scores, our method reached 0.3767 and 0.89 on the MS-COCO dataset, respectively. These scores further underscore the ability of RealignDiff to generate semantically coherent and visually compelling images from textual descriptions.
Figures 1, 4 and 5 illustrate the qualitative comparison among different methods. It is evident that our RealignDiff achieves the best results in terms of both image quality and semantic consistency between the generated images and the text prompts. ImageReward, Promptist, and StructureDiffusion fail to depict all main objects; SD-v1.5, Midjourney, and DenseDiffusion exhibit misaligned attributes of the objects, such as color.
Figure 6 presents the generation comparison results for complex prompts. Compared to SDXL, our RealignDiff can accurately capture the attributes of objects, such as ”black and white cat”, ”golden retriever”, ”sunlight”, ”window”, and ”wooden”. Additionally, our method can correctly handle the relative positions of multiple objects, such as ”to the right”.
IV-C Ablation Study
In this section, we first study the effectiveness of the proposed coarse semantic re-alignment and fine semantic re-alignment modules. We then discuss the advantages of our proposed caption reward. Following this, we present a comparison of different LLMs. Finally, we showcase some intermediate generative results and attention maps to illustrate the performance of our approach.
Coarse & Fine Semantic Re-alignment. Table II presents the results of the ablation study for our coarse and fine semantic re-alignment modules on the MS-COCO dataset. The results demonstrate that, in terms of image quality, the coarse semantic re-alignment module reduces the FID from 13.7599 to 7.5349, and the fine semantic re-alignment module further reduces it to 6.9617. With respect to the semantic consistency between the generated images and the input texts, the coarse semantic re-alignment module improves the CLIP score from 0.1626 to 0.2548, and the fine semantic re-alignment module further improves it to 0.3767. These findings suggest that both the coarse and fine semantic re-alignment modules significantly enhance the performance of text-to-image diffusion models.
Coarse | Fine | FID | CLIP | TIFA |
---|---|---|---|---|
13.7599 | 0.1626 | 0.75 | ||
✓ | 7.5349 | 0.2548 | 0.85 | |
✓ | 9.3622 | 0.1371 | 0.78 | |
✓ | ✓ | 6.9617 | 0.3767 | 0.89 |
Figure 7 further underscores the effectiveness of the coarse and fine semantic re-alignment modules. The figure reveals that without the coarse semantic re-alignment, the text-to-image diffusion model often fails to capture the main objects mentioned in the text prompts. Without the fine semantic re-alignment, the model struggles to accurately represent the attributes and relationships of the objects described in the input text. However, when both the coarse and fine semantic re-alignment modules are applied, the text-to-image diffusion model is capable of generating high-quality images that are semantically aligned with the input texts.
Caption Reward. The reward function is a pivotal component in the coarse semantic re-alignment stage. In this subsection, we evaluate our proposed CaptionReward against other reward functions such as CLIP reward, BLIP reward, and ImageReward. Table III provides the comparative results among these reward functions on the MS-COCO dataset
As shown in Table III, our novel Caption Reward outperforms all other reward functions in all metrics, which include CLIP Reward, BLIP Reward, and Image Reward. This superiority can be attributed to CaptionReward’s methodology of calculating the reward score by measuring the similarity between the generated caption and the input prompt, rather than measuring the similarity between the generated image and the input prompt, which is the approach taken by the other rewards. The detailed caption provides more nuanced guidance on the appropriateness of the surrounding concepts and context within the image with respect to the given text prompt. Additionally, Figure 8 further demonstrates the effectiveness of the proposed Caption Reward. As depicted in Figure 8, the text-to-image diffusion model, when re-aligned using the Caption Reward, is capable of generating images that are not only of higher quality but also more semantically aligned with the input text than those produced using other reward functions.
Comparison of different LLMs. The efficacy of the fine semantic re-alignment module is intrinsically linked to its ability to tag images accurately and modulate attributes effectively through large language models (LLMs). This section aims to delve into an ablation study that analyzes the success rates of various image tagging models and LLMs on the ViLG-300 dataset, including ChatGPT, GPT-4 [55], Vicuna-7b [56], and Llama2-7b [57].
Table IV showcases the success rates of both RAM and Tag2Text when paired with the aforementioned LLMs. It is noteworthy that the success rate of RAM+GPT-4 can achieve 99%. The findings indicate: 1) ChatGPT, while slightly lagging behind GPT-4, presents promising outcomes, especially with RAM. This underscores the versatility and robustness of the ChatGPT model; 2) On the other end of the spectrum, Llama2-7b exhibits the lowest success rates with both tagging models. This could hint at possible areas of refinement or potential incompatibilities between the tag model and LLM. In the future, we can improve performance and save costs by specifying fine-tuning of the task.
Intermediate generative results and attention maps. Figure 9 presents the intermediate generative results of RealignDiff, displaying the progression from left to right: the initial coarse images following coarse semantic re-alignment, the segmented maps derived from the coarse images, and the final high-quality, semantically-aligned images produced after fine semantic re-alignment, which achieve fine-grained attribute binding. In Figure 9(a), the vase undergoes a reassignment of its attribute to yellow. In Figure 9(b), the quantity of cats is reassigned to one, their location is redefined as on the bowl, and their color is reassigned to a black-and-white pattern.
We use Diffusion Attentive Attribution Maps (DAAM) [59] to visualize the intermediate attention maps of key attribute tokens before and after fine semantic re-alignment. As displayed in Figure 10, the color attributes black and blue get more attention in the designated regions after fine semantic re-alignment, leading to better alignment and generative performance.
V Conclusion
In this paper, we propose a novel two-stage coarse-to-fine semantic re-alignment method, RealignDiff, to enhance the alignment between descriptions and corresponding images within the text-to-image diffusion models. The initial coarse semantic re-alignment stage entails fine-tuning the text-to-image model from a global semantic perspective. This stage is crucial for ensuring that the generated images faithfully depict the objects and entities described within the given textual input. The fine semantic re-alignment stage occurs without the need for additional training data, allowing for the accurate capture of object attributes and relationships. Experimental results on MS-COCO and ViLG-300 datasets demonstrate that RealignDiff outperforms other baselines in terms of both visual quality and semantic similarity with input prompt.
Limitations and future works In the fine semantic re-alignment stage, if the large language model fails to provide accurate intermediate results, it may hinder the refinement of previously generated images. Future work will focus on overcoming this limitation. Additionally, we aim to explore dynamic learning from multiple reward functions, such as semantic and aesthetic, within the diffusion model.
References
- [1] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 479–36 494, 2022.
- [2] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
- [3] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2022, pp. 10 684–10 695.
- [4] X. Xu, Z. Wang, E. Zhang, K. Wang, and H. Shi, “Versatile diffusion: Text, images and variations all in one diffusion model,” arXiv preprint arXiv:2211.08332, 2022.
- [5] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong, “Imagereward: Learning and evaluating human preferences for text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [6] W. Feng, X. He, T.-J. Fu, V. Jampani, A. R. Akula, P. Narayana, S. Basu, X. E. Wang, and W. Y. Wang, “Training-free structured diffusion guidance for compositional text-to-image synthesis,” in The Eleventh International Conference on Learning Representations, 2023.
- [7] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, and Z. Li, “Pixart-: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” in The Twelve International Conference on Learning Representations, 2024.
- [8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- [9] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12 888–12 900.
- [10] G. Kim, T. Kwon, and J. C. Ye, “Diffusionclip: Text-guided diffusion models for robust image manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2426–2435.
- [11] B. Li, X. Wang, X. Xu, Y. Hou, Y. Feng, F. Wang, and W. Che, “Semantic-guided image augmentation with pre-trained models,” arXiv preprint arXiv:2302.02070, 2023.
- [12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
- [13] Z. Feng, Z. Zhang, X. Yu, Y. Fang, L. Li, X. Chen, Y. Lu, J. Liu, W. Yin, S. Feng et al., “Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 135–10 145.
- [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
- [15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Neural Information Processing Systems, 2017.
- [16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
- [17] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning. PMLR, 2016, pp. 1060–1069.
- [18] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan++: Realistic image synthesis with stacked generative adversarial networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 8, pp. 1947–1962, 2018.
- [19] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1316–1324.
- [20] B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, “Controllable text-to-image generation,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 2065–2075.
- [21] H. Dong, S. Yu, C. Wu, and Y. Guo, “Semantic image synthesis via adversarial learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5706–5714.
- [22] M. Zhu, P. Pan, W. Chen, and Y. Yang, “Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5802–5810.
- [23] M. Tao, H. Tang, S. Wu, N. Sebe, X.-Y. Jing, F. Wu, and B. Bao, “Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis,” arXiv preprint arXiv:2008.05865, 2020.
- [24] H. Ye, X. Yang, M. Takac, R. Sunderraman, and S. Ji, “Improving text-to-image synthesis using contrastive learning,” arXiv preprint arXiv:2107.02423, 2021.
- [25] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park, “Scaling up gans for text-to-image synthesis,” arXiv preprint arXiv:2303.05511, 2023.
- [26] A. Sauer, T. Karras, S. Laine, A. Geiger, and T. Aila, “Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis,” arXiv preprint arXiv:2301.09515, 2023.
- [27] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning. PMLR, 2021, pp. 8821–8831.
- [28] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang et al., “Cogview: Mastering text-to-image generation via transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 19 822–19 835, 2021.
- [29] P. Esser, R. Rombach, A. Blattmann, and B. Ommer, “Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 3518–3532, 2021.
- [30] M. Ding, W. Zheng, W. Hong, and J. Tang, “Cogview2: Faster and better text-to-image generation via hierarchical transformers,” arXiv preprint arXiv:2204.14217, 2022.
- [31] Z. Zhang, J. Ma, C. Zhou, R. Men, Z. Li, M. Ding, J. Tang, J. Zhou, and H. Yang, “M6-ufc: Unifying multi-modal controls for conditional image synthesis,” arXiv preprint arXiv:2105.14211, 2021.
- [32] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image generation using residual quantization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 523–11 532.
- [33] H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinstein et al., “Muse: Text-to-image generation via masked generative transformers,” arXiv preprint arXiv:2301.00704, 2023.
- [34] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.
- [35] A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with vq-vae-2,” in Advances in neural information processing systems, 2019, pp. 14 866–14 876.
- [36] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 873–12 883.
- [37] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
- [38] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation.” J. Mach. Learn. Res., vol. 23, pp. 47–1, 2022.
- [39] Y. Hao, Z. Chi, L. Dong, and F. Wei, “Optimizing prompts for text-to-image generation,” arXiv preprint arXiv:2212.09611, 2022.
- [40] K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu, “Aligning text-to-image models using human feedback,” arXiv preprint arXiv:2302.12192, 2023.
- [41] X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li, “Better aligning text-to-image models with human preference,” arXiv preprint arXiv:2303.14420, 2023.
- [42] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum, and T. Zhang, “Raft: Reward ranked finetuning for generative foundation model alignment,” arXiv preprint arXiv:2304.06767, 2023.
- [43] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022.
- [44] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
- [45] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023.
- [46] Y. Zhang, X. Huang, J. Ma, Z. Li, Z. Luo, Y. Xie, Y. Qin, T. Luo, Y. Li, S. Liu et al., “Recognize anything: A strong image tagging model,” arXiv preprint arXiv:2306.03514, 2023.
- [47] OpenAI, “Gpt-4 technical report,” arXiv:2303.08774, 2023.
- [48] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
- [49] Z. Feng, Z. Zhang, X. Yu, Y. Fang, L. Li, X. Chen, Y. Lu, J. Liu, W. Yin, S. Feng, Y. Sun, L. Chen, H. Tian, H. Wu, and H. Wang, “Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts,” 2023.
- [50] A. Shonenkov, M. Konstantinov, D. Bakshandaeva, C. Schuhmann, K. Ivanova, and N. Klokova, “Deepfloyd if: A powerful text-to-image model that can smartly integrate text into images,” 2023, online; accessed 16-November-2023. [Online]. Available: https://www.deepfloyd.ai/deepfloyd-if
- [51] Y. Kim, J. Lee, J.-H. Kim, J.-W. Ha, and J.-Y. Zhu, “Dense text-to-image generation with attention modulation,” arXiv preprint arXiv:2308.12964, 2023.
- [52] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi, “Clipscore: A reference-free evaluation metric for image captioning,” 2022.
- [53] Y. Hu, B. Liu, J. Kasai, Y. Wang, M. Ostendorf, R. Krishna, and N. A. Smith, “Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering,” arXiv preprint arXiv:2303.11897, 2023.
- [54] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” 2018.
- [55] OpenAI, “Gpt-4 technical report,” 2024.
- [56] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” March 2023.
- [57] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- [58] X. Huang, Y. Zhang, J. Ma, W. Tian, R. Feng, Y. Zhang, Y. Li, Y. Guo, and L. Zhang, “Tag2text: Guiding vision-language model via image tagging,” 2023.
- [59] R. Tang, L. Liu, A. Pandey, Z. Jiang, G. Yang, K. Kumar, P. Stenetorp, J. Lin, and F. Ture, “What the daam: Interpreting stable diffusion using cross attention,” arXiv preprint arXiv:2210.04885, 2022.