# Image_to_Prompts.pdf

Graduate students at National Yang Ming Chiao Tung University (NYCU)
June 1, 2023


• 1. Image to Prompts — Stable Diffusion - Image to Prompts, Featured Code Competition, Kaggle. Pin-Yen Liu, Po-Chun Huang, Po-Chuan Chen. June 2, 2023.
• 2. Table of contents: 1 Motivation, 2 Dataset, 3 Method, 4 Analysis, 5 Discussion, 6 Reference.
• 4. Motivation — This project is the Kaggle competition Stable Diffusion - Image to Prompts [1]. The competition reverses the typical direction of a generative text-to-image model: instead of generating an image from a text prompt, we build a model that predicts the text prompt given a generated image. Predictions are made over a dataset containing a wide variety of (prompt, image) pairs generated by Stable Diffusion 2.0, to probe how reversible the latent relationship between prompts and images is.
• 5. Example — Given an image like the one below, we would need to generate the prompt "ultrasaurus holding a black bean taco and a piece of cheese in the woods".
• 7. Dataset — Because our method relies on pre-trained models, we need no dataset of our own. Several existing datasets of (image, prompt) pairs could nonetheless be used in this competition. For example: DiffusionDB (14 million image-text pairs), COCO (Microsoft Common Objects in Context) [6] (2.5 million labeled instances in 328k images), and LAION-2B-en [10] (2.3 billion image-text pairs).
• 9. Method — Our method ensembles the CLIP Interrogator, the OFA model, and a ViT model, with the following weights: 1. Vision Transformer (ViT): 74.88%; 2. CLIP Interrogator: 21.12%; 3. OFA model fine-tuned for image captioning: 4%.
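The slides give only the ensemble ratios, not where the blending happens; a plausible reading is a weighted average of each model's predicted prompt embedding in the shared 384-dimensional SBERT space. A minimal sketch under that assumption, with random vectors standing in for the real per-model embeddings:

```python
import numpy as np

# Stand-ins for each model's predicted prompt embedding for one image
# (real values would come from ViT, the CLIP Interrogator, and OFA).
rng = np.random.default_rng(0)
emb_vit = rng.normal(size=384)
emb_clip_interrogator = rng.normal(size=384)
emb_ofa = rng.normal(size=384)

# Ensemble weights from the slide (they sum to 1.0).
weights = {"vit": 0.7488, "clip_interrogator": 0.2112, "ofa": 0.04}

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize so each model contributes direction, not magnitude."""
    return v / np.linalg.norm(v)

blended = (
    weights["vit"] * normalize(emb_vit)
    + weights["clip_interrogator"] * normalize(emb_clip_interrogator)
    + weights["ofa"] * normalize(emb_ofa)
)
blended = normalize(blended)  # unit-length vector for cosine-similarity scoring
```

Normalizing before and after the weighted sum keeps the result comparable under the competition's cosine-similarity metric, which ignores vector magnitude.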
• 10. ViT — The Vision Transformer (ViT) [3] is a BERT-like transformer encoder model, pretrained on ImageNet-21k (14 million images, 21,843 classes) and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes), both at resolution 224x224.
• 11. CLIP Interrogator — The CLIP Interrogator is a prompt-engineering tool that combines OpenAI's CLIP [7] and Salesforce's BLIP [5] to optimize text prompts to match a given image. Figure: CLIP Interrogator, a tool on Hugging Face Spaces.
• 12. CLIP Interrogator pipeline — Figure: CLIP Interrogator pipeline.
• 13. BLIP — Figure: Framework of BLIP.
• 14. OFA — OFA [9] is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation) in a simple sequence-to-sequence learning framework. Here we use OFA-large-caption, a pre-trained model for image captioning trained on the COCO dataset.
• 16. Sentence Transformer — For the competition we must compute Stable Diffusion prompt embeddings: the input is a prompt sentence, the output its embedding. We obtain the embedding with a sentence transformer, Sentence-BERT [8]. Figure: SBERT architecture at inference.
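At inference, SBERT turns variable-length token embeddings from a BERT encoder into one fixed-size sentence embedding by mean pooling over non-padded positions. A minimal sketch of that pooling step, using random vectors in place of a real BERT output (384 dimensions, matching the embeddings used here):

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, dim = 8, 384
token_embeddings = rng.normal(size=(seq_len, dim))    # stand-in for BERT output
attention_mask = np.array([1, 1, 1, 1, 1, 0, 0, 0])   # last 3 positions are padding

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padded positions (SBERT-style pooling)."""
    mask = attention_mask[:, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=0)
    return summed / mask.sum()

sentence_embedding = mean_pool(token_embeddings, attention_mask)
```

Masked averaging matters because padded positions carry no content; including them would bias the embedding toward shorter prompts.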
• 17. Figure: Ground truth prompts. Figure: Text embedding for each prompt; each has 384 dimensions.
• 18. ViT — Figure: Labels generated by the Vision Transformer.
• 19. CLIP Interrogator — Figure: Prompts generated by the CLIP Interrogator.
• 20. OFA model EDA — Figure: Prompts generated by the OFA model.
• 21. Caption and SBERT score per model:

| Model | Caption | SBERT Score |
| --- | --- | --- |
| ViT | comic book, triceratops, coffee mug, jigsaw puzzle, book jacket, dust cover, dust jacket | 0.553 |
| CLIP Interrogator | a cartoon dinosaur with a piece of cheese in its mouth, boardgamegeek, gaia, woodland background, pancake flat head, inspired by Charles Fremont Conner, mobile learning app prototype, is essentially arbitrary, blue scales, museum item, nachos, prehistory, plates | 0.739 |
| OFA | A blue dinosaur eating a piece of cheese in a forest | 0.772 |
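The SBERT Score column is, presumably, the cosine similarity between the embedding of each generated caption and the embedding of the ground-truth prompt, matching the competition metric. A sketch with synthetic embeddings (the noise scales per model are illustrative, not measured):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for SBERT embeddings: the ground-truth prompt, plus one
# "caption embedding" per model, made by perturbing the truth with noise whose
# scale loosely mimics each model's caption quality.
rng = np.random.default_rng(1)
truth = rng.normal(size=384)
caption_embeddings = {
    model: truth + rng.normal(scale=noise, size=384)
    for model, noise in [("ViT", 2.0), ("CLIP Interrogator", 0.8), ("OFA", 0.6)]
}

scores = {m: cosine_similarity(truth, e) for m, e in caption_embeddings.items()}
```

As in the table, the caption whose embedding stays closest to the ground-truth prompt scores highest.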
• 23. Discussion — This competition explores the reversibility of the latent relationship between text prompts and generated images, challenging participants to predict the text prompt from a given image. It highlights the importance of prompt engineering in text-to-image models and the potential for understanding the intricate relationships between prompts and images. The combination of the CLIP Interrogator, OFA, and ViT models provides a robust approach for predicting prompt embeddings. Figure: Ranking on the Kaggle leaderboard (top 36%).
• 24. Future work — The method used for this competition can still be improved. The first-place solution generated a large dataset of images and prompts and modified and trained its own models, demonstrating the potential for further gains in text-to-image generation and prompt engineering (score: 0.66316, rank 1) [2]. The second-place solution used a ViT-based method with a customized dataset, underlining the importance of supervised learning for predicting sentence embeddings (score: 0.65179, rank 2) [4].
• 26. References:
  [1] Ashley Chow, Will Cukierski, and inversion. Stable Diffusion - Image to Prompts. 2023. URL: https://kaggle.com/competitions/stable-diffusion-image-to-prompts.
  [2] bestfitting. 1st Place Solution. 2023. URL: https://www.kaggle.com/competitions/stable-diffusion-image-to-prompts/discussion/411237.
  [3] Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021. arXiv: 2010.11929 [cs.CV].
  [4] KaizaburoChubachi. 2nd Place Solution. 2023. URL: https://www.kaggle.com/competitions/stable-diffusion-image-to-prompts/discussion/410606.
  [5] Junnan Li et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. 2022. arXiv: 2201.12086 [cs.CV].
  [6] Tsung-Yi Lin et al. Microsoft COCO: Common Objects in Context. 2015. arXiv: 1405.0312 [cs.CV].
  [7] Alec Radford et al. Learning Transferable Visual Models From Natural Language Supervision. 2021. arXiv: 2103.00020 [cs.CV].
  [8] Nils Reimers and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". In: CoRR abs/1908.10084 (2019). arXiv: 1908.10084. URL: http://arxiv.org/abs/1908.10084.
  [9] Peng Wang et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. 2022. arXiv: 2202.03052 [cs.CV].
  [10] Ryan Webster et al. On the De-duplication of LAION-2B. 2023. arXiv: 2303.12733 [cs.CV].