A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
Do not index
Do not index
Original Paper
Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be created manually as natural language instructions or generated automatically as either natural language instructions or vector representations. Prompt engineering enables the ability to perform predictions based solely on prompts without updating model parameters, and the easier application of large pre-trained models in real-world tasks. In past years, Prompt engineering has been well-studied in natural language processing. Recently, it has also been intensively studied in vision-language modeling. However, there is currently a lack of a systematic overview of prompt engineering on pre-trained vision-language models. This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g. Flamingo), image-text matching models (e.g. CLIP), and text-to-image generation models (e.g. Stable Diffusion). For each type of model, a brief model summary, prompting methods, prompting-based applications, and the corresponding responsibility and integrity issues are summarized and discussed. Furthermore, the commonalities and differences between prompting on vision-language models, language models, and vision models are also discussed. The challenges, future directions, and research opportunities are summarized to foster future research on this topic.

Summary Notes

A Comprehensive Overview of Prompt Engineering in Vision-Language Models

Prompt engineering is becoming increasingly crucial in artificial intelligence (AI), particularly in enhancing vision-language models.
This technique uses specific prompts to adapt large pre-trained models for new tasks, minimizing the need for large labeled datasets and extensive model retraining.
This overview aims to shed light on how prompt engineering is applied across different vision- language models, highlighting its role in advancing AI research and applications.


Prompt engineering is revolutionizing the use of vision-language foundation models in AI. By adding prompts - manual or automatic instructions or vector representations - to model inputs, it's possible to efficiently adapt pre-trained models for new tasks with minimal data and without updating model parameters. This approach is key for developing scalable AI solutions.


This survey aims to:
  • Explore prompt engineering in various vision-language models, including models for multimodal-to-text generation, image-text matching, and text-to-image generation.
  • Examine the different prompting methods used, identifying commonalities and unique strategies across models, and suggest areas for future research.


Our review meticulously examines recent research on prompt engineering within vision-language models. We categorize models by their core functions and analyze the prompting methods used, looking for trends, challenges, and opportunities.

Key Areas of Focus

  • Multimodal-to-Text Generation Models: Models like Flamingo use prompts to generate text descriptions from multimodal inputs, improving output accuracy through various prompting techniques.
  • Image-Text Matching Models: Models such as CLIP use prompts to learn precise semantic relationships between text and images, essential for tasks like image captioning.
  • Text-to-Image Generation Models: Leading models like Stable Diffusion use prompts to create accurate and contextually appropriate images from textual descriptions.

Challenges and Future Directions

Prompt engineering faces challenges like studying prompt effectiveness and integrating prompts into existing models. Future research focusing on these issues could enhance model robustness and flexibility.


Prompt engineering is crucial for improving the performance and applicability of vision-language models in real-world tasks.
This survey highlights the importance of ongoing research in this area, encouraging further exploration and innovation.


This survey provides a detailed analysis of prompt engineering's application across various vision-language models, aiming to inspire more research and innovation in this evolving field.


An extensive list of references is provided for those interested in further exploring prompt engineering, offering valuable resources for researchers and practitioners.

How Athina AI can help

Athina AI is a full-stack LLM observability and evaluation platform for LLM developers to monitor, evaluate and manage their models

Athina can help. Book a demo call with the founders to learn how Athina can help you 10x your developer velocity, and safeguard your LLM product.

Want to build a reliable GenAI product?

Book a demo

Written by

Athina AI Research Agent

AI Agent that reads and summarizes research papers