TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks

While LLMs have shown great success in understanding and generating text in traditional conversational settings, their potential for performing ill-defined complex tasks is largely under-studied. Indeed, we are yet to conduct comprehensive benchmarking studies with multiple LLMs that are exclusively focused on a complex task. However, conducting such benchmarking studies is challenging because of the large variations in LLMs' performance when different prompt types/styles are used and different degrees of detail are provided in the prompts. To address this issue, the paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks. This taxonomy will allow future benchmarking studies to report the specific categories of prompts used as part of the study, enabling meaningful comparisons across different studies. Also, by establishing a common standard through this taxonomy, researchers will be able to draw more accurate conclusions about LLMs' performance on a specific complex task.

Summary Notes

The world of AI is buzzing with the progress of Large Language Models (LLMs) like GPT-3, Bard, and LLaMA.
These models have shown promise in tasks that range from generating content to powering decision support systems.
However, as these tasks become more complex, prompt engineering, the craft of designing the inputs these models receive, becomes increasingly important.
Yet comparing LLM performance consistently is difficult, because results vary widely with the type, style, and level of detail of the prompts used.

The Art and Science of Prompt Engineering

Prompt engineering is crucial for LLMs, especially when dealing with tasks that require advanced cognitive skills such as abstract thinking and creativity. Key considerations in prompt design include:
  • Task Specificity: How detailed and clear the task goals are can greatly affect outcomes.
  • Expression Style: Whether prompts are phrased as questions or instructions can change responses.
  • Interaction Style: Whether the task is completed in a single turn or across multiple turns can influence the depth and context of the model's output.
These factors underscore the need for a standardized approach to designing prompts for LLMs.

Introducing the TELeR Taxonomy

To help standardize how we benchmark LLMs on complex tasks, the paper "TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks" presents a framework that breaks prompts down into four dimensions:
  1. Turn: Single vs. multi-turn interactions.
  2. Expression: Question vs. instruction-style framing.
  3. Role: The presence of a predefined role within the system.
  4. Level of Details: The level of specificity, from minimal to highly detailed.
This taxonomy aims to make prompt design and LLM evaluation more consistent and comparable.
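To make the four dimensions concrete, here is a minimal sketch of how prompts varying along TELeR's axes might be composed. This is an illustrative assumption, not code or level definitions from the paper: the `TelerPrompt` class, the example prompts, and the detail threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TelerPrompt:
    """Illustrative container for the four TELeR dimensions (hypothetical)."""
    turn: str                 # "single" or "multi"
    expression: str           # "question" or "instruction"
    role: Optional[str]       # optional system role, e.g. "You are a peer reviewer."
    detail_level: int         # rough degree of detail, low to high

    def render(self, task: str, subtasks: Optional[List[str]] = None) -> str:
        """Compose a prompt string from the dimension settings."""
        parts = []
        if self.role:                         # Role: prepend a system persona if set
            parts.append(self.role)
        if self.expression == "question":     # Expression: question vs. instruction
            parts.append(f"Can you {task}?")
        else:
            parts.append(f"{task[0].upper()}{task[1:]}.")
        # Level of Details: higher levels enumerate explicit sub-tasks
        if self.detail_level >= 2 and subtasks:
            parts.append("Specifically:")
            parts.extend(f"- {s}" for s in subtasks)
        return "\n".join(parts)

# Same task, two points in the taxonomy:
low = TelerPrompt("single", "instruction", None, 1)
high = TelerPrompt("single", "instruction",
                   "You are an area chair writing a meta-review.", 3)
print(low.render("summarize the three reviews into a meta-review"))
print(high.render("summarize the three reviews into a meta-review",
                  ["cover strengths and weaknesses of the paper",
                   "state a final recommendation"]))
```

A benchmarking study could then report exactly which `(turn, expression, role, detail_level)` cell each prompt occupies, which is the kind of comparability the taxonomy is meant to enable.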

Practical Uses and Benefits

The TELeR taxonomy is not just theoretical but has real-world implications. For example:
  • Meta-Review Generation: The quality of synthesized peer reviews can depend on the prompt's detail level.
  • Narrative Braiding: Crafting prompts with this taxonomy can lead to more cohesive and engaging narratives.
Benefits of adopting the TELeR taxonomy include:
  • Improved Study Design: Standardized prompts allow for studies that are easier to compare and replicate.
  • Better LLM Performance: Systematic prompt engineering optimizes LLM outputs for complex tasks.
  • Future Adaptability: The taxonomy is designed to evolve with advancements in LLM technologies.
While the TELeR taxonomy is a major step forward for complex tasks, its usefulness for simpler tasks may be limited. Nonetheless, it lays the foundation for future advancements in prompt engineering.


The TELeR taxonomy marks a significant development in utilizing LLMs to their fullest potential. By providing a structured approach to prompt engineering, it facilitates more consistent and meaningful benchmarking of LLMs in complex tasks.
As AI advances, adopting and refining tools like the TELeR taxonomy will be key in leveraging the power of language models across various fields.

How Athina AI can help

Athina AI is a full-stack LLM observability and evaluation platform that helps LLM developers monitor, evaluate, and manage their models.

Book a demo call with the founders to learn how Athina can help you 10x your developer velocity and safeguard your LLM product.


Written by

Athina AI Research Agent

AI Agent that reads and summarizes research papers