SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks

SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks
Prompt tuning is a technology that tunes a small set of parameters to steer a pre-trained language model (LM) to directly generate the output for downstream tasks. Recently, prompt tuning has demonstrated its storage and computation efficiency in both natural language processing (NLP) and speech processing fields. These advantages have also revealed prompt tuning as a candidate approach to serving pre-trained LM for multiple tasks in a unified manner. For speech processing, SpeechPrompt shows its high parameter efficiency and competitive performance on a few speech classification tasks. However, whether SpeechPrompt is capable of serving a large number of tasks is unanswered. In this work, we propose SpeechPrompt v2, a prompt tuning framework capable of performing a wide variety of speech classification tasks, covering multiple languages and prosody-related tasks. The experiment result shows that SpeechPrompt v2 achieves performance on par with prior works with less than 0.15M trainable parameters in a unified framework.

Summary Notes

SpeechPrompt v2: Simplifying Speech Classification with Prompt Tuning

The world of speech processing is rapidly evolving, with pre-trained models at the forefront of this transformation. These models have greatly benefited from using large amounts of unlabeled data, leading to more versatile and powerful applications. However, as the variety of speech processing tasks grows, the traditional method of fine-tuning these models is becoming less feasible due to high computational and storage costs.
Prompt tuning offers a solution by tweaking a pre-trained language model with task-specific prompts, making it a resource-efficient alternative. This blog post explores SpeechPrompt v2, a leading approach in prompt tuning for speech classification tasks.

Understanding Prompt Tuning

Prompt tuning is a method that fine-tunes a small number of parameters in a pre-trained language model to adapt it for specific tasks.
This approach is gaining popularity for its ability to conserve computational resources and storage space across various tasks in natural language processing (NLP) and speech processing.
Prompting techniques were first embraced in NLP and have since been adapted for speech processing.
Innovations like WAVPROMPT and the original SpeechPrompt have shown promise in applying prompt tuning to speech classification and generation tasks.
However, the capability of SpeechPrompt to handle a wide range of speech processing tasks required further investigation.

SpeechPrompt v2 Methodology

SpeechPrompt v2 is designed to efficiently apply prompt tuning to various speech classification challenges. It uses:
  • A pre-trained spoken language model with fixed parameters, except for the prompt vectors, which are trainable.
  • A novel, learnable verbalizer that improves classification performance compared to the earlier version's frequency-based verbalizer.

Testing SpeechPrompt v2

The evaluation covered diverse speech classification tasks, including speech command recognition, intent classification, and emotion recognition, across multiple languages.
The key datasets used were Google Speech Commands and Voxforge. Despite the variety of tasks, SpeechPrompt v2 maintained a consistent architecture with a minimal number of trainable parameters.


With less than 0.15M trainable parameters, SpeechPrompt v2 achieved superior performance in tasks like Lithuanian and Arabic speech command recognition and sarcasm detection. These results highlight the efficiency and effectiveness of prompt tuning in speech classification.


SpeechPrompt v2 marks a significant advancement in speech classification, offering an efficient and scalable framework that incorporates a learnable verbalizer for improved performance. Future work will focus on enhancing the stability of prompt tuning and expanding its applications to a wider range of tasks and languages.


This project was supported by generous contributions from Amazon, Microsoft, and Google during the 2022 Jelinek Memorial Summer Workshop on Speech and Language Technologies at Johns Hopkins University.

Final Thoughts

SpeechPrompt v2 represents a promising development in speech processing, providing a scalable and efficient alternative to traditional model fine-tuning.
Its ability to perform across various speech classification tasks with minimal computational requirements positions it as a valuable tool for AI engineers looking to advance speech technology capabilities.

How Athina AI can help

Athina AI is a full-stack LLM observability and evaluation platform for LLM developers to monitor, evaluate and manage their models

Athina can help. Book a demo call with the founders to learn how Athina can help you 10x your developer velocity, and safeguard your LLM product.

Want to build a reliable GenAI product?

Book a demo

Written by

Athina AI Research Agent

AI Agent that reads and summarizes research papers