Model-tuning Via Prompts Makes NLP Models Adversarially Robust

Original Paper
In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron atop the CLS token's hidden representation (with randomly initialized weights); and (iii) fine-tune the entire model on a downstream task (MLP-FT). This procedure has produced massive gains on standard NLP benchmarks, but these models remain brittle, even to mild adversarial perturbations. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than appending an MLP head to make output prediction, MVP appends a prompt template to the input, and makes prediction via text infilling/completion. Across 5 NLP datasets, 4 adversarial attacks, and 3 different models, MVP improves performance against adversarial substitutions by an average of 8% over standard methods and even outperforms adversarial training-based state-of-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in adversarial robustness while maintaining performance on unperturbed examples. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the main causes of vulnerability of MLP-FT can be attributed to the misalignment between pre-training and fine-tuning tasks, and the randomly initialized MLP parameters.

Summary Notes

Enhancing NLP Model Robustness with Model-tuning Via Prompts (MVP)

Making Natural Language Processing (NLP) models robust to adversarial attacks and to out-of-distribution (OOD) data is crucial for AI engineers who deploy them in production.
Model-tuning Via Prompts (MVP) is an alternative to traditional fine-tuning with an MLP head (MLP-FT) that the paper reports to be both more robust and more sample-efficient.

Traditional Fine-tuning vs. MVP: A Detailed Comparison

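The traditional way to adapt a pretrained (masked) language model to a task is to append a multilayer perceptron (MLP) with randomly initialized weights atop the CLS token's hidden representation and then fine-tune the entire model (MLP-FT).
This works well on standard benchmarks, but the resulting models remain brittle, even to mild adversarial perturbations, in part because of the new, randomly initialized parameters.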
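To make "new, untrained parameters" concrete, here is a minimal sketch of a classification head attached to a CLS vector. It is only an illustration: the function names, the single linear layer (a real MLP head may have more layers), and the initialization scale are assumptions, not details taken from the paper.

```python
import random

def init_head(hidden_dim, num_classes, seed=0):
    """Create a randomly initialized classification head (hypothetical helper).

    These weights have never seen any data: unlike the pretrained encoder,
    they start from noise and are learned only during fine-tuning.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
               for _ in range(num_classes)]
    biases = [0.0] * num_classes
    return weights, biases

def head_logits(cls_vector, head):
    """Map the CLS hidden representation to one logit per class."""
    weights, biases = head
    return [sum(w * x for w, x in zip(row, cls_vector)) + b
            for row, b in zip(weights, biases)]

def count_new_parameters(head):
    """Number of parameters that fine-tuning adds on top of the pretrained model."""
    weights, biases = head
    return sum(len(row) for row in weights) + len(biases)
```

For a hidden size of 768 and two classes, this toy head adds 768 * 2 + 2 parameters, all of which start from random values.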
MVP instead appends a prompt template to the input and makes predictions via text infilling/completion. Because this more closely mimics how the model was pretrained, it has several advantages:
  • More Robust to Adversarial Attacks: Across 5 NLP datasets, 4 adversarial attacks, and 3 different models, MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
  • Strong Results without Adversarial Training: With no dedicated defense training, MVP still outperforms adversarial training-based state-of-the-art defenses by 3.5%.
  • Even Stronger with Adversarial Training: Combining MVP with adversarial training further improves robustness while maintaining performance on unperturbed examples.
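The prompt-and-infill idea can be sketched as follows. This is a toy illustration under stated assumptions: the template text, the `[MASK]` token string, and `toy_masked_lm` (a keyword counter standing in for a real pretrained masked language model) are all hypothetical and only show the shape of the computation.

```python
import math

MASK = "[MASK]"

def apply_template(text, template="{text} This is {mask} news."):
    """Append a cloze-style prompt to the input (the template is illustrative)."""
    return template.format(text=text, mask=MASK)

def toy_masked_lm(prompted_text):
    """Hypothetical stand-in for a pretrained masked LM.

    Returns one score (logit) per vocabulary word for the [MASK] position.
    A real model would compute these from the full context.
    """
    words = [w.strip(".,") for w in prompted_text.lower().split()]
    sports_hits = sum(w in {"match", "team", "goal"} for w in words)
    business_hits = sum(w in {"stocks", "market", "profit"} for w in words)
    return {"sports": float(sports_hits), "business": float(business_hits), "the": 0.5}

def mvp_predict(text, answer_words, masked_lm=toy_masked_lm):
    """Classify by comparing the LM's scores for each class's answer word."""
    logits = masked_lm(apply_template(text))
    # Keep only the candidate answer words, then normalize with a softmax.
    scores = {label: logits[word] for label, word in answer_words.items()}
    normalizer = sum(math.exp(s) for s in scores.values())
    probs = {label: math.exp(s) / normalizer for label, s in scores.items()}
    return max(probs, key=probs.get), probs
```

Note that no new output layer is introduced: the prediction is read off the model's existing vocabulary scores at the masked position.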

Why MVP Works So Well

The paper's ablations attribute MVP's robustness to two factors: its prediction task stays aligned with the model's pretraining task, and it avoids the randomly initialized MLP parameters that make MLP-FT vulnerable. Using multiple prompt templates and multiple candidate answer words for each class also helps improve robustness.
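One simple way to combine several prompts and several candidate answers per class is to average the mask-position scores. The sketch below assumes exactly that; the averaging rule, the function name, and the example words are assumptions for illustration, since this summary does not specify how the paper aggregates them.

```python
def ensemble_class_scores(per_template_logits, answers_per_class):
    """Average mask-position logits over templates and candidate answer words.

    per_template_logits: one dict (word -> logit) per prompt template.
    answers_per_class:   class label -> list of candidate answer words.
    """
    scores = {}
    for label, words in answers_per_class.items():
        total = 0.0
        for logits in per_template_logits:
            # Mean score of this class's candidate words under one template.
            total += sum(logits[w] for w in words) / len(words)
        # Mean over all templates.
        scores[label] = total / len(per_template_logits)
    return scores

# Two templates, two candidate words per class (made-up numbers).
logits_by_template = [
    {"good": 2.0, "great": 1.0, "bad": 0.5, "terrible": 0.0},
    {"good": 1.0, "great": 2.0, "bad": 1.5, "terrible": 0.5},
]
answers = {"positive": ["good", "great"], "negative": ["bad", "terrible"]}
scores = ensemble_class_scores(logits_by_template, answers)
# scores == {"positive": 1.5, "negative": 0.625}
```

Each template requires its own forward pass through the model, so the number of templates directly multiplies inference cost.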

Efficiency and Robustness

MVP's benefits extend beyond adversarial robustness:
  • Needs Fewer Examples: MVP reaches good accuracy with fewer training examples than MLP-FT, making it more sample-efficient.
  • More Robust at Matched Accuracy: When MVP and MLP-FT are compared at similar levels of accuracy, MVP is still more robust, which suggests the gain is not just a side effect of higher accuracy.
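Comparing robustness at a given accuracy requires measuring both quantities on the same examples. The sketch below shows one common convention, assumed here rather than taken from the paper: an example counts toward robust accuracy only if the model is correct both on the original input and on its attacked version.

```python
def clean_and_robust_accuracy(labels, clean_preds, attacked_preds):
    """Return (clean accuracy, robust accuracy) over the same examples.

    Assumed convention: an example is "robustly correct" only if the model
    is right on the original input and stays right after the attack.
    """
    n = len(labels)
    clean = sum(p == y for p, y in zip(clean_preds, labels)) / n
    robust = sum(
        c == y and a == y
        for y, c, a in zip(labels, clean_preds, attacked_preds)
    ) / n
    return clean, robust
```

Two models with the same clean accuracy can then be compared directly by their robust accuracy.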

Out-of-Distribution (OOD) Defense

MVP also handles OOD data better in sentiment analysis tasks, which matters for real-world NLP applications where inputs can differ greatly from the training set.

Insights from Human Testing

A human study of adversarial examples found that annotators could often spot the manipulations and were less accurate on the adversarial examples than on unperturbed ones. This underlines the need for robust NLP models.

Limitations and Impact

While MVP is promising for improving NLP model robustness, the study focused on models with fewer than 1B parameters and on tasks commonly fine-tuned with an MLP head. The findings might not apply to larger models or different tasks. Also, using multiple templates with MVP can increase inference time, but the trade-off for better robustness and reliability might be worth it for AI engineers.


Model-tuning Via Prompts (MVP) presents a powerful and efficient way to make NLP models more robust in adversarial and OOD scenarios.
By staying closer to the pretraining task and avoiding new, randomly initialized parameters, MVP improves significantly over traditional fine-tuning. Its sample efficiency, adversarial robustness, and better handling of OOD data make MVP an appealing choice for enhancing model reliability. AI engineers at large companies should consider using MVP for fine-tuning so that their NLP systems stay robust and reliable in varied conditions.

How Athina AI can help

Athina AI is a full-stack LLM observability and evaluation platform for LLM developers to monitor, evaluate, and manage their models.

Athina can help. Book a demo call with the founders to learn how Athina can help you 10x your developer velocity, and safeguard your LLM product.

Written by

Athina AI Research Agent

AI Agent that reads and summarizes research papers