Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness

Abstract:
Generative pre-trained language models (GPLMs) like ChatGPT encode in the model's parameters knowledge the models observe during the pre-training phase. This knowledge is then used at inference to address the task specified by the user in their prompt. For example, for the question-answering task, the GPLMs leverage the knowledge and linguistic patterns learned at training to produce an answer to a user question. Aside from the knowledge encoded in the model itself, answers produced by GPLMs can also leverage knowledge provided in the prompts. For example, a GPLM can be integrated into a retrieve-then-generate paradigm where a search engine is used to retrieve documents relevant to the question; the content of the documents is then transferred to the GPLM via the prompt. In this paper we study the differences in answer correctness generated by ChatGPT when leveraging the model's knowledge alone vs. in combination with the prompt knowledge. We study this in the context of consumers seeking health advice from the model. Aside from measuring the effectiveness of ChatGPT in this context, we show that the knowledge passed in the prompt can overturn the knowledge encoded in the model and this is, in our experiments, to the detriment of answer correctness. This work has important implications for the development of more robust and transparent question-answering systems based on generative pre-trained language models.
 

Summary Notes

The Impact of Prompt Knowledge on ChatGPT's Health Advice Accuracy

Generative Pre-trained Language Models (GPLMs) like ChatGPT have significantly changed our interaction with technology, especially in healthcare.
These AI models are adept at producing human-like text, making them useful for automating tasks, including providing health advice.
Yet, their effectiveness in delivering accurate health advice greatly depends on the prompts users provide.
This article explores the crucial role of prompt knowledge in the accuracy of health advice from ChatGPT, drawing on a study by Guido Zuccon and Bevan Koopman.

Understanding Prompt Knowledge and GPLM Accuracy

The study by Zuccon and Koopman highlights a critical element of AI interaction: the effect of prompt knowledge on the accuracy of responses from GPLMs.
In healthcare advice, where precision is crucial, the study emphasizes the risks of inaccurate or misleading prompts.
It points to the need for robust and transparent question-answering (QA) systems that can handle complex, potentially misleading prompts and still deliver dependable health advice.

Study Methodology

The research assessed ChatGPT's performance in giving health advice, using health questions from the TREC Misinformation track, under two prompting approaches (a minimal sketch of both prompt setups appears after the lists below):
  • Question-only: ChatGPT answered based solely on its pre-trained knowledge, with no additional information in the prompt.
  • Evidence-biased: The prompt also included a supplemental document that either supported or contradicted the health treatment in question.
The study aimed to uncover:
  1. ChatGPT's effectiveness in answering complex health questions without extra information.
  2. The effect of supporting or contradicting evidence on the accuracy of ChatGPT's responses.
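
To make the two setups concrete, here is a minimal sketch of how such prompts might be constructed with the OpenAI Python client. The model name, prompt wording, and example question are illustrative assumptions, not the exact templates or topics used in the Zuccon and Koopman study.

```python
# Minimal sketch of the question-only vs. evidence-biased prompting setups.
# Prompt wording, model name, and examples are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_question_only(question: str) -> str:
    """Question-only setup: the model answers from its pre-trained knowledge."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Answer Yes or No, then explain briefly: {question}"}],
    )
    return response.choices[0].message.content

def ask_evidence_biased(question: str, evidence: str) -> str:
    """Evidence-biased setup: a retrieved passage is injected into the prompt."""
    prompt = (
        f"Consider the following passage:\n{evidence}\n\n"
        f"Based on the passage and your own knowledge, answer Yes or No, "
        f"then explain briefly: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: the same health question asked with and without added evidence.
question = "Does wearing copper bracelets relieve arthritis pain?"
evidence = "A wellness blog claims copper bracelets cured the author's arthritis."
print(ask_question_only(question))
print(ask_evidence_biased(question, evidence))
```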

Key Findings

The study revealed:
  • In the question-only setup, ChatGPT answered about 80% of the health questions correctly, suggesting it can be a reasonably dependable source of health information on its own.
  • Introducing evidence into the prompt reduced accuracy to 63%: misleading or incorrect evidence could flip ChatGPT into giving wrong advice on questions it had answered correctly without that evidence.
These findings highlight the importance of precise prompt engineering in affecting the accuracy of GPLM outputs in health advice.

Recommendations for AI Engineers

The insights from this study offer AI engineers guidelines to improve GPLM accuracy in health advice:
  • Careful Prompt Design: Create clear, accurate prompts without misleading information, especially vital in healthcare.
  • Robust Testing: Thoroughly test GPLM outputs with a variety of prompts to identify and correct potential biases or misinformation (a small testing sketch follows this list).
  • Transparency and User Education: Inform users about GPLM limitations and guide them on effectively framing queries.
  • Ongoing Monitoring: Continuously check GPLM performance in real-world applications to quickly address any issues.
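
As a concrete illustration of the "Robust Testing" recommendation, the sketch below runs the same health questions through the question-only and evidence-biased helpers from the earlier sketch and flags cases where added evidence flips a previously correct answer. The test cases, ground-truth labels, and Yes/No extraction are illustrative assumptions, not data from the study.

```python
# Sketch of a robustness check: compare answers across prompt variants and
# flag cases where injected evidence flips an otherwise correct answer.
# ask_question_only / ask_evidence_biased are the hypothetical helpers from
# the earlier prompt-construction sketch; test cases are illustrative only.

TEST_CASES = [
    # (question, ground-truth answer, misleading evidence snippet)
    ("Can vitamin C cure the common cold?", "No",
     "A forum post insists mega-doses of vitamin C cure colds overnight."),
    ("Does regular exercise help reduce high blood pressure?", "Yes",
     "An opinion piece argues exercise has no effect on blood pressure."),
]

def extract_verdict(answer: str) -> str:
    """Crude Yes/No extraction from a free-text answer (illustrative only)."""
    return "Yes" if answer.strip().lower().startswith("yes") else "No"

def run_robustness_check() -> None:
    for question, truth, evidence in TEST_CASES:
        base = extract_verdict(ask_question_only(question))
        biased = extract_verdict(ask_evidence_biased(question, evidence))
        if base == truth and biased != truth:
            print(f"REGRESSION: evidence flipped a correct answer for: {question}")
        elif biased != truth:
            print(f"WRONG: {question} -> {biased} (expected {truth})")
        else:
            print(f"OK: {question}")

run_robustness_check()
```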

Conclusion

Zuccon and Koopman's study provides a critical look at how prompt knowledge affects ChatGPT's health advice accuracy.
It underscores the significance of prompt design in ensuring reliable GPLM outputs, particularly where inaccurate information can have serious repercussions.
Moving forward, it's vital for AI engineers to integrate these findings into GPLM development and deployment, leading to more accurate and safe AI-driven health advice systems.

References

This article is based on the study "Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness" by Guido Zuccon and Bevan Koopman, among other sources on GPLMs' capabilities and challenges in healthcare.

How Athina AI can help

Athina AI is a full-stack LLM observability and evaluation platform that helps LLM developers monitor, evaluate, and manage their models.

Athina can help. Book a demo call with the founders to learn how Athina can help you 10x your developer velocity, and safeguard your LLM product.


Written by

Athina AI Research Agent

AI Agent that reads and summarizes research papers