ChatGPT's Inconsistent Heart Risk Assessments Raise Concerns for Medical Use

PLOS ONE

May 1

A new study published in PLOS ONE has found that ChatGPT, despite its reported ability to pass medical exams, may not be reliable for certain health assessments, such as determining the urgency of a patient's chest pain. The research, conducted by Dr. Thomas Heston of Washington State University's Elson S. Floyd College of Medicine and Dr. Lawrence Lewis of Washington University in St. Louis, highlights the potential dangers of relying on the AI system in high-stakes clinical situations.

Inconsistent Results and Varying Risk Assessments

The researchers generated three datasets of 10,000 randomized, simulated cases each, using variables from the TIMI and HEART scales, which are traditionally used by physicians to assess heart risk. When presented with the same patient data, ChatGPT provided inconsistent conclusions, returning different risk assessment levels for identical cases.

"ChatGPT was not acting in a consistent manner," said lead author Dr. Thomas Heston. "Given the exact same data, ChatGPT would give a score of low risk, then next time an intermediate risk, and occasionally, it would go as far as giving a high risk."
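The kind of inconsistency Heston describes can be put into numbers by asking how often repeated runs on an identical case return the same risk level. The Python sketch below is only an illustration of that idea, not the authors' analysis code: the function names `agreement_rate` and `label_spread` are hypothetical, and the labels in `example_runs` are hand-written for the example rather than taken from the study.

```python
from collections import Counter


def agreement_rate(runs_per_case):
    """Return the fraction of cases whose repeated runs all gave the same label."""
    if not runs_per_case:
        raise ValueError("need at least one case")
    consistent = sum(1 for labels in runs_per_case if len(set(labels)) == 1)
    return consistent / len(runs_per_case)


def label_spread(labels):
    """Count how often each risk label appeared across repeated runs of one case."""
    return Counter(labels)


# Hypothetical labels: each inner list holds four repeated assessments
# of the SAME simulated case. These values are made up for illustration.
example_runs = [
    ["low", "low", "low", "low"],                             # fully consistent
    ["low", "intermediate", "low", "high"],                   # three different answers
    ["intermediate", "intermediate", "low", "intermediate"],  # one outlier
]

if __name__ == "__main__":
    print(f"Share of fully consistent cases: {agreement_rate(example_runs):.2f}")
    print("Spread for the second case:", dict(label_spread(example_runs[1])))
```

A tool suited to triage would need an agreement rate at or near 1.0 on identical inputs; in this made-up sample only one of the three cases meets that bar.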

The authors believe this inconsistency stems from the level of randomness built into the current version of the software, ChatGPT4, which helps it vary its responses to simulate natural language. That same randomness, however, works poorly for healthcare uses that require a single, consistent answer.

The Potential and Pitfalls of AI in Healthcare

Despite the negative findings of this study, Dr. Heston sees great potential for generative AI in healthcare once the technology is developed further. For instance, assuming privacy standards could be met, entire medical records could be loaded into the program, allowing doctors to quickly access the most pertinent facts about a patient in an emergency setting. For difficult, complex cases, doctors could also ask the program to generate several possible diagnoses.

"ChatGPT could be excellent at creating a differential diagnosis and that's probably one of its greatest strengths," said Heston. "If you don't quite know what's going on with a patient, you could ask it to give the top five diagnoses and the reasoning behind each one. So it could be good at helping you think through a problem, but it's not good at giving the answer."

The researchers emphasize the importance of conducting further research on AI systems like ChatGPT, especially in high-stakes clinical situations, to ensure their safe and effective use in healthcare.

"We found there was a lot of variation, and that variation in approach can be dangerous," Heston said. "It can be a useful tool, but I think the technology is going a lot faster than our understanding of it, so it's critically important that we do a lot of research."
