The more information given to the artificial intelligence tool ChatGPT in a prompt, the less reliable its answers become, according to an Australian study.
A world-first study by CSIRO and the University of Queensland found that when ChatGPT is asked a health-related question, giving it more detail reduced the accuracy of its responses to as low as 28%.
ChatGPT launched in November 2022 and has quickly become one of the most widely used large language models (LLMs), a form of artificial intelligence that can recognise, translate, summarise, predict and generate text.
But as LLMs explode in popularity, there is growing concern that they pose a potential risk to the increasing number of people who turn to online tools for key health information.
Scientists from CSIRO, Australia’s national science agency, and the University of Queensland explored a hypothetical scenario of an average person (non-professional health consumer) asking ChatGPT if ‘X’ treatment has a positive effect on condition ‘Y.’
The 100 questions presented ranged from ‘Can zinc help treat the common cold?’ to ‘Will drinking vinegar dissolve a stuck fish bone?’ and ChatGPT’s response was compared to the known correct response, or ‘ground truth,’ based on existing medical knowledge.
CSIRO Principal Research Scientist and Associate Professor at UQ, Dr Bevan Koopman, said that even though the risks of searching for health information online were well documented, people continued to do so, increasingly via tools such as ChatGPT.
“The widespread popularity of using LLMs online for answers on people’s health is why we need continued research to inform the public about risks and to help them optimise the accuracy of their answers,” Dr Koopman said.
“While LLMs have the potential to greatly improve the way people access information, we need more research to understand where they are effective and where they are not.”
The study looked at two question formats: the first was the question on its own, while the second was the same question accompanied by evidence that either supported or contradicted the correct answer.
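The study's exact prompt wording is not given here, so the short Python sketch below is only an illustration of the two formats, plus the "unsure" variant mentioned later; the question text, evidence passage and answer instructions are all assumptions, not the researchers' prompts.

```python
# Illustrative sketch only: the study's actual prompt wording is not reported in
# this article. The question, evidence passage and answer instructions below are
# assumptions meant to show the two formats, not the researchers' exact prompts.

QUESTION = "Can zinc help treat the common cold?"  # one of the 100 example questions

# Format 1: question-only prompt
question_only = f"Answer Yes or No: {QUESTION}"

# Format 2: evidence-biased prompt -- the same question plus a passage that
# either supports or contradicts the known correct ("ground truth") answer.
supporting_evidence = (
    "A review reported that zinc lozenges shortened the duration of colds "
    "in some trials."  # hypothetical evidence text, for illustration only
)
evidence_biased = (
    f"Evidence: {supporting_evidence}\n"
    f"Based on the evidence above, answer Yes or No: {QUESTION}"
)

# Variant discussed below: also allowing an "Unsure" answer.
evidence_biased_with_unsure = evidence_biased.replace(
    "answer Yes or No", "answer Yes, No or Unsure"
)

if __name__ == "__main__":
    for name, prompt in [
        ("question-only", question_only),
        ("evidence-biased", evidence_biased),
        ("evidence-biased + unsure", evidence_biased_with_unsure),
    ]:
        print(f"--- {name} ---\n{prompt}\n")
```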
Results revealed that ChatGPT was quite good at giving accurate answers in the question-only format, with 80% accuracy.
However, when the language model was given an evidence-biased prompt, accuracy fell to 63%. It fell further, to 28%, when an "unsure" answer was allowed, and the researchers highlighted that this finding was contrary to the common belief that prompting with evidence improves accuracy.
“We are not sure why this happens. But given this occurs whether the evidence given is correct or not, perhaps the evidence adds too much noise, thus lowering accuracy,” Dr Koopman said.
Study co-author, UQ Professor Guido Zuccon, the Director of AI for the Queensland Digital Health Centre, said major search engines are now integrating LLMs and search technologies in a process called retrieval augmented generation.
“We demonstrate that the interaction between the LLM and the search component was still poorly understood and controllable, resulting in the generation of inaccurate health information,” Professor Zuccon said.
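Retrieval augmented generation is the general pattern Professor Zuccon refers to: a search component first retrieves passages, which are then fed to the LLM as context for its answer. The sketch below is a minimal, generic illustration of that pipeline, not any search engine's actual system; the toy corpus, the keyword-overlap retriever and the generate() placeholder are assumptions for demonstration only.

```python
# Minimal, generic sketch of retrieval augmented generation (RAG): a search step
# retrieves passages, which are prepended to the user's question before it is
# sent to an LLM. The toy corpus, scoring and generate() stub are assumptions,
# not the pipelines used by real search engines.

from typing import List

TOY_CORPUS = [
    "Zinc lozenges may shorten the duration of the common cold in some trials.",
    "There is no evidence that drinking vinegar dissolves a stuck fish bone.",
    "Vitamin C supplementation has a modest effect on cold duration.",
]

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Rank passages by naive keyword overlap with the query (a stand-in for a real search engine)."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_terms & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    """Placeholder for the LLM call; a real system would send `prompt` to a model here."""
    return f"[LLM answer to a prompt of {len(prompt)} characters]"

def rag_answer(question: str) -> str:
    passages = retrieve(question, TOY_CORPUS)
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Use the following search results to answer the question.\n"
        f"{context}\n"
        f"Question: {question}\nAnswer:"
    )
    # The study's concern: the retrieved context can bias the model's answer,
    # so the interaction between retrieval and generation needs careful study.
    return generate(prompt)

if __name__ == "__main__":
    print(rag_answer("Can zinc help treat the common cold?"))
```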
The study was recently presented at the Empirical Methods in Natural Language Processing Conference, where the team explained that their next research focus was to investigate how the public uses the health information generated by LLMs.