OpenAI's GPT-4 matches doctors in eye assessments
A recent study conducted by the School of Clinical Medicine at the University of Cambridge has found that OpenAI's GPT-4 nearly matches the performance of expert ophthalmologists in eye assessments. The research, published in PLOS Digital Health, tested several large language models (LLMs), including GPT-4, its predecessor GPT-3.5, Google's PaLM 2, and Meta's LLaMA, using a textbook for training ophthalmologists that is not freely available to the public.
AI model and medical professionals tested on eye assessment
The study involved administering a test of 87 multiple-choice questions drawn from that textbook. The test was given both to the large language models (LLMs) and to a group of medical professionals: five expert ophthalmologists, three trainee ophthalmologists, and two junior doctors from non-specialized fields. Notably, the LLMs are not believed to have been trained on these specific questions beforehand.
GPT-4 surpasses trainees and junior doctors in test
ChatGPT, running either GPT-4 or GPT-3.5, was given three attempts to answer each question definitively; if it failed to do so, its response was marked as null. The results were striking. GPT-4 answered 60 of the 87 questions correctly, well above the junior doctors' average of 37 and marginally ahead of the trainees' average of 59.7. Notably, one of the expert ophthalmologists answered only 56 questions correctly.
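The marking scheme described above can be pictured with a short sketch. The snippet below is a hypothetical illustration only, not the study's actual harness: the grade_model function, the ask callable, and the question format are assumptions introduced here for clarity.

```python
import re
from typing import Callable, Optional

def grade_model(questions: list[dict], ask: Callable[[str], str],
                max_attempts: int = 3) -> int:
    """Count correct multiple-choice answers under the retry policy above.

    `questions` holds {"prompt": str, "answer": "A"|"B"|"C"|"D"} entries and
    `ask` is any callable that sends a prompt to a model and returns its reply;
    both are placeholders, not the study's actual code.
    """
    correct = 0
    for q in questions:
        choice: Optional[str] = None
        for _ in range(max_attempts):
            reply = ask(q["prompt"]).strip().upper()
            match = re.match(r"([A-D])\b", reply)  # definitive single-letter choice?
            if match:
                choice = match.group(1)
                break
        # No definitive answer after three attempts counts as null (scored wrong).
        if choice == q["answer"]:
            correct += 1
    return correct
```

Under this kind of scheme, a model is only credited when it commits to one answer option, which mirrors how the human participants were scored.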
How expert ophthalmologists and other LLMs performed
Despite GPT-4's strong showing, the five expert ophthalmologists averaged 66.4 correct answers, slightly outperforming GPT-4. Among the other large language models (LLMs), Google's PaLM 2 scored 49 and GPT-3.5 scored 42. Meta's LLaMA came last with 28, falling below even the junior doctors. The trials were conducted in mid-2023.
Researchers highlight risks and concerns with LLMs
Despite the promising results, the researchers pointed out several risks and concerns associated with large language models (LLMs). The study included a limited number of questions, especially in certain categories, so real-world performance may differ. LLMs also tend to "hallucinate," or fabricate information, which could lead to false claims about conditions such as cataracts or cancer. These systems also often lack nuance, creating further opportunities for inaccuracies.