13-11-2024 | Original Article

Role of visual information in multimodal large language model performance: an evaluation using the Japanese nuclear medicine board examination

Authors: Takashi Watanabe, Akira Baba, Takeshi Fukuda, Ken Watanabe, Jun Woo, Hiroya Ojiri

Published in: Annals of Nuclear Medicine | Issue 2/2025

Abstract

Objectives

This study aimed to assess the performance of state-of-the-art multimodal large language models (LLMs), specifically GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, on Japanese Nuclear Medicine Board Examination (JNMBE) questions and to evaluate the influence of visual information on the decision-making process.

Methods

This study used 92 image-containing questions from the JNMBE (2019–2023). Each LLM's responses were assessed under two conditions: text with the accompanying images, and text only. Each model answered every question three times, and the most frequent choice was taken as the final answer. Accuracy and the agreement rates among the models' answers were evaluated with statistical tests.
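The three-runs-with-majority-vote protocol and the paired comparison between conditions can be made concrete with a short sketch. The abstract does not name the specific statistical tests, so the exact McNemar test below is an assumption (it is a common choice for paired per-question correctness); the question data and answer labels are illustrative.

```python
# Minimal sketch of the aggregation protocol described above. The exact
# McNemar test is an assumed choice; the abstract does not specify the tests.
from collections import Counter
from scipy.stats import binomtest

def majority_answer(runs: list[str]) -> str:
    """Most frequent choice across repeated runs (ties broken by first
    occurrence -- one possible convention; the paper does not say)."""
    return Counter(runs).most_common(1)[0][0]

def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Exact McNemar p-value for paired per-question correctness under two
    conditions (e.g., text-and-image vs. text-only)."""
    b = sum(x and not y for x, y in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(not x and y for x, y in zip(correct_a, correct_b))  # A wrong, B right
    if b + c == 0:
        return 1.0  # no discordant pairs
    return binomtest(b, b + c, 0.5).pvalue

# Example: three runs on one question, majority vote as the final answer.
print(majority_answer(["c", "c", "b"]))  # -> "c"
```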

Results

GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro showed no significant differences in accuracy between the text-and-image and text-only conditions. GPT-4o and Claude 3 Opus each achieved an accuracy of 54.3% (95% CI: 44.2%–64.1%) when provided with both text and images; however, they selected the same option as in the text-only condition for 71.7% of the questions. Gemini 1.5 Pro performed significantly worse than GPT-4o under the text-and-image condition. The agreement rates among the models' answers ranged from weak to moderate.
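As a rough illustration of how inter-model agreement of this kind is typically quantified, the sketch below computes Cohen's kappa between two models' final answers. The abstract does not specify which agreement coefficient was used, and the answer vectors are invented.

```python
# Minimal sketch assuming Cohen's kappa as the agreement statistic; the
# abstract reports only "weak to moderate" agreement without naming it.
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two models' categorical answers on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

# Illustrative answer vectors, not the study's data.
gpt4o  = ["a", "b", "c", "d", "a"]
claude = ["a", "b", "d", "d", "b"]
print(round(cohens_kappa(gpt4o, claude), 3))  # ~0.474, i.e., moderate
```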

Conclusion

The influence of images on decision-making in nuclear medicine is limited even in the latest multimodal LLMs, and their diagnostic ability in this highly specialized field remains insufficient. Improving the use of image information and enhancing answer reproducibility are crucial for the effective application of LLMs in nuclear medicine education and practice. Further advances in these areas are needed to realize the potential of LLMs as assistants in nuclear medicine diagnosis.
Metadata
Title
Role of visual information in multimodal large language model performance: an evaluation using the Japanese nuclear medicine board examination
Authors
Takashi Watanabe
Akira Baba
Takeshi Fukuda
Ken Watanabe
Jun Woo
Hiroya Ojiri
Publication date
13-11-2024
Publisher
Springer Nature Singapore
Published in
Annals of Nuclear Medicine / Issue 2/2025
Print ISSN: 0914-7187
Electronic ISSN: 1864-6433
DOI
https://doi.org/10.1007/s12149-024-01992-8