
Open Access 01-12-2024 | Research

Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences

Authors: Yong Liu, Shenggen Ju, Junfeng Wang

Published in: BMC Medical Informatics and Decision Making | Issue 1/2024


Abstract

Background

Telemedicine has grown rapidly in recent years, aiming to enhance medical efficiency and reduce the workload of healthcare professionals. It became especially crucial during the COVID-19 pandemic, enabling remote screening and access to healthcare services while maintaining social distancing. Online consultation platforms have emerged, but demand has strained the availability of medical professionals, directly motivating research and development in automated medical consultation. In particular, there is a need for efficient and accurate medical dialogue summarization algorithms that condense lengthy conversations into shorter versions focused on the relevant medical facts. The recent success of large language models such as the generative pre-trained transformer (GPT)-3 has prompted a paradigm shift in natural language processing (NLP) research. In this paper, we explore its impact on medical dialogue summarization.

Methods

We present the performance and evaluation results of two approaches on a medical dialogue dataset. The first approach is based on fine-tuned pre-trained language models, such as BERT-based summarization (BERTSUM) and Bidirectional and Auto-Regressive Transformers (BART). The second approach uses a large language model (LLM), GPT-3.5, with in-context learning (ICL). Evaluation is conducted using the automatic metrics ROUGE and BERTScore.
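The following minimal sketch illustrates the two approaches, assuming the Hugging Face Transformers and OpenAI Python packages; the checkpoint name, prompt wording, and in-context demonstration are illustrative assumptions rather than the authors' exact configuration.

from transformers import pipeline
from openai import OpenAI

# Approach 1: a fine-tuned sequence-to-sequence summarizer such as BART.
# A checkpoint fine-tuned on medical dialogues would replace this generic one.
bart = pipeline("summarization", model="facebook/bart-large-cnn")

def bart_summary(dialogue: str) -> str:
    return bart(dialogue, max_length=128, min_length=16)[0]["summary_text"]

# Approach 2: GPT-3.5 with in-context learning, i.e., a task instruction plus
# a demonstration pair prepended to the dialogue to be summarized.
client = OpenAI()  # expects OPENAI_API_KEY in the environment

ICL_PROMPT = (
    "Summarize each medical dialogue into the patient's chief complaint "
    "and the doctor's advice.\n\n"
    "Dialogue: Patient: I've had a dry cough for a week. Doctor: Any fever? "
    "Patient: No. Doctor: Rest and take an antitussive; return if fever develops.\n"
    "Summary: One-week dry cough without fever; rest and an antitussive advised.\n\n"
    "Dialogue: {dialogue}\nSummary:"
)

def gpt_summary(dialogue: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": ICL_PROMPT.format(dialogue=dialogue)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()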

Results

Compared with the BART and ChatGPT models, the summaries generated by the BERTSUM model not only exhibited significantly lower ROUGE and BERTScore values but also failed manual evaluation on every metric. The BART model, by contrast, achieved the highest ROUGE and BERTScore values among all evaluated models, surpassing ChatGPT: its ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore values were 14.94%, 53.48%, 32.84%, and 6.73% higher, respectively, than ChatGPT’s best results. However, in the manual evaluation by medical experts, the BART model’s summaries performed satisfactorily only on the “Readability” metric, with fewer than 30% passing manual evaluation on the other metrics. Compared with the BERTSUM and BART models, the ChatGPT model was clearly more favored by the human medical experts.

Conclusion

On one hand, the GPT-3.5 model can steer the style and content of medical dialogue summaries through different prompts. Its output is not only better received than the results of some human experts but also more comprehensible, making it a promising avenue for automated medical dialogue summarization. On the other hand, automatic evaluation mechanisms such as ROUGE and BERTScore fall short of fully assessing the outputs of large language models like GPT-3.5; further research into more appropriate evaluation criteria is therefore needed.
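For reference, the automatic metrics discussed above can be computed as in the sketch below, which assumes the rouge-score and bert-score packages; the paper's exact scoring configuration may differ.

from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(candidate: str, reference: str) -> dict:
    # ROUGE-1/2/L F1: n-gram and longest-common-subsequence overlap.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    results = {name: s.fmeasure for name, s in scorer.score(reference, candidate).items()}
    # BERTScore F1: contextual-embedding similarity, less tied to exact wording,
    # though still unable to capture preferences such as factual adequacy.
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    results["bertscore_f1"] = f1.item()
    return results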
Metadata
Title
Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences
Authors
Yong Liu
Shenggen Ju
Junfeng Wang
Publication date
01-12-2024
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2024
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-024-02481-8
