
Open Access 01-12-2024 | Research

Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences

Authors: Yong Liu, Shenggen Ju, Junfeng Wang

Published in: BMC Medical Informatics and Decision Making | Issue 1/2024


Abstract

Background

Telemedicine has grown rapidly in recent years, aiming to enhance medical efficiency and reduce the workload of healthcare professionals. It became especially crucial during the COVID-19 pandemic, enabling remote screening and access to healthcare services while maintaining social distancing. Online consultation platforms have emerged, but demand has strained the availability of medical professionals, directly motivating research and development in automated medical consultation. In particular, there is a need for efficient and accurate medical dialogue summarization algorithms that condense lengthy conversations into shorter versions focused on the relevant medical facts. The recent success of large language models such as the generative pre-trained transformer (GPT)-3 has prompted a paradigm shift in natural language processing (NLP) research. In this paper, we explore its impact on medical dialogue summarization.

Methods

We present the performance and evaluation results of two approaches on a medical dialogue dataset. The first approach is based on fine-tuned pre-trained language models, such as BERT-based summarization (BERTSUM) and Bidirectional and Auto-Regressive Transformers (BART). The second approach uses a large language model (LLM), GPT-3.5, with in-context learning (ICL). Evaluation is conducted using the automatic metrics ROUGE and BERTScore.
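The following minimal sketch illustrates the two approaches, assuming the Hugging Face Transformers and OpenAI Python packages; the checkpoint name, prompt wording, and in-context demonstration are illustrative assumptions rather than the authors' exact configuration.

from transformers import pipeline
from openai import OpenAI

# Approach 1: a fine-tuned sequence-to-sequence summarizer such as BART.
# A checkpoint fine-tuned on medical dialogues would replace this generic one.
bart = pipeline("summarization", model="facebook/bart-large-cnn")

def bart_summary(dialogue: str) -> str:
    return bart(dialogue, max_length=128, min_length=16)[0]["summary_text"]

# Approach 2: GPT-3.5 with in-context learning, i.e., a task instruction plus
# a demonstration pair prepended to the dialogue to be summarized.
client = OpenAI()  # expects OPENAI_API_KEY in the environment

ICL_PROMPT = (
    "Summarize each medical dialogue into the patient's chief complaint "
    "and the doctor's advice.\n\n"
    "Dialogue: Patient: I've had a dry cough for a week. Doctor: Any fever? "
    "Patient: No. Doctor: Rest and take an antitussive; return if fever develops.\n"
    "Summary: One-week dry cough without fever; rest and an antitussive advised.\n\n"
    "Dialogue: {dialogue}\nSummary:"
)

def gpt_summary(dialogue: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": ICL_PROMPT.format(dialogue=dialogue)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()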

Results

Compared with the BART and ChatGPT models, the summaries generated by the BERTSUM model not only exhibited significantly lower ROUGE and BERTScore values but also failed manual evaluation on every metric. The BART model, by contrast, achieved the highest ROUGE and BERTScore values among all evaluated models, surpassing ChatGPT: its ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore values were 14.94%, 53.48%, 32.84%, and 6.73% higher, respectively, than ChatGPT’s best results. However, in the manual evaluation by medical experts, the BART model’s summaries performed satisfactorily only on the “Readability” metric, with fewer than 30% passing manual evaluation on the other metrics. Compared with the BERTSUM and BART models, the ChatGPT model was clearly more favored by the human medical experts.

Conclusion

On one hand, the GPT-3.5 model can steer the style and content of medical dialogue summaries through different prompts. Its output is not only better received than the results of some human experts but also more comprehensible, making it a promising avenue for automated medical dialogue summarization. On the other hand, automatic evaluation mechanisms such as ROUGE and BERTScore fall short of fully assessing the outputs of large language models like GPT-3.5; further research into more appropriate evaluation criteria is therefore needed.
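For reference, the automatic metrics discussed above can be computed as in the sketch below, which assumes the rouge-score and bert-score packages; the paper's exact scoring configuration may differ.

from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(candidate: str, reference: str) -> dict:
    # ROUGE-1/2/L F1: n-gram and longest-common-subsequence overlap.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    results = {name: s.fmeasure for name, s in scorer.score(reference, candidate).items()}
    # BERTScore F1: contextual-embedding similarity, less tied to exact wording,
    # though still unable to capture preferences such as factual adequacy.
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    results["bertscore_f1"] = f1.item()
    return results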
Metadata
Title
Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences
Authors
Yong Liu
Shenggen Ju
Junfeng Wang
Publication date
01-12-2024
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2024
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-024-02481-8
