Published in: BMC Medical Informatics and Decision Making 1/2023

Open Access 01-12-2023 | Research

Swin Unet3D: a three-dimensional medical image segmentation network combining vision transformer and convolution

Authors: Yimin Cai, Yuqing Long, Zhenggong Han, Mingkun Liu, Yuchen Zheng, Wei Yang, Liming Chen


Abstract

Background

Semantic segmentation of brain tumors plays a critical role in clinical treatment, especially for three-dimensional (3D) magnetic resonance imaging, which is widely used in clinical practice. Automatic segmentation of the 3D structure of brain tumors helps physicians quickly assess tumor properties such as shape and size, improving the efficiency of preoperative planning and the odds of successful surgery. Over the past decades, 3D convolutional neural networks (CNNs) have dominated automatic segmentation of 3D medical images, and these architectures have achieved good results. However, to limit the number of parameters, practitioners generally keep the kernel size of 3D convolutions at or below \(7 \times 7 \times 7\), which in turn limits the ability of CNNs to learn long-distance dependency information. The Vision Transformer (ViT) excels at learning long-distance dependencies in images, but it carries a large number of parameters. Worse, when training data are insufficient, ViT fails to learn local dependency information in its early layers, even though learning such local information in early layers has a large impact on segmentation performance.
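The cubic growth of parameter count with kernel size motivates the \(7 \times 7 \times 7\) ceiling mentioned above. As a minimal illustration (not from the paper), the following sketch counts the learnable parameters of a single 3D convolution layer for a few kernel sizes:

```python
def conv3d_params(in_ch: int, out_ch: int, k: int, bias: bool = True) -> int:
    """Learnable parameters of a 3D convolution with a k x k x k kernel.

    Each of the out_ch filters has in_ch * k^3 weights, plus one optional bias.
    """
    return out_ch * (in_ch * k ** 3 + (1 if bias else 0))

# Parameter count grows with k^3: doubling k roughly multiplies the cost by 8.
for k in (3, 7, 15):
    print(k, conv3d_params(64, 64, k))
```

For 64 input and output channels, moving from a \(7^3\) to a \(15^3\) kernel inflates a single layer from about 1.4M to about 13.8M parameters, which is why larger receptive fields are usually obtained by stacking layers rather than enlarging kernels.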

Methods

This paper proposes the Swin Unet3D model, which formulates voxel segmentation of medical images as a sequence-to-sequence prediction. The feature-extraction sub-module of the model is designed as a parallel structure of convolution and ViT branches, so that every layer of the model can adequately learn both global and local dependency information in the image.
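The parallel convolution/ViT idea can be sketched in miniature. The toy block below is an assumption-laden stand-in, not the authors' implementation: a crude box filter plays the role of the convolutional (local) branch, a single-head softmax self-attention over all voxel tokens plays the role of the ViT (global) branch, and the two outputs are fused by summation:

```python
import numpy as np

def local_branch(x: np.ndarray) -> np.ndarray:
    """Local operator: 3-voxel box average along each spatial axis
    (a stand-in for a small-kernel 3D convolution)."""
    out = x.copy()
    for axis in range(3):  # average over the three spatial axes
        out = (np.roll(out, 1, axis) + out + np.roll(out, -1, axis)) / 3.0
    return out

def global_branch(x: np.ndarray) -> np.ndarray:
    """Global operator: single-head self-attention over all voxels
    (a stand-in for a ViT block; every voxel attends to every other)."""
    d, h, w, c = x.shape
    tokens = x.reshape(-1, c)                    # (N, C) voxel tokens
    scores = tokens @ tokens.T / np.sqrt(c)      # (N, N) pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax rows sum to 1
    return (attn @ tokens).reshape(d, h, w, c)

def parallel_block(x: np.ndarray) -> np.ndarray:
    """Fuse local and global dependency information by summation."""
    return local_branch(x) + global_branch(x)

x = np.random.default_rng(0).standard_normal((4, 4, 4, 8))
y = parallel_block(x)
print(y.shape)  # spatial resolution and channel count are preserved
```

The point of the parallel layout is that both branches see the same input at every stage, so even the earliest layers receive local (convolutional) and global (attention) signals simultaneously, rather than relying on depth for one of the two.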

Results

On the Brats2021 validation dataset, our proposed model achieves Dice coefficients of 0.840, 0.874, and 0.911 on the ET, TC, and WT channels, respectively. On the Brats2018 validation dataset, it achieves Dice coefficients of 0.716, 0.761, and 0.874 on the corresponding channels.

Conclusion

We propose a new segmentation model that combines the advantages of the Vision Transformer and convolution, achieving a better balance between the number of model parameters and segmentation accuracy. The code is available at https://github.com/1152545264/SwinUnet3D.
Metadata
Publication date: 01-12-2023
Publisher: BioMed Central
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/s12911-023-02129-z
