Skip to main content
Top
Published in: BMC Medical Informatics and Decision Making 1/2024

Open Access 01-12-2024 | Prostate Cancer | Research

Machine learning pipeline to analyze clinical and proteomics data: experiences on a prostate cancer case

Authors: Patrizia Vizza, Federica Aracri, Pietro Hiram Guzzi, Marco Gaspari, Pierangelo Veltri, Giuseppe Tradigo

Published in: BMC Medical Informatics and Decision Making | Issue 1/2024

Login to get access

Abstract

Proteomic-based analysis is used to identify biomarkers in blood samples and tissues. Data produced by devices such as mass spectrometry requires platforms to identify and quantify proteins (or peptides). Clinical information can be related to mass spectrometry data to identify diseases at an early stage. Machine learning techniques can be used to support physicians and biologists in studying and classifying pathologies. We present the application of machine learning techniques to define a pipeline aimed at studying and classifying proteomics data enriched using clinical information. The pipeline allows users to relate established blood biomarkers with clinical parameters and proteomics data. The proposed pipeline entails three main phases: (i) feature selection, (ii) models training, and (iii) models ensembling. We report the experience of applying such a pipeline to prostate-related diseases. Models have been trained on several biological datasets. We report experimental results about two datasets that result from the integration of clinical and mass spectrometry-based data in the contexts of serum and urine analysis. The pipeline receives input data from blood analytes, tissue samples, proteomic analysis, and urine biomarkers. It then trains different models for feature selection, classification and voting. The presented pipeline has been applied on two datasets obtained in a 2 years research project which aimed to extract hidden information from mass spectrometry, serum, and urine samples from hundreds of patients. We report results on analyzing prostate datasets serum with 143 samples, including 79 PCa and 84 BPH patients, and an urine dataset with 121 samples, including 67 PCa and 54 BPH patients. As results pipeline allowed to identify interesting peptides in the two datasets, 6 for the first one and 2 for the second one. The best model for both serum (AUC=0.87, Accuracy=0.83, F1=0.81, Sensitivity=0.84, Specificity=0.81) and urine (AUC=0.88, Accuracy=0.83, F1=0.83, Sensitivity=0.85, Specificity=0.80) datasets showed good predictive performances. We made the pipeline code available on GitHub and we are confident that it will be successfully adopted in similar clinical setups.
Literature
1.
go back to reference Zhou X, Mao J, Ai J, Deng Y, Roth MR, Pound C, et al. Identification of plasma lipid biomarkers for prostate cancer by lipidomics and bioinformatics. PLoS ONE. 2012;7:e48889.CrossRefPubMedPubMedCentral Zhou X, Mao J, Ai J, Deng Y, Roth MR, Pound C, et al. Identification of plasma lipid biomarkers for prostate cancer by lipidomics and bioinformatics. PLoS ONE. 2012;7:e48889.CrossRefPubMedPubMedCentral
2.
go back to reference Vizza P, Pascuzzi L, Aracri F, Tavolaro E, Lambardi P, Gaspari M, et al. Prostate Cancer Disease Study by Integrating Peptides and Clinical Data. In: AAI4H@ ECAI. Amsterdam: IOS Press; 2020. p. 45–48. Vizza P, Pascuzzi L, Aracri F, Tavolaro E, Lambardi P, Gaspari M, et al. Prostate Cancer Disease Study by Integrating Peptides and Clinical Data. In: AAI4H@ ECAI. Amsterdam: IOS Press; 2020. p. 45–48.
3.
4.
go back to reference Pierre-Victor D, Parnes HL, Andriole GL, Pinsky PF. Prostate cancer incidence and mortality following a negative biopsy in a population undergoing PSA screening. Urology. 2021;155:62–9.CrossRefPubMed Pierre-Victor D, Parnes HL, Andriole GL, Pinsky PF. Prostate cancer incidence and mortality following a negative biopsy in a population undergoing PSA screening. Urology. 2021;155:62–9.CrossRefPubMed
5.
go back to reference White CN, Chan DW, Zhang Z. Bioinformatics strategies for proteomic profiling. Clin Biochem. 2004;37(7):636–41.CrossRefPubMed White CN, Chan DW, Zhang Z. Bioinformatics strategies for proteomic profiling. Clin Biochem. 2004;37(7):636–41.CrossRefPubMed
6.
go back to reference Petricoin EF III, Ornstein DK, Paweletz CP, Ardekani A, Hackett PS, Hitt BA, et al. Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst. 2002;94(20):1576–8.CrossRefPubMed Petricoin EF III, Ornstein DK, Paweletz CP, Ardekani A, Hackett PS, Hitt BA, et al. Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst. 2002;94(20):1576–8.CrossRefPubMed
7.
go back to reference Garg A, Mago V. Role of machine learning in medical research: a survey. Comput Sci Rev. 2021;40:100370.CrossRef Garg A, Mago V. Role of machine learning in medical research: a survey. Comput Sci Rev. 2021;40:100370.CrossRef
8.
go back to reference Mahmud M, Kaiser MS, McGinnity TM, Hussain A. Deep learning in mining biological data. Cogn Comput. 2021;13(1):1–33.CrossRef Mahmud M, Kaiser MS, McGinnity TM, Hussain A. Deep learning in mining biological data. Cogn Comput. 2021;13(1):1–33.CrossRef
9.
go back to reference Li Y, Wu FX, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinform. 2018;19(2):325–40.PubMed Li Y, Wu FX, Ngom A. A review on machine learning principles for multi-view biological data integration. Brief Bioinform. 2018;19(2):325–40.PubMed
10.
go back to reference Khalsan M, Machado LR, Al-Shamery ES, Ajit S, Anthony K, Mu M, et al. A survey of machine learning approaches applied to gene expression analysis for cancer prediction. IEEE Access. 2022;10:27522–34.CrossRef Khalsan M, Machado LR, Al-Shamery ES, Ajit S, Anthony K, Mu M, et al. A survey of machine learning approaches applied to gene expression analysis for cancer prediction. IEEE Access. 2022;10:27522–34.CrossRef
11.
go back to reference Fan Z, Kong F, Zhou Y, Chen Y, Dai Y. Intelligence algorithms for protein classification by mass spectrometry. BioMed Res Int. 2018;2018. Fan Z, Kong F, Zhou Y, Chen Y, Dai Y. Intelligence algorithms for protein classification by mass spectrometry. BioMed Res Int. 2018;2018.
12.
go back to reference Taskin V, Dogan B, Ölmez T. Prostate cancer classification from mass spectrometry data by using wavelet analysis and Kernel Partial Least Squares Algorithm. Int J Biosci Biochem Bioinforma. 2013;3(2):98. Taskin V, Dogan B, Ölmez T. Prostate cancer classification from mass spectrometry data by using wavelet analysis and Kernel Partial Least Squares Algorithm. Int J Biosci Biochem Bioinforma. 2013;3(2):98.
14.
go back to reference Datta S, Pihur V. Feature selection and machine learning with mass spectrometry data. Bioinforma Methods Clin Res. 2010;593:205–29. Datta S, Pihur V. Feature selection and machine learning with mass spectrometry data. Bioinforma Methods Clin Res. 2010;593:205–29.
15.
go back to reference Khoo A, Liu LY, Nyalwidhe JO, Semmes OJ, Vesprini D, Downes MR, et al. Proteomic discovery of non-invasive biomarkers of localized prostate cancer using mass spectrometry. Nat Rev Urol. 2021;18(12):707–24.CrossRefPubMedPubMedCentral Khoo A, Liu LY, Nyalwidhe JO, Semmes OJ, Vesprini D, Downes MR, et al. Proteomic discovery of non-invasive biomarkers of localized prostate cancer using mass spectrometry. Nat Rev Urol. 2021;18(12):707–24.CrossRefPubMedPubMedCentral
16.
go back to reference Palopoli L, Rombo SE, Terracina G, Tradigo G, Veltri P. Improving protein secondary structure predictions by prediction fusion. Inf Fusion. 2009;10(3):217–32.CrossRef Palopoli L, Rombo SE, Terracina G, Tradigo G, Veltri P. Improving protein secondary structure predictions by prediction fusion. Inf Fusion. 2009;10(3):217–32.CrossRef
17.
go back to reference Theriault RL, Kaufmann M, Ren KY, Varma S, Ellis RE. Metabolomics patterns of breast cancer tumors using mass spectrometry imaging. Int J CARS. 2021;16(7):1089–99.CrossRef Theriault RL, Kaufmann M, Ren KY, Varma S, Ellis RE. Metabolomics patterns of breast cancer tumors using mass spectrometry imaging. Int J CARS. 2021;16(7):1089–99.CrossRef
18.
go back to reference Roseiro M, Henriques J, Paredes S, Rocha T, Sousa J. An interpretable machine learning approach to estimate the influence of inflammation biomarkers on cardiovascular risk assessment. Comput Methods Prog Biomed. 2023;230:107347. Roseiro M, Henriques J, Paredes S, Rocha T, Sousa J. An interpretable machine learning approach to estimate the influence of inflammation biomarkers on cardiovascular risk assessment. Comput Methods Prog Biomed. 2023;230:107347.
19.
go back to reference Battista A, Battista RA, Battista F, Iovane G, Landi RE. BH-index: a predictive system based on serum biomarkers and ensemble learning for early colorectal cancer diagnosis in mass screening. Comput Methods Prog Biomed. 2021;212:106494.CrossRef Battista A, Battista RA, Battista F, Iovane G, Landi RE. BH-index: a predictive system based on serum biomarkers and ensemble learning for early colorectal cancer diagnosis in mass screening. Comput Methods Prog Biomed. 2021;212:106494.CrossRef
21.
go back to reference Taghizadeh E, Heydarheydari S, Saberi A, JafarpoorNesheli S, Rezaeijo SM. Breast cancer prediction with transcriptome profiling using feature selection and machine learning methods. BMC Bioinformatics. 2022;23(1):1–9.CrossRef Taghizadeh E, Heydarheydari S, Saberi A, JafarpoorNesheli S, Rezaeijo SM. Breast cancer prediction with transcriptome profiling using feature selection and machine learning methods. BMC Bioinformatics. 2022;23(1):1–9.CrossRef
22.
go back to reference Botlagunta M, Botlagunta MD, Myneni MB, Lakshmi D, Nayyar A, Gullapalli JS, et al. Classification and diagnostic prediction of breast cancer metastasis on clinical data using machine learning algorithms. Sci Rep. 2023;13(1):485.CrossRefPubMedPubMedCentral Botlagunta M, Botlagunta MD, Myneni MB, Lakshmi D, Nayyar A, Gullapalli JS, et al. Classification and diagnostic prediction of breast cancer metastasis on clinical data using machine learning algorithms. Sci Rep. 2023;13(1):485.CrossRefPubMedPubMedCentral
23.
go back to reference Kopitar L, Kocbek P, Cilar L, Sheikh A, Stiglic G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci Rep. 2020;10(1):1–12.CrossRef Kopitar L, Kocbek P, Cilar L, Sheikh A, Stiglic G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci Rep. 2020;10(1):1–12.CrossRef
24.
go back to reference Srivastava S, Soman S, Rai A, Srivastava PK. Deep learning for health informatics: recent trends and future directions. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE; 2017. p. 1665–1670. Srivastava S, Soman S, Rai A, Srivastava PK. Deep learning for health informatics: recent trends and future directions. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE; 2017. p. 1665–1670.
25.
go back to reference Callahan A, Shah NH. Machine learning in healthcare. In: Key Advances in Clinical Informatics. Elsevier; 2017. p. 279–291. Callahan A, Shah NH. Machine learning in healthcare. In: Key Advances in Clinical Informatics. Elsevier; 2017. p. 279–291.
26.
go back to reference Paul TK, Iba H. Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM Trans Comput Biol Bioinforma. 2008;6(2):353–67.CrossRef Paul TK, Iba H. Prediction of cancer class with majority voting genetic programming classifier using gene expression data. IEEE/ACM Trans Comput Biol Bioinforma. 2008;6(2):353–67.CrossRef
27.
go back to reference Prestagiacomo L, Tradigo G, Aracri F, Gabriele C, Rota MA, Alba S, et al. Data-Independent Acquisition Mass Spectrometry of EPS-urine coupled to Machine Learning: a predictive model for prostate cancer. ACS Omega; 2023. Prestagiacomo L, Tradigo G, Aracri F, Gabriele C, Rota MA, Alba S, et al. Data-Independent Acquisition Mass Spectrometry of EPS-urine coupled to Machine Learning: a predictive model for prostate cancer. ACS Omega; 2023.
28.
go back to reference Gabriele C, Aracri F, Prestagiacomo LE, Rota MA, Alba S, Tradigo G, et al. Development of a predictive model to distinguish prostate cancer from benign prostatic hyperplasia by integrating serum glycoproteomics and clinical variables. Clin Proteomics. 2023;20(1):52.CrossRefPubMedPubMedCentral Gabriele C, Aracri F, Prestagiacomo LE, Rota MA, Alba S, Tradigo G, et al. Development of a predictive model to distinguish prostate cancer from benign prostatic hyperplasia by integrating serum glycoproteomics and clinical variables. Clin Proteomics. 2023;20(1):52.CrossRefPubMedPubMedCentral
29.
go back to reference Beg M, Taka J, Kluyver T, Konovalov A, Ragan-Kelley M, Thiéry NM, et al. Using Jupyter for reproducible scientific workflows. Comput Sci Eng. 2021;23(2):36–46.CrossRef Beg M, Taka J, Kluyver T, Konovalov A, Ragan-Kelley M, Thiéry NM, et al. Using Jupyter for reproducible scientific workflows. Comput Sci Eng. 2021;23(2):36–46.CrossRef
30.
go back to reference Mukaka MM. A guide to appropriate use of correlation coefficient in medical research. Malawi Med J. 2012;24(3):69–71.PubMedPubMedCentral Mukaka MM. A guide to appropriate use of correlation coefficient in medical research. Malawi Med J. 2012;24(3):69–71.PubMedPubMedCentral
31.
go back to reference Tallarida RJ, Murray RB. Chi-square test. In: Manual of pharmacologic calculations. Springer; 1987. p. 140–142. Tallarida RJ, Murray RB. Chi-square test. In: Manual of pharmacologic calculations. Springer; 1987. p. 140–142.
32.
go back to reference Vanjimalar S, Ramyachitra D, Manikandan P. A review on feature selection techniques for gene expression data. In: 2018 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC). IEEE; 2018. p. 1–4. Vanjimalar S, Ramyachitra D, Manikandan P. A review on feature selection techniques for gene expression data. In: 2018 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC). IEEE; 2018. p. 1–4.
33.
go back to reference Speiser JL, Miller ME, Tooze J, Ip E. A comparison of random forest variable selection methods for classification prediction modeling. Expert Syst Appl. 2019;134:93–101.CrossRefPubMedPubMedCentral Speiser JL, Miller ME, Tooze J, Ip E. A comparison of random forest variable selection methods for classification prediction modeling. Expert Syst Appl. 2019;134:93–101.CrossRefPubMedPubMedCentral
34.
go back to reference Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.CrossRefPubMed Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.CrossRefPubMed
35.
go back to reference Huang HC, Zheng S, Zhao Z. Application of Pearson correlation coefficient (PCC) and Kolmogorov-Smirnov distance (KSD) metrics to identify disease-specific biomarker genes. BMC Bioinformatics. 2010;11:P23.CrossRefPubMedCentral Huang HC, Zheng S, Zhao Z. Application of Pearson correlation coefficient (PCC) and Kolmogorov-Smirnov distance (KSD) metrics to identify disease-specific biomarker genes. BMC Bioinformatics. 2010;11:P23.CrossRefPubMedCentral
36.
go back to reference Wang L, Jiang Z, Sui M, Shen J, Xu C, Fan W. The potential biomarkers in predicting pathologic response of breast cancer to three different chemotherapy regimens: a case control study. BMC Cancer. 2009;9:226.CrossRefPubMedPubMedCentral Wang L, Jiang Z, Sui M, Shen J, Xu C, Fan W. The potential biomarkers in predicting pathologic response of breast cancer to three different chemotherapy regimens: a case control study. BMC Cancer. 2009;9:226.CrossRefPubMedPubMedCentral
37.
go back to reference Lv Y, Wang Y, Tan Y, Du W, Liu K, Wang H. Pancreatic cancer biomarker detection using recursive feature elimination based on Support Vector Machine and large margin distribution machine. 4th International Conference on Systems and Informatics (ICSAI). New York: IEEE; 2017. p. 1450–1455. Lv Y, Wang Y, Tan Y, Du W, Liu K, Wang H. Pancreatic cancer biomarker detection using recursive feature elimination based on Support Vector Machine and large margin distribution machine. 4th International Conference on Systems and Informatics (ICSAI). New York: IEEE; 2017. p. 1450–1455.
38.
go back to reference Ram M, Najafi A, Shakeri MT. Classification and biomarker genes selection for cancer gene expression data using random forest. Iran J Pathol. 2017;12:339.CrossRefPubMedPubMedCentral Ram M, Najafi A, Shakeri MT. Classification and biomarker genes selection for cancer gene expression data using random forest. Iran J Pathol. 2017;12:339.CrossRefPubMedPubMedCentral
39.
go back to reference Aggarwal CC, et al. Data mining: the textbook, vol 1. Springer; 2015. Aggarwal CC, et al. Data mining: the textbook, vol 1. Springer; 2015.
41.
go back to reference Cannataro M, Guzzi PH, Mazza T, Tradigo G, Veltri P. Using ontologies for preprocessing and mining spectra data on the Grid. Futur Gener Comput Syst. 2007;23(1):55–60.CrossRef Cannataro M, Guzzi PH, Mazza T, Tradigo G, Veltri P. Using ontologies for preprocessing and mining spectra data on the Grid. Futur Gener Comput Syst. 2007;23(1):55–60.CrossRef
42.
go back to reference Din S, Paul A, Guizani N, Ahmed SH, Khan M, Rathore MM. Features selection model for internet of e-health things using big data. In: GLOBECOM 2017-2017 IEEE Global Communications Conference. IEEE; 2017. p. 1–7. Din S, Paul A, Guizani N, Ahmed SH, Khan M, Rathore MM. Features selection model for internet of e-health things using big data. In: GLOBECOM 2017-2017 IEEE Global Communications Conference. IEEE; 2017. p. 1–7.
43.
go back to reference Naheed N, Shaheen M, Khan SA, Alawairdhi M, Khan MA. Importance of features selection, attributes selection, challenges and future directions for medical imaging data: a review. Comput Model Eng Sci. 2020;125(1):314–44. Naheed N, Shaheen M, Khan SA, Alawairdhi M, Khan MA. Importance of features selection, attributes selection, challenges and future directions for medical imaging data: a review. Comput Model Eng Sci. 2020;125(1):314–44.
44.
go back to reference Goh WWB, Wong L. Advanced bioinformatics methods for practical applications in proteomics. Brief Bioinform. 2019;20(1):347–55.CrossRefPubMed Goh WWB, Wong L. Advanced bioinformatics methods for practical applications in proteomics. Brief Bioinform. 2019;20(1):347–55.CrossRefPubMed
45.
go back to reference Gallo Cantafio ME, Grillone K, Caracciolo D, Scionti F, Arbitrio M, Barbieri V, et al. From single level analysis to multi-omics integrative approaches: a powerful strategy towards the precision oncology. High-throughput. 2018;7(4):33.CrossRefPubMedPubMedCentral Gallo Cantafio ME, Grillone K, Caracciolo D, Scionti F, Arbitrio M, Barbieri V, et al. From single level analysis to multi-omics integrative approaches: a powerful strategy towards the precision oncology. High-throughput. 2018;7(4):33.CrossRefPubMedPubMedCentral
46.
go back to reference Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40(1):16–28.CrossRef Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40(1):16–28.CrossRef
47.
go back to reference Malm EK, Srivastava V, Sundqvist G, Bulone V. APP: an Automated Proteomics Pipeline for the analysis of mass spectrometry data based on multiple open access tools. BMC Bioinformatics. 2014;15:1–8.CrossRef Malm EK, Srivastava V, Sundqvist G, Bulone V. APP: an Automated Proteomics Pipeline for the analysis of mass spectrometry data based on multiple open access tools. BMC Bioinformatics. 2014;15:1–8.CrossRef
48.
go back to reference Weber SR, Zhao Y, Ma J, Gates C, da Veiga Leprevost F, Basrur V, et al. A validated analysis pipeline for mass spectrometry-based vitreous proteomics: new insights into proliferative diabetic retinopathy. Clin Proteomics. 2021;18:1–27.CrossRef Weber SR, Zhao Y, Ma J, Gates C, da Veiga Leprevost F, Basrur V, et al. A validated analysis pipeline for mass spectrometry-based vitreous proteomics: new insights into proliferative diabetic retinopathy. Clin Proteomics. 2021;18:1–27.CrossRef
49.
go back to reference Bichmann L, Gupta S, Rosenberger G, Kuchenbecker L, Sachsenberg T, Ewels P, et al. DIAproteomics: a multifunctional data analysis pipeline for data-independent acquisition proteomics and peptidomics. J Proteome Res. 2021;20(7):3758–66.CrossRefPubMed Bichmann L, Gupta S, Rosenberger G, Kuchenbecker L, Sachsenberg T, Ewels P, et al. DIAproteomics: a multifunctional data analysis pipeline for data-independent acquisition proteomics and peptidomics. J Proteome Res. 2021;20(7):3758–66.CrossRefPubMed
50.
go back to reference Keller A, Shteynberg D. Software pipeline and data analysis for MS/MS proteomics: the trans-proteomic pipeline. Bioinforma Comp Proteomics. 2011;694:169–89. Keller A, Shteynberg D. Software pipeline and data analysis for MS/MS proteomics: the trans-proteomic pipeline. Bioinforma Comp Proteomics. 2011;694:169–89.
51.
go back to reference Liang D, Liu Q, Zhou K, Jia W, Xie G, Chen T. IP4M: an integrated platform for mass spectrometry-based metabolomics data mining. BMC Bioinformatics. 2020;21(1):1–16.CrossRef Liang D, Liu Q, Zhou K, Jia W, Xie G, Chen T. IP4M: an integrated platform for mass spectrometry-based metabolomics data mining. BMC Bioinformatics. 2020;21(1):1–16.CrossRef
Metadata
Title
Machine learning pipeline to analyze clinical and proteomics data: experiences on a prostate cancer case
Authors
Patrizia Vizza
Federica Aracri
Pietro Hiram Guzzi
Marco Gaspari
Pierangelo Veltri
Giuseppe Tradigo
Publication date
01-12-2024
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2024
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-024-02491-6

Other articles of this Issue 1/2024

BMC Medical Informatics and Decision Making 1/2024 Go to the issue