Published in: European Journal of Epidemiology 2/2021

Open Access 01-02-2021 | METHODS

Data extraction for epidemiological research (DExtER): a novel tool for automated clinical epidemiology studies

Authors: Krishna Margadhamane Gokhale, Joht Singh Chandan, Konstantinos Toulis, Georgios Gkoutos, Peter Tino, Krishnarajah Nirantharakumar

Abstract

Primary care electronic health records are widely used for research. The benefits of such records lie in their size, longitudinal data collection and data quality. However, using such data to undertake high-quality epidemiological studies can pose significant challenges, particularly in dealing with misclassification, variation in coding and the significant effort required to pre-process the data into a meaningful format for statistical analysis. In this paper, we describe a methodology to aid the extraction and processing of such databases, delivered by a novel software programme: “Data extraction for epidemiological research” (DExtER). DExtER is based on the principles of extract, transform and load (ETL) processes. The tool first allows the healthcare dataset to be extracted and then transformed into a format in which the data are normalised, converted and reformatted. DExtER has a user interface designed to obtain data extracts specific to each research question and observational study design. There are facilities to input the requirements for the eligible study period, the definition of exposed and unexposed groups, outcome measures and important baseline covariates. To date, the tool has been utilised and validated in a multitude of settings. More than 35 peer-reviewed publications have used the tool, and DExtER has been implemented as a validated public health surveillance tool for obtaining accurate statistics on the epidemiology of key morbidities. Future work will apply the framework to linked and international datasets and develop standardised methods for electronic pre-processing and extraction from datasets for research purposes.
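To make the extract, transform and load idea concrete, the following is a minimal sketch of how a study specification (eligible study period, exposure, outcome and baseline covariate definitions) might drive the reshaping of raw coded events into one analysis-ready row per patient. It is not DExtER's implementation or interface: `StudySpec`, `transform`, `build_cohort`, the field names and the clinical codes are all hypothetical, and the covariate handling is deliberately simplified.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StudySpec:
    """Hypothetical study specification; names are illustrative only."""
    study_start: date
    study_end: date
    exposure_codes: frozenset
    outcome_codes: frozenset
    covariate_codes: frozenset


def transform(raw_events):
    """Normalise raw rows: trim and uppercase codes, parse ISO dates."""
    return [
        {
            "patient_id": r["patient_id"],
            "code": r["code"].strip().upper(),
            "date": date.fromisoformat(r["date"]),
        }
        for r in raw_events
    ]


def build_cohort(events, spec):
    """Return one row per patient with any event inside the study period."""
    rows = {}
    for e in sorted(events, key=lambda ev: ev["date"]):
        # Ignore events outside the eligible study period.
        if not (spec.study_start <= e["date"] <= spec.study_end):
            continue
        row = rows.setdefault(e["patient_id"], {
            "patient_id": e["patient_id"],
            "exposed": False,
            "index_date": None,
            "outcome_date": None,
            "covariates": set(),
        })
        if e["code"] in spec.exposure_codes and not row["exposed"]:
            # First exposure event defines the index date.
            row["exposed"] = True
            row["index_date"] = e["date"]
        elif e["code"] in spec.outcome_codes and row["outcome_date"] is None:
            row["outcome_date"] = e["date"]
        elif e["code"] in spec.covariate_codes:
            # Simplification: a real study would restrict covariates
            # to records dated before the index date.
            row["covariates"].add(e["code"])
    return list(rows.values())


# Made-up example records, not real patient data.
raw = [
    {"patient_id": 1, "code": " t1dm ", "date": "2010-03-01"},
    {"patient_id": 1, "code": "EPILEPSY", "date": "2012-06-15"},
    {"patient_id": 2, "code": "smoker", "date": "2011-01-10"},
    {"patient_id": 3, "code": "T1DM", "date": "1999-05-05"},
]
spec = StudySpec(
    study_start=date(2000, 1, 1),
    study_end=date(2015, 12, 31),
    exposure_codes=frozenset({"T1DM"}),
    outcome_codes=frozenset({"EPILEPSY"}),
    covariate_codes=frozenset({"SMOKER"}),
)
cohort = build_cohort(transform(raw), spec)
```

In this sketch, patient 3 is dropped because the only record falls before the study period, patient 1 becomes an exposed patient with an outcome date, and patient 2 is an unexposed patient carrying one covariate. Keeping the specification separate from the extraction logic is one way the same pipeline could serve different research questions.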
Metadata
Title
Data extraction for epidemiological research (DExtER): a novel tool for automated clinical epidemiology studies
Authors
Krishna Margadhamane Gokhale
Joht Singh Chandan
Konstantinos Toulis
Georgios Gkoutos
Peter Tino
Krishnarajah Nirantharakumar
Publication date
01-02-2021
Publisher
Springer Netherlands
Published in
European Journal of Epidemiology / Issue 2/2021
Print ISSN: 0393-2990
Electronic ISSN: 1573-7284
DOI
https://doi.org/10.1007/s10654-020-00677-6
