Top

BMC Medical Informatics and Decision Making

Published in:

Open Access 01-12-2017 | Research article

Automatic identification of variables in epidemiological datasets using logic regression

Authors: Matthias W. Lorenz, Negin Ashtiani Abdi, Frank Scheckenbach, Anja Pflug, Alpaslan Bülbül, Alberico L. Catapano, Stefan Agewall, Marat Ezhov, Michiel L. Bots, Stefan Kiechl, Andreas Orth, on behalf of the PROG-IMT study group

Published in: BMC Medical Informatics and Decision Making | Issue 1/2017

Abstract

Background

For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed in a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation high sensitivity in the recognition of matching variables is particularly important, because it allows creating software which for a target variable presents a choice of source variables, from which a user can choose the matching one, with only low risk of having missed a correct source variable.

Methods

For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), this optimal combination rules were validated.

Results

In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables, in the validation sample in 71% of all variables.

Conclusions

We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.

Available only for authorised users

Blettner M, Sauerbrei W, Schlehofer B, Scheuchenpflug T, Friedenreich C. Traditional reviews, meta-analyses and pooled analyses in epidemiology. Int J Epidemiol. 1999;28:1–9.CrossRefPubMed

Fortier I, Doiron D, Little J, Ferretti V, L’Heureux F, Stolk RP, Knoppers BM, Hudson TJ, Burton PR, International Harmonization Initiative. Is rigorous retrospective harmonization possible? Application of the DataSHaPER approach across 53 large studies. Int J Epidemiol. 2011;40:1314–28.CrossRefPubMedPubMedCentral

Doiron D, Burton P, Marcon Y, Gaye A, Wolffenbuttel BH, Perola M, Stolk RP, Foco L, Minelli C, Waldenberger M, Holle R, Kvaløy K, Hillege HL, Tassé AM, Ferretti V, Fortier I. Data harmonization and federated analysis of population-based studies: the BioSHaRE project. Emerg Themes Epidemiol. 2013;10:12.CrossRefPubMedPubMedCentral

Bosch-Capblanch X. Harmonisation of variables names prior to conducting statistical analyses with multiple datasets: an automated approach. BMC Med Inform Decis Mak. 2011;11:33.CrossRefPubMedPubMedCentral

Lorenz MW, Bickel H, Bots ML, Breteler MMB, Catapano AL, Desvarieux M, Hedblad B, Iglseder B, Johnsen SH, Juraska M, Kiechl S, Mathiesen EB, Norata GD, Grigore L, Polak J, Poppert H, Rosvall M, Rundek T, Sacco RL, Sander D, Sitzer M, Steinmetz H, Stensland E, Willeit J, Witteman J, Yanez D, Thompson SG, The PROG-IMT Study Group. Individual progression of carotid intima media thickness as a surrogate for vascular risk (PROG-IMT) – rationale and design of a meta-analysis project. Am Heart J. 2010;159:730–6.CrossRefPubMedPubMedCentral

Ruczinski I, Kooperberg C, LeBlanc M. Logic regression. J Comput Graphical Stat. 2003;12:475–511.CrossRef

Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005;28:157–70.CrossRefPubMed

Kooperberg C, Bis JC, Marciante KD, Heckbert SR, Lumley T, Psaty BM. Logic regression for analysis of the association between genetic variation in the renin-angiotensin system and myocardial infarction or stroke. Am J Epidemiol. 2007;165:334–43.CrossRefPubMed

Dinu I, Mahasirimongkol S, Liu Q, Yanai H, Sharaf Eldin N, Kreiter E, Wu X, Jabbari S, Tokunaga K, Yasui Y. SNP-SNP interactions discovered by logic regression explain Crohn’s disease genetics. PLoS One. 2012;7:e43035.CrossRefPubMedPubMedCentral

10.

Sarbakhsh P, Mehrabi Y, Daneshpour MS, Zayeri F, Zarkesh M. Logic regression analysis of association of gene polymorphisms with low HDL: Tehran Lipid and Glucose Study. Gene. 2013;513:278–81.CrossRefPubMed

11.

Zhi S, Li Q, Yasui Y, Edge T, Topp E, Neumann NF. Assessing host-specificity of Escherichia coli using a supervised learning logic-regression-based analysis of single nucleotide polymorphisms in intergenic regions. Mol Phylogenet Evol. 2015;92:72–81.CrossRefPubMed

12.

Janes H, Pepe M, Kooperberg C, Newcomb P. Identifying target populations for screening or not screening using logic regression. Stat Med. 2005;24:1321–38.CrossRefPubMed

13.

Riley RD, Sauerbrei W, Altman DG. Prognostic markers in cancer: the evolution of evidence from single studies to meta-analysis, and beyond. Br J Cancer. 2009;100:1219–29.CrossRefPubMedPubMedCentral

14.

Stewart LA, Clarke M, Rovers M, Riley RD, Simmonds M, Stewart G, Tierney JF, PRISMA-IPD Development Group. Preferred reporting items for systematic review and meta-analyses of individual participant data: the PRISMA-IPD statement. JAMA. 2015;313:1657–65.CrossRefPubMed

15.

Simmonds M, Stewart G, Stewart L. A decade of individual participant data meta-analyses: A review of current practice. Contemp Clin Trials. 2015 Jun 17 [Epub ahead of print].

16.

Boccia S, De Feo E, Gallì P, Gianfagna F, Amore R, Ricciardi G. A systematic review evaluating the methodological aspects of meta-analyses of genetic association studies in cancer research. Eur J Epidemiol. 2010;25:765–75.CrossRefPubMed

17.

Debray TP, Moons KG, Abo-Zaid GM, Koffijberg H, Riley RD. Individual participant data meta-analysis for a binary outcome: one-stage or two-stage? PLoS One. 2013;8:e60650.CrossRefPubMedPubMedCentral

18.

Thomas D, Radji S, Benedetti A. Systematic review of methods for individual patient data meta- analysis with binary outcomes. BMC Med Res Methodol. 2014;14:79.CrossRefPubMedPubMedCentral

19.

Ahmed I, Debray TP, Moons KG, Riley RD. Developing and validating risk prediction models in an individual participant data meta-analysis. BMC Med Res Methodol. 2014;14:3.CrossRefPubMedPubMedCentral

Title: Automatic identification of variables in epidemiological datasets using logic regression
Authors: Matthias W. Lorenz
Negin Ashtiani Abdi
Frank Scheckenbach
Anja Pflug
Alpaslan Bülbül
Alberico L. Catapano
Stefan Agewall
Marat Ezhov
Michiel L. Bots
Stefan Kiechl
Andreas Orth
on behalf of the PROG-IMT study group
Publication date: 01-12-2017
Publisher: BioMed Central
Published in: BMC Medical Informatics and Decision Making / Issue 1/2017
Electronic ISSN: 1472-6947
DOI: https://doi.org/10.1186/s12911-017-0429-1

Keynote webinar | Spotlight on sleep in brain health

Springer Medicine

Automatic identification of variables in epidemiological datasets using logic regression

Abstract

Background

Methods

Results

Conclusions

Keynote webinar | Spotlight on sleep in brain health

Springer Medicine

Abstract

Background

Methods

Results

Conclusions

Please log in to get access to this content

Other articles of this Issue 1/2017

Health care public reporting utilization – user clusters, web trails, and usage barriers on Germany’s public reporting portal Weisse-Liste.de

Using multiclass classification to automate the identification of patient safety incident reports by type and severity

Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation

Validation of an algorithm that determines stroke diagnostic code accuracy in a Japanese hospital-based cancer registry using electronic medical records

The role and benefits of accessing primary care patient records during unscheduled care: a systematic review

Economic and organizational impact of a clinical decision support system on laboratory test ordering