Published in: Journal of Cancer Education 4/2012

01-12-2012

Understanding the Limits of Large Datasets

Authors: Catherine M. Sanders, Sidney L. Saltzstein, Matthew M. Schultzel, Duy H. Nguyen, Helen Shi Stafford, Georgia Robins Sadler

Abstract

Many health professionals use large datasets to answer behavioral, translational, or clinical questions. Understanding the impact of missing data in large databases, such as disease registries, can help researchers avoid erroneous interpretations of these data. Using the California Cancer Registry, the authors selected seven common cancers, seven sociodemographic and clinical variables, and the top three reporting sources as examples of the type of data that would be deemed critical to most studies. The gender variable had no missing data; the proportion missing then rose through age (<0.1%), ethnicity (1.7%), stage (9.8%), differentiation (39.1%), and birthplace (41.1%). Reports from hospitals and clinics had the lowest percentages of missing data. Users of large datasets should anticipate the limitations imposed by missing data to prevent methodological flaws and misinterpretations of research findings. Knowing which data are missing from large datasets, and how much, can help prevent errors in research conclusions and better guide treatment modalities and public health policies and programs.
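The kind of completeness audit the abstract describes can be illustrated with a minimal sketch. It is not the authors' method: the records, field names, and reporting-source labels below are hypothetical and are not California Cancer Registry data. Assuming a missing value is stored as None, it computes the percentage missing for each variable, overall and by reporting source.

```python
from collections import defaultdict

# Hypothetical records for illustration only; not registry data.
# None marks a missing value.
RECORDS = [
    {"source": "hospital", "gender": "F", "stage": "II", "birthplace": "CA"},
    {"source": "hospital", "gender": "M", "stage": None, "birthplace": None},
    {"source": "laboratory", "gender": "F", "stage": None, "birthplace": None},
    {"source": "clinic", "gender": "M", "stage": "I", "birthplace": None},
]

VARIABLES = ["gender", "stage", "birthplace"]


def percent_missing(records, variables):
    """Return {variable: percentage of records whose value is None or absent}."""
    if not records:
        raise ValueError("cannot compute percentages for an empty record set")
    total = len(records)
    return {
        v: 100.0 * sum(1 for r in records if r.get(v) is None) / total
        for v in variables
    }


def percent_missing_by_source(records, variables, source_key="source"):
    """Group records by reporting source, then compute percent missing per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[source_key]].append(record)
    return {
        source: percent_missing(group, variables)
        for source, group in groups.items()
    }


overall = percent_missing(RECORDS, VARIABLES)
by_source = percent_missing_by_source(RECORDS, VARIABLES)

for variable, pct in overall.items():
    print(f"{variable}: {pct:.1f}% missing")
```

Running such a check before analysis shows which variables can support a study question and whether completeness differs by reporting source.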
Metadata
Publisher
Springer-Verlag
Published in
Journal of Cancer Education / Issue 4/2012
Print ISSN: 0885-8195
Electronic ISSN: 1543-0154
DOI
https://doi.org/10.1007/s13187-012-0383-7
