Published:
01-12-2012
Understanding the Limits of Large Datasets
Authors:
Catherine M. Sanders, Sidney L. Saltzstein, Matthew M. Schultzel, Duy H. Nguyen, Helen Shi Stafford, Georgia Robins Sadler
Published in:
Journal of Cancer Education, Issue 4/2012
Abstract
Many health professionals use large datasets to answer behavioral, translational, or clinical questions. Understanding the impact of missing data in large databases, such as disease registries, can help researchers avoid erroneous interpretations of these data. Using the California Cancer Registry, the authors selected seven common cancers, seven sociodemographic and clinical variables, and the top three reporting sources as examples of the type of data that would be deemed critical to most studies. Gender had no missing data, followed in order of increasing missingness by age (<0.1 % missing), ethnicity (1.7 %), stage (9.8 %), differentiation (39.1 %), and birthplace (41.1 %). Reports from hospitals and clinics had the lowest percentages of missing data. Users of large datasets should anticipate the limitations imposed by missing data to prevent methodological flaws and misinterpretations of research findings. Knowing which data are missing from a large dataset, and how much, can help prevent errors in research conclusions while better guiding treatment modalities and public health policies and programs.
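The kind of missing-data audit the abstract describes can be sketched in a few lines of Python. This is only an illustration, not the authors' method: the records are invented toy data (not California Cancer Registry data), and the field names, the `None` convention for missing values, the source label "other", and the helpers `percent_missing` and `percent_missing_by_source` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy records; NOT California Cancer Registry data.
# None marks a missing value. Field names are illustrative assumptions.
records = [
    {"source": "hospital", "gender": "F", "stage": "II", "birthplace": "CA"},
    {"source": "hospital", "gender": "M", "stage": "I", "birthplace": None},
    {"source": "other", "gender": "F", "stage": None, "birthplace": None},
    {"source": "other", "gender": "M", "stage": None, "birthplace": "TX"},
]


def percent_missing(rows, variable):
    """Return the percentage of rows in which `variable` is None or absent."""
    if not rows:
        raise ValueError("no rows to summarize")
    missing = sum(1 for row in rows if row.get(variable) is None)
    return 100.0 * missing / len(rows)


def percent_missing_by_source(rows, variable):
    """Return {reporting source: percentage of rows missing `variable`}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row)
    return {
        source: percent_missing(group, variable)
        for source, group in grouped.items()
    }


variables = ["gender", "stage", "birthplace"]

# Missingness per variable across the whole dataset.
overall = {name: percent_missing(records, name) for name in variables}

# Missingness of one variable, broken down by reporting source.
stage_by_source = percent_missing_by_source(records, "stage")

for name, pct in sorted(overall.items(), key=lambda item: item[1]):
    print(f"{name}: {pct:.1f} % missing")
```

Running a check like this before any analysis shows which variables, and which reporting sources, are complete enough to support the intended comparison.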