Published in: BMC Medical Informatics and Decision Making 1/2020

Open Access 01-12-2020 | Post-Exposure Prophylaxis | Research article

Data cleaning process for HIV-indicator data extracted from DHIS2 national reporting system: a case study of Kenya

Authors: Milka Bochere Gesicho, Martin Chieng Were, Ankica Babic

Abstract

Background

The District Health Information Software-2 (DHIS2) is widely used by countries for national-level aggregate reporting of health data. To best leverage DHIS2 data for decision-making, countries need to ensure that data within their systems are of the highest quality. Comprehensive, systematic, and transparent data cleaning approaches form a core component of preparing DHIS2 data for analyses. Unfortunately, there is a paucity of exhaustive and systematic descriptions of data cleaning processes employed on DHIS2-based data. The aim of this study was to report on the methods and results of a systematic and replicable data cleaning approach applied to HIV data gathered within DHIS2 from 2011 to 2018 in Kenya, for secondary analyses.

Methods

Six programmatic area reports containing HIV-indicators were extracted from DHIS2 for all care facilities in all counties in Kenya from 2011 to 2018. Data variables extracted included reporting rate, reporting timeliness, and HIV-indicator data elements per facility per year. In total, 93,179 facility-records from 11,446 health facilities were extracted for the years 2011 to 2018. Van den Broeck et al.’s framework, involving repeated cycles of a three-phase process (data screening, data diagnosis and data treatment), was employed semi-automatically within a generic five-step data-cleaning sequence, which was developed and applied in cleaning the extracted data. Various quality issues were identified, and a Friedman analysis of variance was conducted to examine differences in the distribution of records with selected issues across the eight years.
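
As an illustration of the Friedman analysis of variance mentioned above, the following minimal sketch computes the Friedman chi-square statistic in plain Python. The layout of the input is an assumption, not taken from the study: each row is treated as one block (for example, one programmatic area) and each column as one condition (for example, one year), and the numbers in the usage example are hypothetical. The sketch omits the tie correction and the p-value; the statistic would be compared against a chi-square distribution with k − 1 degrees of freedom, where k is the number of columns.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(blocks: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic (no tie correction).

    blocks: one row per block, one column per condition (assumed layout).
    """
    n = len(blocks)       # number of blocks
    k = len(blocks[0])    # number of conditions being compared
    rank_sums = [0.0] * k
    for row in blocks:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical counts of flagged records: 4 blocks (rows) by 3 conditions (columns).
toy_counts = [
    [1, 2, 3],
    [2, 5, 9],
    [0, 4, 6],
    [3, 7, 8],
]
print(friedman_statistic(toy_counts))  # 8.0, the maximum n * (k - 1) for perfectly consistent rankings
```

Because every row ranks the three columns in the same order, the toy example reaches the largest value the statistic can take for this table size; identical rank sums across columns would give 0.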

Results

Facility-records with no data accounted for 50.23% of records and were removed. Of the remaining records, 0.03% had reporting rates over 100%. Of facility-records with reporting data, 0.66% and 0.46% were retained for the voluntary medical male circumcision and blood safety programmatic area reports respectively, given that few facilities submitted data or offered these services. The distribution of facility-records with selected quality issues varied significantly by programmatic area (p < 0.001). The final clean dataset was suitable for subsequent secondary analyses.
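
The two checks reported above, removing facility-records with no data and identifying reporting rates over 100%, can be sketched as one screening and treatment pass. This is a minimal illustration rather than the authors' procedure: the `FacilityRecord` fields, the flag names, and the decision to set aside over-100% records for manual review are all assumptions made for the example, and the records in the usage lines are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FacilityRecord:
    # Hypothetical structure: one record per facility per year.
    facility_id: str
    year: int
    reporting_rate: Optional[float]  # percent; None when nothing was reported
    indicators: Dict[str, Optional[float]] = field(default_factory=dict)


def is_empty(rec: FacilityRecord) -> bool:
    """A record has no data when the reporting rate and all indicators are missing."""
    return rec.reporting_rate is None and all(v is None for v in rec.indicators.values())


def screen(records: List[FacilityRecord]) -> Dict[int, str]:
    """Screening phase: map record index to the quality issue found, if any."""
    flags: Dict[int, str] = {}
    for i, rec in enumerate(records):
        if is_empty(rec):
            flags[i] = "no_data"
        elif rec.reporting_rate is not None and rec.reporting_rate > 100:
            flags[i] = "rate_over_100"
    return flags


def treat(
    records: List[FacilityRecord], flags: Dict[int, str]
) -> Tuple[List[FacilityRecord], List[FacilityRecord]]:
    """Treatment phase: drop empty records; list over-100% records for manual diagnosis."""
    kept = [rec for i, rec in enumerate(records) if flags.get(i) != "no_data"]
    review = [records[i] for i, flag in flags.items() if flag == "rate_over_100"]
    return kept, review


# Hypothetical facility-records.
records = [
    FacilityRecord("F1", 2011, None, {"tested": None}),
    FacilityRecord("F2", 2011, 120.0, {"tested": 10.0}),
    FacilityRecord("F3", 2011, 95.0, {"tested": 4.0}),
]
kept, review = treat(records, screen(records))
print([r.facility_id for r in kept])    # ['F2', 'F3']
print([r.facility_id for r in review])  # ['F2']
```

Keeping the flagged records in a separate review list, instead of editing them automatically, mirrors the semi-automatic character of the process: the code finds candidates, and a person decides how to treat them.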

Conclusions

Comprehensive, systematic, and transparent reporting of the cleaning process is important for the validity of research studies as well as for data utilization. The semi-automatic procedures used resulted in improved data quality for use in secondary analyses, which could not be secured by automated procedures alone.
Literature
1. Hotchkiss DR, Diana ML, Foreit KGF. How can routine health information systems improve health systems functioning in low- and middle-income countries? Assessing the evidence base. Adv Health Care Manag. 2012;12:25–58.
2. De Lay PR. Nicole Massoud DLR, Carae KAS and M. Strategic information for HIV programmes. In: The HIV pandemic: local and Global Implications. Oxford Scholarship Online; 2007. p. 146.
3. Beck EJ, Mays N, Whiteside A, Zuniga JM. The HIV Pandemic: Local and Global Implications. Oxford: Oxford University Press; 2009. p. 1–840.
4. Granich R, Gupta S, Hall I, Aberle-Grasse J, Hader S, Mermin J. Status and methodology of publicly available national HIV care continua and 90-90-90 targets: a systematic review. PLoS Med. 2017;14:e1002253.
5. Peersman G, Rugg D, Erkkola T, Kirwango E, Yang J. Are the investments in monitoring and evaluation systems paying off? Jaids. 2009;52(Suppl 2):87–96.
6. Kariuki JM, Manders E-J, Richards J, Oluoch T, Kimanga D, Wanyee S, et al. Automating indicator data reporting from health facility EMR to a national aggregate data system in Kenya: an Interoperability field-test using OpenMRS and DHIS2. Online J Public Health Inform. 2016;8:e188.
7. Karuri J, Waiganjo P, Orwa D, Manya A. DHIS2: the tool to improve health data demand and use in Kenya. J Health Inform Dev Ctries. 2014;8:38–60.
8. Dehnavieh R, Haghdoost AA, Khosravi A, Hoseinabadi F, Rahimi H, Poursheikhali A, et al. The District Health Information System (DHIS2): a literature review and meta-synthesis of its strengths and operational challenges based on the experiences of 11 countries. Health Inf Manag. 2019;48:62–75.
9. Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLOS Med. 2015;12:e1001885.
10. Dziadkowiec O, Callahan T, Ozkaynak M, Reeder B, Welton J. Using a data quality framework to clean data extracted from the electronic health record: a case study. eGEMs. 2016;4(1):11.
13. Maïga A, Jiwani SS, Mutua MK, Porth TA, Taylor CM, Asiki G, et al. Generating statistics from health facility data: the state of routine health information systems in Eastern and Southern Africa. BMJ Global Health. 2019;4:e001849.
14. Gloyd S, Wagenaar BH, Woelk GB, Kalibala S. Opportunities and challenges in conducting secondary analysis of HIV programmes using data from routine health information systems and personal health information. J Int AIDS Soc. 2016;19(Suppl 4):1–6.
15. Fan W, Geerts F. Foundations of data quality management. Synth Lect Data Manag. 2012;4:1–217.
16. Githinji S, Oyando R, Malinga J, Ejersa W, Soti D, Rono J, et al. Completeness of malaria indicator data reporting via the District Health Information Software 2 in Kenya, 2011–2015. BMC Malar J. 2017;16:1–11.
17. Wilhelm JA, Qiu M, Paina L, Colantuoni E, Mukuru M, Ssengooba F, et al. The impact of PEPFAR transition on HIV service delivery at health facilities in Uganda. PLoS ONE. 2019;14:e0223426.
18. Maina JK, Macharia PM, Ouma PO, Snow RW, Okiro EA. Coverage of routine reporting on malaria parasitological testing in Kenya, 2015–2016. Glob Health Action. 2017;10:1413266.
19. Thawer SG, Chacky F, Runge M, Reaves E, Mandike R, Lazaro S, et al. Sub-national stratification of malaria risk in mainland Tanzania: a simplified assembly of survey and routine data. Malar J. 2020;19:177.
20. Shikuku DN, Muganda M, Amunga SO, Obwanda EO, Muga A, Matete T, et al. Door-to-door immunization strategy for improving access and utilization of immunization services in hard-to-reach areas: a case of Migori County, Kenya. BMC Public Health. 2019;19:1–11.
21. Van Den Broeck J, Cunningham SA, Eeckels R, Herbst K. Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med. 2005;2:966–70.
22. Leahey E, Entwisle B, Einaudi P. Diversity in everyday research practice: the case of data editing. Sociol Methods Res. 2003;32:64–89.
23. Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst. 1996;12:5–33.
24. Langouri MA, Zheng Z, Chiang F, Golab L, Szlichta J. Contextual data cleaning. In 2018 IEEE 34th INTERNATIONAL CONFERENCE DATA ENGINEERING Work. 2018. p. 21–4.
25. Strong DM, Lee YW, Wang RY. Data quality in context. Commun ACM. 1997;40:103–10.
26. Bertossi L, Rizzolo F, Jiang L. Data quality is context dependent. In Lecture notes in business information processing. 2011. p. 52–67.
27. Bolchini C, Curino CA, Orsi G, Quintarelli E, Rossato R, Schreiber FA, et al. And what can context do for data? Commun ACM. 2009;52:136–40.
28. Chapman AD. Principles and methods of data cleaning primary species data, 1st ed. Report for the Global Biodiversity Information Facility. GBIF; 2005.
29. Zhang S, Zhang C, Yang Q. Data preparation for data mining. Appl Artif Intell. 2003;17:375–81.
30. Fayyad U, Piatetsky-Shapiro G, Smyth P. Knowledge discovery and data mining: towards a unifying framework. 1996. 31.
31. Oliveira P, Rodrigues F, Galhardas H. A taxonomy of data quality problems. In: 2nd International work data information quality. 2005. p. 219
33. Müller H, Freytag J-C. Problems, methods, and challenges in comprehensive data cleansing challenges. Technical Report HUB-IB-164, Humboldt University, Berlin. 2003. p. 1–23.
34. Seheult AH, Green PJ, Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. J R Stat Soc Ser A Stat Soc. 1989;152:133.
35. Hellerstein JM. Quantitative data cleaning for large databases. United Nations Economics Committee Europe. 2008. 42.
36. Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64:402–6.
37. Chu X, Ilyas IF, Krishnan S, Wang J. Data cleaning: overview and emerging challenges. In: Proceedings of the ACM SIGMOD international conference on management of data. New York: ACM Press; 2016. p. 2201–6.
38. Vassiliadis P, Vagena Z, Skiadopoulos S, Karayannidis N, Sellis T. Arktos: a tool for data cleaning and transformation in data warehouse environments. IEEE Data Eng Bull. 2000;23:2000.1.109.2911
39. WHO. Data Quality Review (DQR) Toolkit. WHO. World Health Organization; 2019: who.int/healthinfo/tools_data_analysis/en/. Accessed 5 Mar 2020.
42. Shanks G, Corbitt B. Understanding data quality: social and cultural aspects. In: 10th Australasian conference on information systems. 1999. p. 785–97.
43. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20:144–51.
44. Savik K, Fan Q, Bliss D, Harms S. Preparing a large data set for analysis: using the minimum data set to study perineal dermatitis. J Adv Nurs. 2005;52(4):399–409.
45. Miao Z, Sathyanarayanan S, Fong E, Paiva W, Delen D. An assessment and cleaning framework for electronic health records data. In: Industrial and systems engineering research conference. 2018.
46. Kulkarni DK. Interpretation and display of research results. Indian J Anaesth. 2016;60:657–61.
47. Luo W, Gallagher M, Loveday B, Ballantyne S, Connor JP, Wiles J. Detecting contaminated birthdates using generalized additive models. BMC Bioinform. 2014;12(15):1–9.
48. Maina I, Wanjal P, Soti D, Kipruto H, Droti B, Boerma T. Using health-facility data to assess subnational coverage of maternal and child health indicators, Kenya. Bull World Health Organ. 2017;95(10):683–94.
49. Bhattacharya AA, Umar N, Audu A, Allen E, Schellenberg JRM, Marchant T. Quality of routine facility data for monitoring priority maternal and newborn indicators in DHIS2: a case study from Gombe State, Nigeria. PLoS ONE. 2019;14:e0211265.
Metadata
Title
Data cleaning process for HIV-indicator data extracted from DHIS2 national reporting system: a case study of Kenya
Authors
Milka Bochere Gesicho
Martin Chieng Were
Ankica Babic
Publication date
01-12-2020
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2020
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/s12911-020-01315-7