Published in: BMC Medical Informatics and Decision Making 1/2012

Open Access 01-12-2012 | Software

A user-friendly tool to transform large scale administrative data into wide table format using a MapReduce program with a Pig Latin based script

Authors: Hiromasa Horiguchi, Hideo Yasunaga, Hideki Hashimoto, Kazuhiko Ohe

Abstract

Background

Secondary use of large-scale administrative data is increasingly popular in health services and clinical research, where user-friendly tools for data management are in great demand. MapReduce technology such as Hadoop is promising for this purpose, but its use has been limited by the lack of user-friendly functions for transforming large-scale data into the wide table format, in which each subject is represented by one row, that health services and clinical research require. Because the original specification of Pig provides very few functions for column field management, we developed a novel system called GroupFilterFormat that handles the definition of fields and data content based on a Pig Latin script. We also developed, as an open-source project, several user-defined functions that transform the table format using GroupFilterFormat and handle processing that depends on date conditions.
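The wide table transformation described above can be illustrated with a minimal plain-Python sketch. It is not GroupFilterFormat or Pig Latin, and the record layout, the `to_wide` helper, the patient identifiers, and the procedure codes are all hypothetical, chosen only to show the idea: group long-format event rows by subject, filter them by a date condition, and emit one row per subject with one column per code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical long-format activity log: one row per (patient, date, code).
events = [
    ("P001", date(2010, 4, 1), "CT"),
    ("P001", date(2010, 4, 3), "MRI"),
    ("P002", date(2010, 4, 2), "CT"),
    ("P001", date(2010, 5, 20), "CT"),  # falls outside the date window below
]


def to_wide(rows, codes, start, end):
    """Return one dict per patient with a count column for each code.

    Only events dated within [start, end] are counted.
    """
    counts = defaultdict(lambda: dict.fromkeys(codes, 0))
    for patient_id, event_date, code in rows:
        row = counts[patient_id]  # every patient gets a row, even with no matches
        if start <= event_date <= end and code in row:
            row[code] += 1
    return [{"patient_id": pid, **cols} for pid, cols in sorted(counts.items())]


wide = to_wide(events, ["CT", "MRI"], date(2010, 4, 1), date(2010, 4, 30))
for record in wide:
    print(record)
```

In a MapReduce setting the grouping step would correspond to the shuffle on the subject key and the counting to the reduce phase; here both run in a single process for clarity.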

Results

We prepared dummy discharge summary data for 2.3 million inpatients and medical activity log data for 950 million events, and ran processing speed and scaling benchmarks in the Elastic Compute Cloud environment provided by Amazon Inc. In the speed benchmark test, the response time was significantly reduced, and a linear relationship was observed between the quantity of data and processing time for both a small and a very large dataset. The scaling benchmark test showed clear scalability: doubling the number of nodes resulted in a 47% decrease in processing time.
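To put the scaling figure in more familiar terms, the short calculation below converts the reported 47% decrease in processing time into a speedup factor and a parallel efficiency. The function name `scaling_summary` is hypothetical, and the efficiency definition (speedup divided by the factor by which nodes increased) is a common convention assumed here, not something the abstract states.

```python
def scaling_summary(time_decrease, node_factor=2.0):
    """Convert a fractional decrease in run time into speedup and efficiency.

    time_decrease: fraction by which processing time fell (0.47 for 47%).
    node_factor:   factor by which the number of nodes was multiplied.
    """
    remaining = 1.0 - time_decrease      # fraction of the original time still needed
    speedup = 1.0 / remaining            # how many times faster the job ran
    efficiency = speedup / node_factor   # 1.0 would be perfectly linear scaling
    return speedup, efficiency


speedup, efficiency = scaling_summary(0.47)
print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
# prints "speedup: 1.89x, parallel efficiency: 94%"
```

Under that assumed definition, a 47% decrease from doubling the nodes is close to, but slightly short of, the 50% decrease that ideal linear scaling would give.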

Conclusions

Our newly developed system is widely accessible as an open resource. It is simple and easy to use for researchers who are accustomed to the declarative command syntax of commercial statistical software and Structured Query Language (SQL). Although the system needs further refinement to allow more flexible scripts and more efficient data processing, it shows promise in facilitating the application of MapReduce technology to large-scale administrative data in health services and clinical research.
Metadata
Title
A user-friendly tool to transform large scale administrative data into wide table format using a MapReduce program with a Pig Latin based script
Authors
Hiromasa Horiguchi
Hideo Yasunaga
Hideki Hashimoto
Kazuhiko Ohe
Publication date
01-12-2012
Publisher
BioMed Central
Published in
BMC Medical Informatics and Decision Making / Issue 1/2012
Electronic ISSN: 1472-6947
DOI
https://doi.org/10.1186/1472-6947-12-151