Abstract
A schema mapping is a high-level specification of the relationship between a source schema and a target schema. Recently, a line of research has emerged that aims at deriving schema mappings automatically or semi-automatically with the help of data examples, that is, pairs consisting of a source instance and a target instance that depict, in some precise sense, the intended behavior of the schema mapping. Several different uses of data examples for deriving, refining, or illustrating a schema mapping have already been proposed and studied.
In this article, we use the lens of computational learning theory to systematically investigate the problem of obtaining algorithmically a schema mapping from data examples. Our aim is to leverage the rich body of work on learning theory in order to develop a framework for exploring the power and the limitations of the various algorithmic methods for obtaining schema mappings from data examples. We focus on GAV schema mappings, that is, schema mappings specified by GAV (Global-As-View) constraints. GAV constraints are the most basic and the most widely supported language for specifying schema mappings. We present an efficient algorithm for learning GAV schema mappings using Angluin's model of exact learning with membership and equivalence queries. This is optimal, since we show that neither membership queries nor equivalence queries suffice, unless the source schema consists of unary relations only. We also obtain results concerning the learnability of schema mappings in the context of Valiant's well-known PAC (Probably-Approximately-Correct) learning model, and concerning the learnability of restricted classes of GAV schema mappings. Finally, as a byproduct of our work, we show that there is no efficient algorithm for approximating the shortest GAV schema mapping fitting a given set of examples, unless the source schema consists of unary relations only.
- Alekhnovich, M., Braverman, M., Feldman, V., Klivans, A. R., and Pitassi, T. 2008. The complexity of properly learning simple concept classes. J. Comput. Syst. Sci. 74, 1, 16--34. Google ScholarDigital Library
- Alexe, B., Chiticariu, L., Miller, R. J., and Tan, W. C. 2008. Muse: Mapping understanding and design by example. In Proceedings of the 24th International Conference on Data Engineering (ICDE'08). 10--19. Google ScholarDigital Library
- Alexe, B., Kolaitis, P. G., and Tan, W. C. 2010. Characterizing schema mappings via data examples. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'10). 261--272. Google ScholarDigital Library
- Alexe, B., ten Cate, B., Kolaitis, P. G., and Tan, W. C. 2011. Designing and refining schema mappings via data examples. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 133--144. Google ScholarDigital Library
- Angluin, D. 1987. Queries and concept learning. Mach. Learn. 2, 4, 319--342. Google ScholarDigital Library
- Angluin, D. 1990. Negative results for equivalence queries. Mach. Learn. 5, 121--150. Google ScholarDigital Library
- Angluin, D., Frazier, M., and Pitt, L. 1992. Learning conjunctions of horn clauses. Mach. Learn. 9, 147--164. Google ScholarDigital Library
- Angluin, D. and Kharitonov, M. 1995. When won't membership queries help? J. Comput. Syst. Sci. 50, 2, 336--355. Google ScholarDigital Library
- Anthony, M. and Biggs, N. 1992. An Introduction to Computational Learning Theory. Cambridge University Press. Google ScholarDigital Library
- Barcelo, P. 2009. Logical foundations of relational data exchange. SIGMOD Rec. 38, 1, 49--58. Google ScholarDigital Library
- Bernstein, P. A., Green, T. J., Melnik, S., and Nash, A. 2008. Implementing mapping composition. VLDB J. 17, 2, 333--353. Google ScholarDigital Library
- Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1987. Occam's razor. Inf. Process. Lett. 24, 6, 377--380. Google ScholarDigital Library
- Bonifati, A., Chang, E. Q., Ho, T., Lakshmanan, V. S., and Pottinger, R. 2005. HePToX: Marrying XML and heterogeneity in your P2P databases. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05). 1267--1270. Google ScholarDigital Library
- ten Cate, B., Dalmau, V., and Kolaitis, P. 2012. Learning schema mappings. In Proceedings of the International Conference on Database Theory (ICDT'12). ACM Press, New York, 22--33. Google ScholarDigital Library
- ten Cate, B., Kolaitis, P. G., and Tan, W. C. 2010. Database constraints and homomorphism dualities. In Proceedings of the 16th International Conference on Principles and Practice of Constraint Programming (CP'10). 475-490. Google ScholarDigital Library
- Chandra, A. K. and Merlin, P. M. 1977. Optimal implementation of conjunctive queries in relational data bases. In Proceedings of the 9th Annual ACM Symposium on Theory of Computing (STOC'77). 77--90. Google ScholarDigital Library
- Chekuri, C. and Rajaraman, A. 2000. Conjunctive query containment revisited. Theor. Comput. Sci. 239, 211--229. Google ScholarDigital Library
- Ciucanu, R. and Staworko, S. 2013. Learning schemas for unordered xml. http://dbpl2013.inria.fr/slides/dbpl13-ciucanu-slides.pdf.Google Scholar
- Cohen, W. W. 1995a. PAC-learning non-recursive prolog clauses. Artif. Intell. 79, 1, 1--38. Google ScholarDigital Library
- Cohen, W. W. 1995b. PAC-learning recursive logic programs: Efficient algorithms. J. Artif. Intell. Res. 2, 501--539. Google ScholarDigital Library
- Cohen, W. W. 1995c. PAC-learning recursive logic programs: Negative results. J. Artif. Intell. Res. 2, 541--573. Google ScholarDigital Library
- Das Sarma, A., Parameswaran, A. G., Garcia-Molina, H., and Widom, J. 2010. Synthesizing view definitions from data. In Proceedings of the 13th International Conference on Database Theory (ICDT'10). 89--103. Google ScholarDigital Library
- Doan, A. 2002. Learning to map between structured representations of data. Ph.D. thesis, Department of Computer Science and Engineering, University of Washington. Google ScholarDigital Library
- Doan, A., Madhavan, J., Domingos, P., and Halevy, A. Y. 2002. Learning to map between ontologies on the Semantic web. In Proceedings of the 11th International Conference on World Wide Web (WWW'02). 662--673. Google ScholarDigital Library
- Fagin, R., Kolaitis, P. G., Miller, R. J., and Popa, L. 2005. Data exchange: Semantics and query answering. Theor. Comput. Sci. 336, 1, 89--124. Google ScholarDigital Library
- Fan, W., Geerts, F., Lakshmanan, L. V. S., and Xiong, M. 2009. Discovering conditional functional dependencies. In Proceedings of the IEEE International Conference on Data Engineering (ICDE'09). 1231--1234. Google ScholarDigital Library
- Feldman, V. 2009. Hardness of approximate two-level logic minimization and pac learning with membership queries. J. Comput. Syst. Sci. 75, 1, 13--26. Google ScholarDigital Library
- Flach, P. A. and Savnik, I. 1999. Database dependency discovery: A machine learning approach. AI Comm. 12, 3, 139--160. Google ScholarDigital Library
- Fletcher, G. H. L., Gyssens, M., Paredaens, J., and Gucht, D. V. 2009. On the expressive power of the relational algebra on finite sets of relation pairs. IEEE Trans. Knowl. Data Engin. 21, 6, 939--942. Google ScholarDigital Library
- Fletcher, G. H. L. and Wyss, C. M. 2009. Towards a general framework for effective solutions to the data mapping problem. J. Data Semantics 14, 37--73. Google ScholarDigital Library
- Gillis, J. J. M. and Van Den Bussche, J. 2009. Induction of relational algebra expressions. In Proceedings of the 19th International Conference on Inductive Logic Programming (ILP'09). 25--33. Google ScholarDigital Library
- Gold, M. E. 1967. Language identification in the limit. Inf. Control 10, 5, 447--474.Google ScholarCross Ref
- Gottlob, G. and Senellart, P. 2010. Schema mapping discovery from data instances. J. ACM 57, 2, 6:1--6:37. Google ScholarDigital Library
- Gutjahr, W., Welzl, E., and Woeginger, G. 1992. Polynomial graph-colorings. Discr. Appl. Math. 35, 29--46. Google ScholarDigital Library
- Haas, L. M., Hernandez, M. A., Ho, H., Popa, L., and Roth, M. 2005. Clio grows up: From research prototype to industrial tool. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'05). 805--810. Google ScholarDigital Library
- Haussler, D. 1989. Learning conjunctive concepts in structural domains. Mach. Learn. 4, 7--40. Google ScholarDigital Library
- Hellerstein, L., Pillaipakkamnatt, K., Raghavan, V. V., and Wilkins, D. 1996. How many queries are needed to learn? J. ACM 43, 5, 840--862. Google ScholarDigital Library
- Hirata, K. 2000. On the hardness of learning acyclic conjunctive queries. In Proceedings of the 11th International Conference on Algorithmic Learning Theory (ALT'00). 238--251. Google ScholarDigital Library
- Kearns, M. J. and Valiant, L. G. 1994. Cryptographic limitations on learning boolean formulae and finite automata. J. ACM 41, 1, 67--95. Google ScholarDigital Library
- Kearns, M. J. and Vazirani, U. V. 1997. An Introduction to Computational Learning Theory. The MIT Press. Google ScholarDigital Library
- Kolaitis, P. G. 2005. Schema mappings, data exchange, and metadata management. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS'05). 61--75. Google ScholarDigital Library
- Kolaitis, P. G. and Vardi, M. Y. 2000. Conjunctive-query containment and constraint satisfaction. J. Comput. Syst. Sci. 61, 2, 302--332. Google ScholarDigital Library
- Lenzerini, M. 2002. Data integration: A theoretical perspective. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS'02). 233--246. Google ScholarDigital Library
- Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. Very Large Database J. 10, 4, 334--350. Google ScholarDigital Library
- Schapire, R. E. 1990. The strength of weak learnability. Mach. Learn. 5, 197--227. Google ScholarDigital Library
- Staworko, S. and Wieczorek, P. 2012. Learning twig and path queries. In Proceedings of the International Conference on Database Theory (ICDT'12). ACM Press, New York, 140--154. Google ScholarDigital Library
- Tran, Q. T., Chan, C.-Y., and Parthasarathy, S. 2009. Query by output. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'09). 535--548. Google ScholarDigital Library
- Valiant, L. G. 1984. A theory of the learnable. Comm. ACM 27, 11, 1134--1142. Google ScholarDigital Library
- Valiant, L. G. 1985. Learning disjunctions of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence (IJCAI'85). 560--566. Google ScholarDigital Library
- Yan, L., Miller, R. J., Haas, L. M., and Fagin, R. 2001. Data-driven understanding and refinement of schema mappings. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'01). 485--496. Google ScholarDigital Library
Index Terms
- Learning schema mappings
Recommendations
Designing and refining schema mappings via data examples
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of dataA schema mapping is a specification of the relationship between a source schema and a target schema. Schema mappings are fundamental building blocks in data integration and data exchange and, as such, obtaining the right schema mapping constitutes a ...
Characterizing schema mappings via data examples
Schema mappings are high-level specifications that describe the relationship between two database schemas; they are considered to be the essential building blocks in data exchange and data integration, and have been the object of extensive research ...
Learning schema mappings
ICDT '12: Proceedings of the 15th International Conference on Database TheoryA schema mapping is a high-level specification of the relationship between a source schema and a target schema. Recently, a line of research has emerged that aims at deriving schema mappings automatically or semi-automatically with the help of data ...
Comments