《Data exchange.pdf》由会员分享,可在线阅读,更多相关《Data exchange.pdf(36页珍藏版)》请在得力文库 - 分享文档赚钱的网站上搜索。
1、Theoretical Computer Science 336(2005) exchange:semantics and query answeringRonald Fagina,Phokion G.Kolaitisb,1,Rene J.Millerc,1,Lucian Popaa,bUniversity of California at Santa CruzcUniversity of TorontoaIBM Almaden ResearchCenterAbstractData exchange is the problem of taking data structured under
2、a source schema and creating aninstance of a target schema that reflects the source data as accurately as possible.In this paper,weaddress foundational and algorithmic issues related to the semantics of data exchange and to thequery answering problem in the context of data exchange.These issues aris
3、e because,givena sourceinstance,theremaybemanytargetinstancesthatsatisfytheconstraintsofthedataexchangeproblem.Wegive an algebraic specification that selects,among all solutions to the data exchange problem,a special class of solutions that we call universal.We show that a universal solution has no
4、moreand no less data than required for data exchange and that it represents the entire space of possiblesolutions.We then identify fairly general,yet practical,conditions that guarantee the existence ofa universal solution and yield algorithms to compute a canonical universal solution efficiently.We
5、adoptthenotionofthe“certainanswers”inindefinitedatabasesforthesemanticsforqueryansweringin data exchange.Weinvestigate the computational complexity of computing the certain answers inthis context and also address other algorithmic issues that arise in data exchange.In particular,westudy the problem
6、of computing the certain answers of target queries by simply evaluatingthem on aACorresponding author.preliminary version of this paper appeared in the 2003 International Conference on Database Theory 15.E-mail addresses:(R.Fagin),kolaitiscs.ucsc.edu(P.G.Kolaitis),millercs.toronto.edu(R.J.Miller),(L
7、.Popa).1ResearchcarriedoutwhiletheseauthorswerevisitingscientistsattheIBMAlmadenResearchCenter.Kolaitiswas also partially supported by NSF Grant IIS-9907419.Miller was also partially supported by a research grantfrom NSERC.0304-3975/$-see front matter 2004 Elsevier B.V.All rights reserved.doi:10.101
8、6/j.tcs.2004.10.03390R.Faginet al./Theoretical Computer Science 336(2005)89124canonicaluniversalsolution,andweexploretheboundaryofwhatqueriescanandcannotbeansweredthis way,in a data exchange setting.2004 Elsevier B.V.All rights reserved.Keywords:Data exchange;Data integration;Dependencies;Universals
9、olution;Chase;Query answering;Certainanswers;Computational complexity;First-order inexpressibility1.1.IntroductionIntroductionIn data exchange,data structured under one schema(which we call a source schema)must be restructured and translated into an instance of a differentschema(a targetschema).Data
10、 exchange is used in many tasks that require data to be transferred between exist-ing,independently created applications.The first systems supporting the restructuring andtranslation of data were built several decades ago.An early such system was EXPRESS30,which performed data exchange between hiera
11、rchical schemas.The need for systemssupporting data exchangehas persisted overthe years.Recently this need has become morepronounced,as the terrain for data exchange has expanded with the proliferation of webdata that are stored in different formats,such as traditional relational database schemas,se
12、mi-structured schemas(for example,DTDs or XML schemas),and various scientificformats.In this paper,we address several foundational and algorithmic issues related tothe semantics of data exchange and to the query answering problem in the context of dataexchange.1.1.The data exchange problemIn a data
13、exchange setting,we have a source schema S S and a target schema T T,where weassume that S S and T T are disjoint.Since T T can be an independently created schema,it mayhave its own constraints that are given as a settof sentences in some logical formalismover T T.In addition,we must have a way of m
14、odeling the relationship between the sourceand targetschemas.This essential element of data exchangeis captured by source-to-targetdependencies that specify how and what source data should appear in the target.Thesedependencies are assertions between a source query and a target query.Formally,we hav
15、ea setstof source-to-targetdependencies of the form x x(S S(x x)T T(x x),whereS S(x x)is a formula in some logical formalism over S S andT T(x x)is a formula in some(perhapsdifferent)logical formalism over T T.Weassume that all of the variables in x x appear free inS S(x x).We point out that schema
16、mapping tools,such as Clio 26,27,permit the(semi-)automatic discovery of such source-to-target dependencies.Other data translation toolspermit restricted forms of such dependencies to be specified in a rule language and,incertain cases,to be automatically derived from“correspondence”rules between ob
17、jects3.Consider a fixed data exchange setting determined by S S,T T,st,andtas above.Thissetting gives rise to the following data exchange problem:given an instance I over thesource schema S S,materialize an instance J over the target schema T T such that the targetdependenciestare satisfied by J,and
18、 the source-to-target dependenciesstare satisfiedR.Faginet al./Theoretical Computer Science 336(2005)8912491by I and J together.The source schema may also have dependencies that we assume aresatisfiedbythegivensourceinstance.Hence,thesourcedependenciesdonotplayanydirectrole in defining the semantics
19、 of data exchange.The first crucial observation is that there may be many solutions(or none)for a giveninstance of the data exchange problem.Hence,several conceptual and technical questionsarise concerning the semantics of data exchange.First,when does a solution exist?If manysolutions exist,which s
20、olution should we materialize and what properties should it have,sothat it reflects the source data as accurately as possible?Finally,can such a“good”solutionbe efficiently computed?Weconsider the semantics of the data exchangeproblem to be one of the twomain issuesin data exchange.We believe that t
21、he other main issue is query answering.Specifically,suppose that q is a query over the target schema T T,and I is an instance over the sourceschema S S.What does answering q with respect to I mean?Clearly,there is an ambiguityarising from the fact that,as mentioned earlier,there may be many solution
22、s J for I and,as a result,different such solutions J may produce different answers q(J).This conceptualdifficulty was first encountered in the context of incomplete or indefinite databases,whereonehastofindthe“right”answerstoaqueryposedagainstasetof“possible”databases(see,for instance,32).An incompl
23、ete database can be thought of as the set of all databasesthat satisfy a certain specification,that is,all databases that are“possible”for the givenspecification.In this sense,the data exchange problem can be viewed as the problem ofexchanging data between a source database I and an incomplete datab
24、ase representing alltargetinstances Jthat are solutions for I(theysatisfy the specifications of the data exchangeproblem),exceptthatoneisinterestedinactuallymaterializingoneofthesesolutions.Now,suppose that a query is posed against an incomplete database.There is general agreementthatinthiscontext,t
25、he“right”answersarethecertainanswers,thatis,theanswersthatoccurin the intersection of all q(J)s,as J variesoverall“possible”databases.This notion makesgood sense for data exchange as well,where,as discussed above,the“possible”databasesare the solutions J for the instance I.It also has the benefit th
26、at the query semantics isindependent of the specific solution we select for data exchange.Wethus adopt the certainanswers as the semantics of query answering in the data exchange setting and investigatethe complexity of computing the certain answers in the data exchange setting.A relatedimportant qu
27、estion is whether the certain answers of a query can be computed by queryevaluation on the“good”target instance that we chose to materialize.1.2.Data exchange vs.data integrationBefore describing our results on data exchange,we briefly compare and contrast data ex-changewithdataintegration.Following
28、theterminologyandnotationintherecentoverview21,a data integration system is a triple G,S,M,where G is the global schema,S isthe source schema,and M is a set of assertions relating elements of the global schemawith elements of the source schema.Both G and S are specified in suitable languages thatmay
29、 allow for the expression of various constraints.In this generality,a data exchange set-ting(S S,T T,st,t)can be thought of as a data integration system in which S S is the sourceschema,T T andtform the global schema,and the source-to-target dependencies instare the assertions of the data integratio
30、n system.In practice,however,most data integration92R.Faginet al./Theoretical Computer Science 336(2005)89124systems studied to date are either local-as-view(LAV)systems or global-as-view(GAV)systems 18,21,22.In an LAV system,each assertion in M relates one element of thesource schema S to a query(a
31、 view)over the global schema G;moreover,it is typicallyassumed that there are no target constraints(t=).In a GAV system the reverse holds,that is,each assertion in M relates one element of the global schema G to a query(a view)over the source schema S.Since the source-to-target dependenciesstrelate
32、a query overthesourceschemaS StoaqueryoverthetargetschemaT T,adataexchangesettinggeneralizesboth an LAV and a GAV system.In fact,it can be thought of as a global-and-local-as-view(GLAV)system 17,21.The abovesimilarities notwithstanding,there are important differencesbetween data ex-change and data i
33、ntegration.As mentioned earlier,in data exchange scenarios,the targetschema is often independently created and comes with its own constraints.In data inte-gration,however,the global schema G is commonly assumed to be a reconciled,virtualview of a heterogeneous collection of sources and,as such,it is
34、 often assumed to have noconstraints.Therehasbeen,however,somerecentworkthatconsideredtheimpactoftargetconstraints in data integration.This research includes,in particular,the work of Duschkaetal.12,whichshowedhowtocomputemaximallycontainedqueryplansoftargetqueriesin an LAV data integration system w
35、ith target full dependencies,and the work of Cal et al.6,which studied the impact of key and foreign key constraints on query answering in aGAV system.A more significant difference between data exchange and data integration isthatinadataexchangesettingwehavetoactuallymaterializeafinitetargetinstance
36、thatbestreflects the given source instance.In data integration no such exchange of data is required.For query answering,both data exchange and data integration use the certain answers asthe standard semantics of queries over the target(global)schema.In data integration,thesource instances are used t
37、o compute the certain answers of queries overthe global schema.In contrast,in a data exchangesetting,it may not be feasible to couple applications togetherin a manner that data may be retrieved and shared on-demand at query time.This mayoccur,for instance,in peer-to-peer applications that must share
38、 data,yet maintain a highdegree of autonomy.Hence,queries overthe target schema may haveto be answered usingthe materialized target instance alone,without reference to the original source instance.This leads to the followingproblem in data exchange:under what conditions and for whichqueries can the
39、certain answers be computed using just the materialized target instance?1.3.Motivation from ClioThe results presented here were motivated by our experience with Clio,a prototypeschema mapping and data exchange tool to whose development some of us have con-tributed 26,27.In Clio,source-to-target depe
40、ndencies(forming a GLAV system)are(semi)-automatically generated from a set of correspondences between the source schemaand the target schema;these dependencies can then be used in a data integration systemto compute the certain answers to target queries.Most of the applications we considered,howeve
41、r,were decoupled applications that would have had to be rewritten to operate co-operatively,as required in data integration.For this reason,early on in the development ofClio,we recognized the need to go farther and,given a source instance,generate a single“universal”target instance(satisfying the t
42、arget dependencies)that was the result of theR.Faginet al./Theoretical Computer Science 336(2005)8912493schema mapping.In designing the algorithms of Clio for creating the target instance,wewereguidedmainlybyourintuitionratherthanbyformalconsiderations.Itshouldbenotedthat there is a long history of
43、work on data translation that focuses on taking high-level,data-independenttranslationrulesandgeneratingefficient,executabletranslationprograms3,29,30.Yet,we could not find a formal justification for the intuitive choices we madein creating the target instance.In seeking to formalize this intuition
44、and justify the choicesmade in Clio,we were led to explore foundational and algorithmic issues related to the se-manticsofdataexchangeandqueryansweringinthissetting.Cliosupportsschemasthatarerelationalornested(XML).However,challengingissuesalreadyariseintherelationalcase.For this reason,here we focu
45、s exclusively on data exchange between relational schemas;extending this work to other types of schemas is the subject of on-going investigation.1.4.Summary of resultsInSection2,weformallyintroducethedataexchangeproblem.Wethengiveanalgebraicspecification that selects,among all possible solutions for
46、 a givensource instance,a specialclassofsolutionsthatwecalluniversal.Moreprecisely,asolutionforaninstanceofthedataexchangeproblemisuniversalifithashomomorphismstoallsolutionsforthatinstance.Weshowthat a universalsolution has“good”properties that justify its choice for the semanticsofthedataexchangep
47、roblem.WenotethatCaletal.6studiedGAVsystemswithkeyandforeign keyconstraints at the target.By means of a logic program that simulates the foreignkey constraints,they constructed a canonical database,which turns out to be a particularinstance of our notion of universal solution.Giventhedeclarativespec
48、ificationofuniversalsolutions,wegooninSection3toidentifyfairly general,yet practical,sufficientconditions that guarantee the existence of a universalsolution and yield algorithms to compute such a solution efficiently.Towardsthis goal,weuse the concept of a weakly acyclic set of targetdependencies;t
49、his concept is broad enoughto contain as special cases both sets of full tuple-generating dependencies(full tgds)5and acyclicsets of inclusion dependencies 9.In Section 3,we provethat if(S S,T T,st,t)is a data exchange setting such thatstis a set of tgds andtis the union of a weaklyacyclic set of tg
50、ds with a set of equality generating dependencies(egds),then,given asource instance,a universal solution to the data exchange problem exists if and only if asolutionexists.Moreover,foreachdataexchangesetting(S S,T T,st,t)satisfyingtheaboveconditions,there is a polynomial-time algorithm that,given a