Data Mining.ppt


Data Mining: Concepts and Techniques
Tutorial based on the book by Jiawei Han and Micheline Kamber
Made by Radmilo Pesic & Branko Golubovic

1 Introduction

What motivated data mining?
Necessity is the mother of invention.
- Data Collection and Database Creation (1960s and earlier)
- Database Management Systems (1970s to early 1980s)
- Advanced Database Systems (mid-1980s to present)
- Web-based Database Systems (1990s to present)
- Data Warehousing and Data Mining (mid-1980s to present)
- New Generation of Integrated Information Systems (2000-)

What Is Data Mining?
Extracting or “mining” knowledge from large amounts of data.
[Figure: databases and flat files pass through cleaning and integration into a data warehouse; selection and transformation feed data mining; the resulting patterns pass through evaluation and presentation to become knowledge]
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation

Components of a typical data mining system:
- Database, data warehouse, or other information repository
- Database or data warehouse server
- Knowledge base
- Data mining engine
- Pattern evaluation module
- Graphical user interface
[Figure: architecture of a typical data mining system]
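The seven steps listed above, from data cleaning to knowledge presentation, can be read as a pipeline in which each stage feeds the next. The following is a minimal sketch of that idea, not the tutorial's own implementation: the records, the function names, and the frequency-count "mining" step are all assumptions chosen to keep the example small.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"cust": "C1", "item": "TV"}, {"cust": "C1", "item": "CD player"},
           {"cust": "C2", "item": None}]
store_b = [{"cust": "C3", "item": "TV"}, {"cust": "C4", "item": "TV"}]

def clean(records):
    """Step 1: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, field):
    """Step 3: keep only the attribute relevant to the task."""
    return [r[field] for r in records]

def transform(values):
    """Step 4: normalize values into a form suitable for mining."""
    return [v.strip().lower() for v in values]

def mine(values):
    """Step 5: a trivial stand-in for mining, counting item occurrences."""
    return Counter(values)

def evaluate(patterns, min_count):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return {p: n for p, n in patterns.items() if n >= min_count}

def present(patterns):
    """Step 7: render the surviving patterns for the user."""
    return [f"{p}: {n}" for p, n in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(data, "item"))), min_count=2)
report = present(patterns)
print(report)
```

In a real system the mining and evaluation stages would be far richer, but the shape stays the same: the mining engine produces candidate patterns and the pattern evaluation module filters them before presentation.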

Data Mining: On What Kind of Data?
- Relational Databases
- Data Warehouses
- Transactional Databases
- Advanced Database Systems and Advanced Database Applications (object-oriented, object-relational, spatial, temporal, time-series, text, multimedia, heterogeneous, and legacy databases, and the World Wide Web)

Relational Databases

customer
cust_ID | name | address | age | income | credit_info
C1 | Smith, Sandy | 5463 E Hastings, Burnaby, BC V5A 4S9, Canada | 21 | $27000 | 1

item
item_ID | name | brand | category | type | price | place_made | supplier | cost
I3 | high_res_TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
I8 | multidisc-CDplay | Sanyo | multidisc | CD player | $369.00 | Japan | Music Front | $120.00

employee
empl_ID | name | category | group | salary | commission
E55 | Jones, Jane | home entertainment | manager | $18,000 | 2%

branch
branch_ID | name | address
B1 | City Square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada

purchases
trans_ID | cust_ID | empl_ID | date | time | method_paid | amount
T100 | C1 | E55 | 09/21/98 | 15:45 | Visa | $1357.00

item_sold
trans_ID | item_ID | qty
T100 | I3 | 1
T100 | I8 | 2

works_at
empl_ID | branch_ID
E55 | B1

Data Warehouses
[Figure: typical architecture of a data warehouse for AllElectronics. Data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into the data warehouse, which clients reach through query and analysis tools]
[Figure: a data cube of sales data with dimensions item (types), time (quarters), and address (cities), shown next to a version summarized by address (countries) and a version detailed by time (months)]

[Figure captions: roll-up on address; drill-down on time data for Q1]

Text Databases and Multimedia Databases
- Text databases can be highly unstructured, semistructured, or well structured
- Multimedia databases store image, audio, and video data
- Such data require a lot of storage space; it is continuous-media data

Heterogeneous Databases and Legacy Databases

The World Wide Web
- mining path traversal patterns

Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
- Concept/Class Description: Characterization and Discrimination
- Association Analysis
- Classification and Prediction
- Cluster Analysis
- Outlier Analysis
- Evolution Analysis

Are All of the Patterns Interesting?
A pattern is interesting if it is:
- easily understood
- valid
- (potentially) useful
- novel
or if it confirms the user's hypothesis.
An interesting pattern represents knowledge!

Objective measures of pattern interestingness:
- support
- confidence
Subjective measures of pattern interestingness:
- data is unexpected
- data is actionable
- data is expected
Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?

Classification of Data Mining Systems
- according to the kinds of databases mined (relational, data warehouse, object-oriented)
- according to the kinds of knowledge mined (association, classification, clustering; generalized, primitive-level, or knowledge at multiple levels; regularities or irregularities)
- according to the kinds of techniques utilized (autonomous, interactive exploratory, or query-driven systems; data warehouse oriented, statistics)
- according to the applications adapted (for finance, DNA, etc.)
[Figure: data mining and its related disciplines: database technology, information science, machine learning, statistics, visualization, and other disciplines]
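The two objective interestingness measures named earlier, support and confidence, are usually defined for an association rule A => B as the fraction of transactions containing both A and B, and the fraction of transactions containing A that also contain B. The sketch below computes both under those standard definitions; the transactions and the rule are made-up illustrations, not data from the tutorial.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"phone"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and B) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```

A mining system would typically keep only rules whose support and confidence exceed user-chosen thresholds, which is one way the objective measures prune uninteresting patterns.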

Major Issues in Data Mining
Mining methodology and user interaction issues:
- Mining different kinds of knowledge in databases
- Interactive mining of knowledge at multiple levels of abstraction
- Incorporation of background knowledge
- Data mining query languages and ad hoc data mining
- Presentation and visualization of data mining results
- Handling noisy or incomplete data
- Pattern evaluation: the interestingness problem
Performance issues:
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, and incremental mining algorithms
Issues relating to the diversity of database types:
- Handling of relational and complex types of data
- Mining information from heterogeneous databases and global information systems

2 Data Warehouse and OLAP Technology for Data Mining

What Is a Data Warehouse?
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process.” (W. H. Inmon)
- Subject-oriented
- Integrated
- Time-variant
- Nonvolatile

How are organizations using the information from data warehouses?
- Increasing customer focus
- Repositioning products and managing product portfolios
- Analyzing operations and looking for sources of profit
- Managing the customer relationships, making environmental corrections, and managing the cost of corporate assets

Different approaches to heterogeneous database integration:
- Query-driven approach (wrappers and integrators)
- Update-driven approach

Differences Between Operational Database Systems and Data Warehouses
- Users and system orientation
- Data contents
- Database design
- View
- Access patterns
Why have a separate data warehouse?

A Multidimensional Data Model: From Tables and Spreadsheets to Data Cubes
A data cube is defined by dimensions and facts.
- Dimension table
- Fact table

location = “Chicago”
time | home ent. | comp. | phone | sec.
Q1 | 854 | 882 | 89 | 623
Q2 | 943 | 890 | 64 | 698
Q3 | 1032 | 924 | 59 | 789
Q4 | 1129 | 992 | 63 | 870

location = “New York”
time | home ent. | comp. | phone | sec.
Q1 | 1087 | 968 | 38 | 872
Q2 | 1130 | 1024 | 41 | 925
Q3 | 1034 | 1048 | 45 | 1002
Q4 | 1142 | 1091 | 54 | 984

location = “Toronto”
time | home ent. | comp. | phone | sec.
Q1 | 818 | 746 | 43 | 591
Q2 | 894 | 769 | 52 | 682
Q3 | 940 | 795 | 58 | 728
Q4 | 978 | 864 | 59 | 784

location = “Vancouver”
time | home ent. | comp. | phone | sec.
Q1 | 605 | 825 | 14 | 400
Q2 | 680 | 952 | 31 | 512
Q3 | 812 | 1023 | 30 | 501
Q4 | 927 | 1038 | 38 | 580

[Figure: 3-D data cube of the same sales data, with dimensions item (types), time (quarters), and location (cities)]
A 2-D view of sales data for AllElectronics, and its 3-D data cube representation
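To make the step from per-location 2-D tables to a single 3-D cube concrete, the sketch below stacks two of the tables into a dictionary keyed by (location, time, item). It is only an illustration: the dictionary layout and the helper `total` are assumptions, only Q1 and Q2 for two cities are copied from the sales table, and the mapping of columns to item names assumes the column order home entertainment, computer, phone, security.

```python
# One 2-D table (time x item) per location, as on the slide.
# Only Q1 and Q2 for two cities are copied here to keep the sketch short.
ITEMS = ["home entertainment", "computer", "phone", "security"]
tables = {
    "Toronto":   {"Q1": [818, 746, 43, 591], "Q2": [894, 769, 52, 682]},
    "Vancouver": {"Q1": [605, 825, 14, 400], "Q2": [680, 952, 31, 512]},
}

# The 3-D cube: one cell per (location, time, item) combination.
cube = {
    (location, quarter, item): value
    for location, table in tables.items()
    for quarter, row in table.items()
    for item, value in zip(ITEMS, row)
}

def total(cube, location=None, quarter=None, item=None):
    """Sum the cells that match the given coordinates (None matches any)."""
    return sum(
        value
        for (loc, qtr, itm), value in cube.items()
        if location in (None, loc)
        and quarter in (None, qtr)
        and item in (None, itm)
    )

single_cell = cube[("Vancouver", "Q1", "computer")]
row_total = total(cube, location="Vancouver", quarter="Q1")
print(single_cell, row_total)
```

Each 2-D table is one "layer" of the cube along the location dimension; leaving a coordinate unspecified in `total` aggregates over that dimension.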

[Figure: 3-D cubes with dimensions item (types), time (quarters), and location (cities), one cube for each value of supplier (“SUP1”, “SUP2”)]
A 4-D data cube representation of sales data for AllElectronics

Lattice of cuboids, making up a 4-D data cube:
- 0-D (apex) cuboid: all
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: time, item; time, location; time, supplier; item, location; item, supplier; location, supplier
- 3-D cuboids: time, item, location; time, item, supplier; time, location, supplier; item, location, supplier
- 4-D (base) cuboid: time, item, location, supplier

Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
Star schema:
- a large central table (fact table)
- a set of smaller attendant tables (dimension tables), one for each dimension

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
time dimension table: time_key, day, day_of_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
location dimension table: location_key, street, city, province_or_state, country
branch dimension table: branch_key, branch_name, branch_type
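A star schema is typically queried by joining the fact table to whichever dimension tables the question needs and then aggregating the measures. The sketch below illustrates this with Python's built-in sqlite3 module. It is an assumption-laden miniature: the table names carry a `_dim` or `_fact` suffix of my choosing, only a few of the columns from the schema above are kept, the branch dimension is omitted, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold INTEGER
);
""")

# Hypothetical dimension and fact rows.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 1998), (2, "Q2", 1998)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?)",
                 [(1, "high_res_TV", "TV"), (2, "multidisc-CDplay", "CD player")])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                 [(1, "Vancouver", "Canada"), (2, "Chicago", "USA")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 988.0, 1),   # Q1, TV, Vancouver
    (1, 2, 1, 738.0, 2),   # Q1, CD player, Vancouver
    (2, 1, 2, 1976.0, 2),  # Q2, TV, Chicago
])

# Star join: the fact table joined to each dimension it references.
rows = conn.execute("""
    SELECT l.city, i.type, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    JOIN location_dim AS l ON l.location_key = f.location_key
    WHERE t.quarter = 'Q1'
    GROUP BY l.city, i.type
    ORDER BY l.city, i.type
""").fetchall()
print(rows)
```

The fact table holds only keys and measures, so every descriptive attribute used for filtering or grouping (quarter, type, city) comes from a dimension table.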

Snowflake schema:
- a variant of the star schema in which some dimension tables are normalized
- normalization reduces redundancy, but it also reduces the effectiveness of browsing

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
time dimension table: time_key, day, day_of_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, province_or_state, country

Fact constellation: multiple fact tables share dimension tables

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
shipping fact table: item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
time dimension table: time_key, day, day_of_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
location dimension table: location_key, street, city, province_or_state, country
branch dimension table: branch_key, branch_name, branch_type
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type

Defining multidimensional schemas
DMQL: a data mining query language
Syntax:
cube definition:
    define cube <cube_name> [<dimension_list>]: <measure_list>
dimension definition:
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Example: constellation schema defined in DMQL:
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Measures: Their Categorization and Computation
Measures, based on the aggregate function:
- Distributive
- Algebraic
- Holistic

Introducing Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level concepts.
[Figure: a concept hierarchy for location, with levels all, country, province_or_state, and city]
all
    Canada
        British Columbia: Vancouver, Victoria
        Ontario: Toronto, Ottawa
    USA
        New York: Buffalo, New York
        Illinois: Chicago

Hierarchical and lattice structures of attributes in warehouse dimensions:
- Hierarchy for location: street < city < province_or_state < country
- Lattice for time: day < {month < quarter; week} < year

OLAP Operations in the Multidimensional Data Model
- Roll-up
- Drill-down
- Slice and dice
- Pivot (rotate)
- Other (drill-across, drill-through)
[Figure: the central data cube, with dimensions item (types), time (quarters), and location (cities)]
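The OLAP operations listed above can be sketched on a small in-memory cube. This is an illustrative sketch only: the cube is a plain dictionary keyed by (city, quarter, item), the function names are assumptions rather than any OLAP product's API, and the cell values are an excerpt chosen for the example. Roll-up uses one level of the location concept hierarchy (city to country).

```python
from collections import defaultdict

# Cells keyed by (city, quarter, item); a small excerpt for illustration.
cube = {
    ("Chicago", "Q1", "computer"): 882, ("Chicago", "Q1", "phone"): 89,
    ("Chicago", "Q2", "computer"): 890, ("Chicago", "Q2", "phone"): 64,
    ("Toronto", "Q1", "computer"): 746, ("Toronto", "Q1", "phone"): 43,
    ("Toronto", "Q2", "computer"): 769, ("Toronto", "Q2", "phone"): 52,
    ("Vancouver", "Q1", "computer"): 825, ("Vancouver", "Q1", "phone"): 14,
    ("Vancouver", "Q2", "computer"): 952, ("Vancouver", "Q2", "phone"): 31,
}

# One level of the location concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Chicago": "USA", "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up_location(cube, hierarchy):
    """Roll-up: climb the location hierarchy and sum the merged cells."""
    result = defaultdict(int)
    for (city, quarter, item), value in cube.items():
        result[(hierarchy[city], quarter, item)] += value
    return dict(result)

def slice_time(cube, quarter):
    """Slice: fix one value of the time dimension, leaving a 2-D table."""
    return {(city, item): v for (city, q, item), v in cube.items() if q == quarter}

def dice(cube, cities, quarters, items):
    """Dice: select a sub-cube by restricting several dimensions at once."""
    return {
        key: v for key, v in cube.items()
        if key[0] in cities and key[1] in quarters and key[2] in items
    }

def pivot(table):
    """Pivot: rotate a 2-D table by swapping its two axes."""
    return {(b, a): v for (a, b), v in table.items()}

by_country = roll_up_location(cube, CITY_TO_COUNTRY)
q1 = slice_time(cube, "Q1")
sub_cube = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"computer"})
rotated = pivot(q1)

print(by_country[("Canada", "Q1", "computer")])
print(rotated[("phone", "Vancouver")])
print(len(sub_cube))
```

Drill-down is the reverse of roll-up and cannot be computed from the rolled-up cube alone; it needs the more detailed data, which is why this sketch keeps the city-level cube as the source of truth.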

[Figure: OLAP operations applied to the central cube]
- roll-up on location (from cities to countries)
- drill-down on time (from quarters to months)
- dice for (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”)
- slice for time = “Q1”
- pivot

A Starnet Query Model for Querying Multidimensional Databases
[Figure: a starnet with one radial line per dimension]
- location: continent, country, province_or_state, city, street
- time: day, month, quarter, year
- item: name, brand, category, type
- customer: name, category, group

Data Warehouse Architecture
Steps for the Design and Construction of Data Warehouses
The Design of a Data Warehouse: A Business Analysis Framework
- top-down view
- data source view
- data warehouse view
- business query view

The Process of Data Warehouse Design
- top-down approach
- bottom-up approach
- combined approach
- waterfall method
- spiral method
Steps of the warehouse design:
1) Choosing a business process to model;
2) Choosing the grain of the business process;
3) Choosing the dimensions;
4) Choosing the measures.

A Three-Tier Data Warehouse Architecture
[Figure: three-tier data warehouse architecture]
- Bottom tier: data warehouse server. Data from operational databases and external sources is extracted, cleaned, transformed, loaded, and refreshed into the data warehouse and data marts, alongside a metadata repository, monitoring, and administration
- Middle tier: OLAP server
- Top tier: front-end tools (query/report, analysis, data mining) that produce the output

There are three data warehouse models:
- Enterprise warehouse
- Data mart
- Virtual warehouse

Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
Relational OLAP (ROLAP) servers:
- use of relational or extended-relational DBMS
- greater scalability
Multidimensional OLAP (MOLAP) servers:
- use of data cubes
- fast indexing
- possible low storage utilization: use of compression
Hybrid OLAP (HOLAP) servers:
- scalability of ROLAP and faster computation of MOLAP
- Microsoft SQL Server 7.0 OLAP Services supports a HOLAP server

Data Warehouse Implementation
- SQL group by
- Data cube computation extends SQL with compute cube
Example:
- “Compute the sum of sales, grouping by item and city.”
- “Compute the sum of sales, grouping by item.”
- “Compute the sum of sales, grouping by city.”
The possibl
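The example queries above are individual group-bys over the dimensions item and city. A cube over n dimensions, ignoring concept hierarchies, has one group-by per subset of the dimensions, 2^n in total, including the empty group-by that yields the grand total. The sketch below enumerates them naively for two dimensions; the sales rows and function names are hypothetical, and each group-by is recomputed from the base rows purely for clarity.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical base data: one row per (item, city) with its sales amount.
sales = [
    ("TV", "Vancouver", 988),
    ("TV", "Chicago", 1976),
    ("CD player", "Vancouver", 738),
]
DIMENSIONS = ("item", "city")

def group_by(rows, dims):
    """Sum sales grouped by the dimensions named in `dims`."""
    index = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in index)] += row[-1]
    return dict(totals)

def compute_cube(rows):
    """Evaluate one group-by per subset of the dimensions."""
    return {
        dims: group_by(rows, dims)
        for size in range(len(DIMENSIONS) + 1)
        for dims in combinations(DIMENSIONS, size)
    }

cube = compute_cube(sales)
for dims, totals in cube.items():
    print(dims, totals)
```

Because sum is a distributive measure, a real implementation would not rescan the base rows for every group-by; it could derive the coarser results, such as the totals by item, from the finer (item, city) result.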
