Data Mining of Very Large Data.ppt

Uploaded by: 知****量 | Document ID: 17596012 | Uploaded: 2022-05-25 | Format: PPT | Pages: 50 | Size: 113.50 KB
Resource description


1. Data Mining of Very Large Data
- Frequent itemsets, market baskets
- A-priori algorithm
- Hash-based improvements
- One- or two-pass approximations
- High-correlation mining

2. The Market-Basket Model
- A large set of items, e.g., things sold in a supermarket.
- A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.
- Problem: find the frequent itemsets: those that appear in at least s (the support) baskets.

3. Example
- Items = {milk, coke, pepsi, beer, juice}.
- Support = 3 baskets.
  B1 = {m, c, b}
  B2 = {m, p, j}
  B3 = {m, b}
  B4 = {c, j}
  B5 = {m, p, b}
  B6 = {m, p, b, j}
  B7 = {c, b, j}
  B8 = {b, p}
- Frequent itemsets: {m}, {c}, {b}, {p}, {j}, {m, b}, {m, p}, {b, p}.

4. Applications (1)
- Real market baskets: chain stores keep terabytes of information about what customers buy together.
- Tells how typical customers navigate stores, lets them position tempting items.
- Suggests tie-in "tricks," e.g., run a sale on hamburger and raise the price of ketchup.

5. Applications (2)
- "Baskets" = documents; "items" = words in those documents. Lets us find words that appear together unusually frequently, i.e., linked concepts.
- "Baskets" = sentences; "items" = documents containing those sentences. Items that appear together too often could represent plagiarism.

6. Applications (3)
- "Baskets" = Web pages; "items" = linked pages. Pairs of pages with many common references may be about the same topic.
- "Baskets" = Web pages p; "items" = pages that link to p. Pages with many of the same links may be mirrors or about the same topic.

7. Scale of Problem
- WalMart sells 100,000 items and can store hundreds of millions of baskets.
- The Web has 100,000,000 words and several billion pages.

8. Computation Model
- Data is stored in a file, basket-by-basket.
- As we read the file one basket at a time, we can generate all the sets of items in that basket.
- The principal cost of an algorithm is the number of times we must read the file, measured in disk I/Os.
- The bottleneck is often the amount of main memory available on a pass.

9. A-Priori Algorithm (1)
- Goal: find the pairs of items that appear together at least s times.
- Data is stored in a file, one basket at a time.
- The naive algorithm reads the file once, counting in main memory the occurrences of each pair. It fails if #items squared exceeds main memory.

10. A-Priori Algorithm (2)
- A two-pass approach called a-priori limits the need for main memory.
- Key idea: monotonicity. If a set of items appears at least s times, so does every subset.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

11. A-Priori Algorithm (3)
- Pass 1: Read baskets and count in main memory the occurrences of each item. Requires only memory proportional to #items.
- Pass 2: Read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to have occurred at least s times. Requires memory proportional to the square of the number of frequent items only.

12. PCY Algorithm (1)
- Hash-based improvement to A-Priori.
- During Pass 1 of A-Priori, most memory is idle.
- Use that memory to keep counts of buckets into which pairs of items are hashed. Just the count, not the pairs themselves.
- Gives an extra condition that candidate pairs must satisfy on Pass 2.

13. PCY Algorithm (2)
[Figure: main-memory layout on Pass 1 and Pass 2. Labels: item counts, hash table, frequent items, bitmap, counts of candidate pairs.]

14. PCY Algorithm (3)
- PCY Pass 1: Count items. Hash each pair to a bucket and increment its count by 1.
- PCY Pass 2: Summarize buckets by a bitmap: 1 = frequent (count ≥ s); 0 = not. Count only those pairs that (a) consist of two frequent items and (b) hash to a frequent bucket.

15. Multistage Algorithm
- Key idea: after Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY.
- On the middle pass, fewer pairs contribute to buckets, so there are fewer false drops: buckets that have count ≥ s, yet no pair that hashes to that bucket has count ≥ s.

16. Multistage Picture
[Figure: main-memory layout across the passes. Labels: item counts, first hash table, freq. items, bitmap 1, second hash table, bitmap 2, counts of candidate pairs.]

17. Finding Larger Itemsets
- We may proceed beyond frequent pairs to find frequent triples, quadruples, . . .
- Key a-priori idea: a set of items S can only be frequent if S − {a} is frequent for all a in S.
- The kth pass through the file counts the candidate sets of size k: those whose every immediate subset (subset of size k − 1) is frequent.
- Cost is proportional to the maximum size of a frequent itemset.

18. All Frequent Itemsets in [...]

[...] ≥ 1 band.
- Tune b, r, k to catch most similar pairs, few nonsimilar pairs.

43. Example
- Suppose 100,000 columns.
- Signatures of 100 integers. Therefore, signatures take 40 MB.
- But 5,000,000,000 pairs of signatures can take a while to compare.
- Choose 20 bands of 5 integers/band.

44. Suppose C1, C2 are 80% Similar
- Probability C1, C2 identical in one particular band: (0.8)^5 ≈ 0.328.
- Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)^20 ≈ 0.00035.
- I.e., we miss about 1/3000 of the 80%-similar column pairs.

45. Suppose C1, C2 Only 40% Similar
- Probability C1, C2 identical in any one particular band: (0.4)^5 ≈ 0.01.
- Probability C1, C2 identical in ≥ 1 of 20 bands: ≤ 20 × 0.01 = 0.2.
- Small probability C1, C2 not identical in a band, but hash to the same bucket.
- But false positives much lower for similarities [...]

[...] PCY (hashing) → multistage.
- Finding all frequent itemsets: Simple → SON → Toivonen.
- Finding similar pairs: Minhash + LSH, Hamming LSH.
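The two PCY passes described in the slides can be sketched in a few lines of Python and checked against the eight-basket example with support 3. This is only an illustration under stated assumptions: the function names `bucket_of` and `pcy_frequent_pairs`, the parameter `n_buckets`, and the CRC32-based hash are hypothetical choices not taken from the slides, and an in-memory list stands in for the basket file that would be read from disk twice.

```python
from collections import Counter
from itertools import combinations
import zlib


def bucket_of(pair, n_buckets):
    # Deterministic hash of a sorted pair of item names.
    # CRC32 is an assumption for illustration; any hash function would do.
    return zlib.crc32("|".join(pair).encode()) % n_buckets


def pcy_frequent_pairs(baskets, s, n_buckets=11):
    # Pass 1: count single items and, in the otherwise idle memory,
    # count how many pairs hash to each bucket (counts only, not pairs).
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, n_buckets)] += 1

    # Between passes: keep the frequent items and summarize the
    # buckets as a bitmap (True = bucket count >= s).
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]

    # Pass 2: count only pairs that (a) consist of two frequent items
    # and (b) hash to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap[bucket_of(pair, n_buckets)]:
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= s}


# The baskets B1..B8 from the example slide.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "p", "b", "j"}, {"c", "b", "j"}, {"b", "p"},
]
frequent = pcy_frequent_pairs(baskets, s=3)
print(sorted(frequent.items()))
# [(('b', 'm'), 4), (('b', 'p'), 3), (('m', 'p'), 3)]
```

With `n_buckets=1` every pair lands in the same bucket, so condition (b) filters nothing and the sketch behaves like plain two-pass A-Priori. The answer is the same either way, because a frequent pair always makes its own bucket frequent; the bitmap can only reduce how many candidate pairs are counted on Pass 2.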

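The banding arithmetic in the LSH example can be reproduced directly. The sketch below assumes, as the slides' calculation does, that each of the 5 signature positions in a band agrees independently with probability equal to the column similarity. The helper name `candidate_probability` and the figure of 4 bytes per integer (implied by the 40 MB total) are assumptions for illustration.

```python
def candidate_probability(sim, bands, rows):
    # Probability that two columns agree on all `rows` positions
    # in at least one of `bands` bands (independence assumed).
    return 1.0 - (1.0 - sim ** rows) ** bands


n_columns = 100_000
signature_len = 100
bytes_per_int = 4          # assumption: 32-bit integers
bands, rows = 20, 5        # 20 bands of 5 integers each

signature_bytes = n_columns * signature_len * bytes_per_int
n_pairs = n_columns * (n_columns - 1) // 2

# 80%-similar columns: chance of being missed in every band.
p80_band = 0.8 ** rows
miss80 = (1.0 - p80_band) ** bands

# 40%-similar columns: exact chance of becoming a candidate,
# and the looser union bound used on the slide.
p40_band = 0.4 ** rows
exact40 = candidate_probability(0.4, bands, rows)
bound40 = bands * p40_band

print(f"signatures: {signature_bytes / 1e6:.0f} MB, pairs: {n_pairs:,}")
print(f"80% similar: band match {p80_band:.3f}, missed {miss80:.5f}")
print(f"40% similar: band match {p40_band:.4f}, "
      f"candidate {exact40:.3f} (bound {bound40:.3f})")
```

The exact probability for 40%-similar columns comes out slightly below the slide's bound of 0.2, since the bound simply adds the per-band probabilities without subtracting the overlap between bands.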