Welcome to Introduction to Machine Learning! (2010.3.5) *Images come from the Internet.

Coffee Time: IBM Watson DeepQA

Jeopardy!: an American TV show. It requires the players to suss out the subtleties of language, from jokes and puns to irony and anagrams.

IBM Watson on Jeopardy!: on February 14, 15, and 16, 2011, Watson played against Jeopardy!'s two biggest champions:
- Brad Rutter (right): won a whopping $3.25 million playing Jeopardy!, the most cash ever awarded on the show. He is a Johns Hopkins University dropout.
- Ken Jennings (left): holds the title for the longest Jeopardy! winning streak, with 74 consecutive wins in 2004. He holds degrees in computer science and English from Brigham Young University, and an international BA diploma from Seoul Foreign School.

IBM Watson won the Jeopardy! match ("Towards the Open Advancement of Question Answering Systems"). Final: $77,147 (5,000 + 35,734 + 41,413) vs. $21,600 and $24,000.

IBM Watson:
- In development for 4 years.
- Runs on 90 Power 7 servers, each with 4 x 8-core Power 7 processors.
- Does not connect to the Internet: it searches a large-scale knowledge base, not the Internet, covering on the order of a billion pages within 3 seconds.
- Trained with previous questions and games, and with Jeopardy! players: 77 (2009) + 55 (2010, winners).
- Lacks real-time learning ability. E.g., Category: US Cities. Q: "Its largest airport was named for a World War II hero; its second largest, for a World War II battle." A: "What is Chicago/Toronto?"

Technical requirements:
- Answers to questions on any topic: science, geography, popular culture, ...
- Accuracy: not only an answer, but a confident right answer.
- Speed: within 3 seconds or less.
- Advanced linguistic understanding: parse complex sentences; recognize and understand jokes, metaphors, puns, and riddles.
- Real-time analysis of questions; learn from mistakes; be prepared to handle the unexpected.

Techniques involved - DeepQA:
- A massively parallel, probabilistic, evidence-based architecture for answering questions.
- Non-database approach; deep text analytics; NLP and statistical NLP.
- Formulating parallel hypotheses with confidence scores; voting; question interpretation; search; risk assessment.
- Hadoop and UIMA.
- Difficulties/problems in real application scenarios.

Topic 2: Decision Trees

Preliminary: Basic Concepts in the Machine Learning Literature

Basic concepts (generally used in the machine learning literature). Given:
- Instance space X, e.g. possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast.
- Hypothesis class H, e.g. "if (Temp = cold AND Humidity = high) then PlayTennis = no".
- Training examples D: positive and negative examples of the target function c.
Determine: a hypothesis h in H such that h(x) = c(x) for all x in X.

Typically X is exponentially or infinitely large, so in general we can never be sure that h(x) = c(x) for all x in X. Instead, we settle for a good approximation, e.g. h(x) = c(x) for all x in D.

Suppose there are n binary attributes/features (e.g. true/false, warm/cold). Then:
- Instance space X: 2^n elements.
- Concept (hypothesis) space H: at most 2^(2^n) elements (why?).
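As a quick worked illustration of these counts (my own example, not from the slides), take n = 2 binary attributes:

```latex
% Illustration (not from the slides): n = 2 binary attributes.
% An instance fixes a value for each of the two attributes:
\[ |X| = 2^{n} = 2^{2} = 4 \]
% A concept labels every instance positive or negative, i.e. it is a subset
% of X, so the number of distinct concepts is the number of subsets of X:
\[ |H| = 2^{|X|} = 2^{2^{n}} = 2^{4} = 16 \]
```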

Part I. Basic decision tree learning

An example: Enjoy Sport. Given the training examples below, will one enjoy the sport on a coming new day?

Sky    Temp  Humid   Wind    Water  Forecast  Enjoy
Sunny  Warm  Normal  Strong  Warm   Same      Yes
Sunny  Warm  High    Strong  Warm   Same      Yes
Rainy  Cold  High    Strong  Warm   Change    No
Sunny  Warm  High    Strong  Cool   Change    Yes

Classical targeting problems: classification problems involving nominal data. The attribute values are discrete, with no natural notion of similarity and, in general, no order.

Another example: Fruit.
- Color: red, green, yellow, ...
- Size: small, medium, big
- Shape: round, thin
- Taste: sweet, sour

Representation: lists of attributes instead of vectors of real numbers, e.g.
- EnjoySport: 6-tuples on Sky, AirTemp, Humidity, Wind, Water, Forecast, such as (Sunny, Warm, Normal, Strong, Warm, Same).
- Fruit: 4-tuples on Color, Size, Shape, Taste, such as (red, round, sweet, small).

Training examples (Fruit):
- (yellow, thin, medium, sweet)
- (green, round, big, sweet)
- (yellow, thin, small, sweet)
- (green, round, small, sweet)
- (red, round, small, sour)
together with the EnjoySport examples above.
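To make this representation concrete, here is a minimal Python sketch (my own illustration; the attribute names follow the EnjoySport slide) storing each training example as a tuple of nominal values plus a class label:

```python
# Minimal sketch (illustrative): nominal examples are tuples of attribute
# values, not vectors of real numbers.
ATTRIBUTES = ("Sky", "Temp", "Humid", "Wind", "Water", "Forecast")

# Each training example: (6-tuple of nominal values, class label)
ENJOY_SPORT = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]

for values, label in ENJOY_SPORT:
    print(dict(zip(ATTRIBUTES, values)), "->", label)
```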

Decision tree concepts:
- Non-leaf node: a feature/attribute.
- Branch: a value of that feature/attribute.
- Leaf node: a decision/label/class/concept.
- Each root-to-leaf path is a conjunction of attribute tests; the tree as a whole is a disjunction of these conjunctions.

Example 2: Fruits (decision tree shown on the slide).

Decision tree milestones:
- 1966: first proposed by Hunt.
- 1970s-1980s: CART by Friedman and Breiman; ID3 by Quinlan.
- Since the 1990s: comparative studies (Mingers, Dietterich, Quinlan, etc.); the most popular decision-tree algorithm, C4.5, by Quinlan in 1993.

Classical decision tree algorithms: CART (Classification And Regression Trees). A general framework:
- Create or grow a decision tree using training data.
- The decision tree progressively splits the set of training examples into smaller and smaller subsets.
- Stop splitting when each subset is pure, or accept an imperfect decision.
Many decision-tree algorithms follow this framework, including ID3, C4.5, etc.

Classical DTree algorithm: ID3. A top-down, greedy, recursive algorithm. Main cycle:
1. A <- the best decision attribute for the next step.
2. Assign A as the decision attribute for the node.
3. For each value v_i of A, create a new descendant of the node.
4. Sort the training examples to the leaf nodes.
5. If the training examples are perfectly classified, then RETURN; else drill down to the new leaf nodes.
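A compact Python sketch of this main cycle (my own illustration and naming, not the course's code). The attribute-selection criterion is passed in as `score`; ID3's choice for it, information gain, is defined on the following slides:

```python
from collections import Counter

def grow_tree(examples, attrs, score):
    """Top-down recursive main cycle.

    examples : list of (attribute-dict, label) pairs
    attrs    : attribute names still available for splitting
    score    : function (examples, attr) -> how good a split on attr is
               (ID3 plugs in information gain here; see the next slides)
    Returns a nested dict {attr: {value: subtree-or-leaf-label}}.
    """
    labels = [label for _, label in examples]
    # RETURN if the training examples are perfectly classified
    if len(set(labels)) == 1:
        return labels[0]
    # No attribute left to test (same input values): majority leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # A <- the best decision attribute for the next step
    best = max(attrs, key=lambda a: score(examples, a))
    node = {best: {}}
    # For each value of A, create a new descendant and sort examples to it
    for value in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == value]
        node[best][value] = grow_tree(subset, [a for a in attrs if a != best], score)
    return node
```

With the EnjoySport examples converted to attribute dicts and `score` set to an information-gain function like the one sketched further below, `grow_tree` follows ID3's behaviour of splitting until each subset is pure.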

ID3 Q1: Which attribute is the best one? Outlook, Humidity, Wind, ...?

Query selection and node impurity. Fundamental principle: simplicity. We prefer decisions that lead to a simple, compact tree with few nodes. We therefore seek a property query T at each node N that makes the data reaching the immediate descendant nodes as "pure" as possible. Purity and impurity are two sides of the same notion, so: how do we measure impurity?

Entropy impurity (frequently used):

\[ Entropy(N) = -\sum_j P(\omega_j) \log_2 P(\omega_j), \quad \text{with the convention } 0 \log 0 = 0 \]

In information theory, entropy measures the purity/impurity of information, or the uncertainty of information. The uniform distribution gives the maximum value of entropy.

Example:

\[ Entropy(S) = -\frac{29}{64}\log_2\frac{29}{64} - \frac{35}{64}\log_2\frac{35}{64} = 0.993 \]
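A minimal Python version of this entropy measure (an illustrative helper of my own, not from the slides), checked against the 29/35 example above:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(N) = -sum_j P(w_j) * log2 P(w_j); absent classes contribute 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The slide's example: 29 examples of one class, 35 of the other
s = ["+"] * 29 + ["-"] * 35
print(entropy(s))   # about 0.9937, quoted as 0.993 on the slide
```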

Besides entropy impurity:

Gini impurity (Duda prefers Gini impurity):

\[ i(N) = \sum_{i \ne j} P(\omega_i) P(\omega_j) = 1 - \sum_j P^2(\omega_j) \]

Misclassification impurity:

\[ i(N) = 1 - \max_j P(\omega_j) \]

Maximum Gini impurity: it occurs at the uniform distribution, where the impurity is 1 - n(1/n)^2 = 1 - 1/n.

Maximum misclassification impurity: for n classes it equals the maximum Gini impurity, since 1 - max_j P(omega_j) = 1 - 1/n at the uniform distribution.
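The same two measures in Python (again a small illustrative helper set of my own), with the maximum-impurity check for a uniform distribution over n = 4 classes:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: i(N) = 1 - sum_j P(w_j)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def misclassification(labels):
    """Misclassification impurity: i(N) = 1 - max_j P(w_j)."""
    n = len(labels)
    return 1.0 - max(c / n for c in Counter(labels).values())

# Uniform distribution over 4 classes: both reach the maximum 1 - 1/n = 0.75
uniform = ["a", "b", "c", "d"]
print(gini(uniform), misclassification(uniform))   # -> 0.75 0.75
```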

Measuring the change of impurity, Delta I(N): Information Gain (IG), for example. Information gain is the expected reduction in entropy due to sorting on attribute A:

\[ Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v) \]

where Entropy(S) is the entropy of the original set S and the sum is the expected entropy after sorting on A.

Example: Entropy(S) = 0.993. Splitting on A_1 gives Entropy(S_{a1=t}) = 0.706 and Entropy(S_{a1=f}) = 0.742; splitting on A_2 gives Entropy(S_{a2=t}) = 0.936 and Entropy(S_{a2=f}) = 0.619. Then

\[ Gain(S, A_1) = 0.993 - \left(\frac{26}{64}\times 0.706 + \frac{38}{64}\times 0.742\right) = 0.266, \qquad Gain(S, A_2) = 0.121 \]
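A Python version of this gain computation (illustrative; it repeats the entropy helper from above so the snippet is self-contained). It is checked against the Humidity split of the ID3 example a few slides below, where S has 9+/5-, Humidity = High has 3+/4-, and Humidity = Normal has 6+/1-:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).
    `examples` are (attribute-dict, label) pairs."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for value in {x[attr] for x, _ in examples}:
        subset = [label for x, label in examples if x[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# Class counts from the ID3 example below: 9+/5- overall,
# Humidity=High: 3+/4-, Humidity=Normal: 6+/1-
data = ([({"Humidity": "High"}, "+")] * 3 + [({"Humidity": "High"}, "-")] * 4 +
        [({"Humidity": "Normal"}, "+")] * 6 + [({"Humidity": "Normal"}, "-")] * 1)
print(information_gain(data, "Humidity"))   # about 0.1518; the slide quotes 0.151
```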

ID3 Q2: When to RETURN (stop splitting)? "If the training examples are perfectly classified":
- Condition 1: if all the data in the current subset have the same output class, then stop.
- Condition 2: if all the data in the current subset have the same input values, then stop.
- Possible Condition 3: if all the attributes' IG scores are 0, then stop. Is that a good idea?

Counterexample: y = a XOR b.

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

Information gain of each attribute at the root:

Attr  value  probability  IG
a     0      50%          0
      1      50%
b     0      50%          0
      1      50%

According to the proposed Condition 3, no attribute could be chosen even at the first step. But if we ignore Condition 3 and split on a anyway (root: 2+, 2-; each branch: 1+, 1-), splitting each branch on b then yields pure leaves. So there are ONLY 2 conditions for stopping splitting in ID3: the same output class, or the same input values. Discussion: if examples have the same input but different outputs, what does that mean?
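A quick numerical check of the XOR case (an illustrative script of my own): the information gain of both a and b at the root is exactly 0.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of splitting the truth table on column `col` (0 = a, 1 = b)."""
    labels = [row[-1] for row in rows]
    g = entropy(labels)
    for value in {row[col] for row in rows}:
        subset = [row[-1] for row in rows if row[col] == value]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

xor = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # (a, b, y) with y = a XOR b
print(gain(xor, 0), gain(xor, 1))   # -> 0.0 0.0
```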

ID3 example: training samples (the PlayTennis table shown on the slide). Total: 9+, 5-. For Humidity: High: 3+, 4-; Normal: 6+, 1-.

ID3 example: feature selection.

\[ Gain(S, Humidity) = 0.940 - \frac{7}{14}\times 0.985 - \frac{7}{14}\times 0.592 = 0.151 \]
\[ Gain(S, Outlook) = 0.246, \quad Gain(S, Wind) = 0.048, \quad Gain(S, Temperature) = 0.029 \]

Hypothesis space search by ID3:
- The hypothesis space is complete: the target function is surely in there.
- It outputs a single hypothesis: it can't play "20 questions" (by experience).
- No backtracking: local minima.
- It uses all the data in the subset at each step: statistically based search choices, robust to noisy data.

Inductive bias in ID3. Note that H is the power set of the instance space X, so there is no restriction on the hypothesis space. The bias is a preference for trees with high-IG attributes near the root, i.e. an attempt to find the shortest tree. This bias is a preference for some hypotheses (search bias), rather than a restriction of the hypothesis space H (language bias). Occam's razor: prefer the shortest hypothesis that fits the data.

Occam's razor: only the idea is given here, with no detailed discussion. For more information: Domingos, "The role of Occam's razor in knowledge discovery," Data Mining and Knowledge Discovery, 3(4), 1999.

Decision tree (outline so far): introduction and basic concepts; the ID3 algorithm as an example (algorithm description, feature selection, stop conditions); the inductive bias of ID3; over-fitting and pruning.

What's over-fitting? (Figure with panels a, b, c on the slide.) A hypothesis h in H overfits the training data if there is an alternative h' in H such that

\[ err_{train}(h) < err_{train}(h') \quad \text{but} \quad err_{test}(h) > err_{test}(h') \]

An example of over-fitting in decision trees: each leaf corresponds to a single training point, and the full tree is merely a convenient implementation of a lookup table.

Over-fitting in decision tree learning (figure on the slide).

Two ways of avoiding over-fitting for decision trees:
I. Stop growing when a data split is not statistically significant (pre-pruning).
II. Grow the full tree, then post-prune.
Option I avoids over-fitting directly, but it is hard to estimate the right size of the tree.

Type I. Pre-pruning: when to stop splitting (1), number of instances. Frequently, a node is not split further if the number of training instances reaching it is smaller than a certain percentage of the training set (e.g. 5%), regardless of the impurity or error. Any decision based on too few instances causes variance and thus generalization error.

Type I. Pre-pruning: when to stop splitting (2), threshold on the information gain value. Set a small threshold beta and stop splitting if Δi(s) ≤ β.
- Benefits: uses all the training data; leaf nodes can lie at different levels of the tree.
- Drawback: it is difficult to set a good threshold.
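A small Python sketch of these two pre-pruning checks (my own illustration; the 5% fraction and the beta value are example settings, not prescribed by the slides):

```python
def should_stop(n_node, n_total, best_gain, min_fraction=0.05, beta=0.01):
    """Pre-pruning: return True if this node should not be split further."""
    # (1) Too few instances reach the node: any split would have high variance
    if n_node < min_fraction * n_total:
        return True
    # (2) The best achievable impurity reduction is below the threshold beta
    if best_gain <= beta:
        return True
    return False

# A node holding 6 of 200 training instances is not split further
print(should_stop(n_node=6, n_total=200, best_gain=0.3))   # -> True
```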

Two ways of avoiding over-fitting (continued). For Option II, how do we select the "best" tree?
- Measure performance over the training data (statistical pruning): confidence level (to be introduced in the mid-term CL Theory part I).
- Measure performance over a separate validation data set.
- MDL (Minimum Description Length): minimize size(tree) + size(misclassifications(tree)).

Type II. Post-pruning (1): Reduced-error pruning.
Split the data into a training set and a validation set. The validation set has known labels and is used only to test performance; no model updates are made during this test!
Do until further pruning is harmful:
- Evaluate the impact on the validation set of pruning each possible node (plus the subtree it roots).
- Greedily remove the one whose removal most improves validation-set accuracy.
How should the label of the new leaf node be assigned?

Supplement: strategies for labeling the new leaf node after pruning:
- Assign the most common class.
- Give the node multiple class labels: each class gets a support degree (based on the number of training examples with each label); at test time, select one class with that probability, or select multiple classes.
- For a regression tree (numeric labels), use the average or a weighted average.
- These statistics can also be computed on the validation data (during pruning).

Effect of reduced-error pruning (figure on the slide; to be continued).
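Below is a rough Python sketch of the greedy reduced-error pruning loop, assuming the nested-dict tree format of the earlier `grow_tree` sketch; all helper names here are my own, and a real implementation differs in details:

```python
import copy
from collections import Counter

def classify(tree, x):
    """Follow attribute tests until a leaf label is reached (None if value unseen)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(x[attr])
        if tree is None:
            return None
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def internal_paths(tree, prefix=()):
    """Yield the (attr, value) path to every internal node of the tree."""
    if isinstance(tree, dict):
        yield prefix
        attr = next(iter(tree))
        for value, child in tree[attr].items():
            yield from internal_paths(child, prefix + ((attr, value),))

def pruned_copy(tree, path, leaf_label):
    """Copy of the tree with the node at `path` replaced by a leaf."""
    if not path:
        return leaf_label
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for attr, value in path[:-1]:
        node = node[attr][value]
    attr, value = path[-1]
    node[attr][value] = leaf_label
    return new_tree

def reduced_error_prune(tree, train, validation):
    """Greedily replace subtrees by majority-class leaves while validation
    accuracy does not decrease ("do until further pruning is harmful")."""
    while True:
        base = accuracy(tree, validation)
        best = None
        for path in internal_paths(tree):
            # Training examples routed to this node decide the leaf label
            reached = [(x, y) for x, y in train
                       if all(x[a] == v for a, v in path)]
            if not reached:
                continue
            majority = Counter(y for _, y in reached).most_common(1)[0][0]
            candidate = pruned_copy(tree, path, majority)
            acc = accuracy(candidate, validation)
            if acc >= base and (best is None or acc > best[0]):
                best = (acc, candidate)
        if best is None:
            return tree
        tree = best[1]
```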

Type II. Post-pruning (2): Rule post-pruning.
1. Convert the tree to an equivalent set of rules, e.g. if (Outlook = Sunny) AND (Humidity = High) then PlayTennis = No.
2. Prune each rule by removing any preconditions whose removal improves its estimated accuracy, i.e. consider dropping (Outlook = Sunny) or (Humidity = High).
3. Sort the rules into the desired sequence (by their estimated accuracy).
4. Use the final rules in that sequence when classifying instances. (After the rules are pruned, it may no longer be possible to write them back as a tree.)
This is one of the most frequently used methods, e.g. in C4.5.

Why convert the decision tree to rules before pruning?
- Pruning decisions become independent of context. Otherwise, if the tree itself were pruned, there would be only two choices for a node: remove it completely, or retain it.
- There is no difference between root nodes and leaf nodes: each precondition can be considered for pruning on its own.
