technology_voice.ppt

CS 260: Lecture 10
Professor John Canny
4/23/2023

Speech: the Ultimate Interface?
- In the early days of HCI, people assumed that speech/natural language would be the ultimate UI.
- Use of speech interfaces has grown, but it's still rarely used in the office. Why?

Speech: the Ultimate Interface?
- Why speech hasn't succeeded in the office:
- Affordances of text:
  * Visual scanning (for email or docs)
  * Unambiguity of text
  * Editing of text
- Disadvantages of speech:
  * Noise (call center ambience)
  * Lack of privacy

Speech: the Ultimate Interface?
- Use of speech interfaces has grown, but it's still rarely used in the office.

Computing is Moving
Where are computers these days? Intel's breakdown (based on PC sales):
- Office
- Home
- Mobile (laptops)
- Medical
And as we noted earlier, programmable smartphones will soon outnumber total PCs. Then there are game boxes, cable boxes, Smart TVs etc.

What is a good interface for:
- Mobile computing (walking or driving)?
- Home computing?
- Medical computing?

Where is the industry now?
- After a big slump around 2002, the speech technology/voice interface industry seems to be growing briskly, about 30-40% per year. One current estimate put it at about $2.5 billion.
- It would probably be more visible, except several related industries have overtaken it: outsourced call centers, and VOIP (Voice Over IP).
- The biggest growth has been in the new markets:
  * Cell phones (as a local UI)
  * Medical (e.g. order entry)
  * Voice services over the phone

Industry movement
In January this year, Yahoo acquired a large team of speech engineers from Nuance, the largest speech company (which owns Dragon NaturallySpeaking). Google already had some leading speech researchers. So there is much interest in speech for the portal market.
Aside: there is a division of Nuance devoted to medical speech recognition, and one to call centers.

Industry movement
Heyanita: Voice-based email and messaging
Bevocal: Hosted IVR (Interactive Voice Response) for customers, e.g. MetroPCS
Tellme: Find a business service (including restaurants) using ASR.

Speech: Some background
A speech recognizer consists of 3 stages:
Raw sound -> [Acoustic front end] -> Acoustic features -> [Acoustic model] -> Phonetic features -> [Language/phonetic model] -> Words
A state-of-the-art recognizer requires 50-100 Mflops for continuous speech (no pauses between words). PC continuous speech recognizers appeared in the 1990s and saved many victims of RSI.

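The three-stage data flow on the slide above can be sketched as three functions chained together. This is only a toy illustration of how the stages connect, not a real recognizer: the per-frame energy feature, the loud/quiet "phone" labels, the lexicon, and every function name and threshold below are hypothetical stand-ins chosen to keep the example short.

```python
from typing import Dict, List, Sequence


def acoustic_front_end(raw_sound: Sequence[float], frame_size: int = 4) -> List[float]:
    """Raw sound -> acoustic features (toy feature: mean energy per frame)."""
    frames = [raw_sound[i:i + frame_size] for i in range(0, len(raw_sound), frame_size)]
    return [sum(s * s for s in frame) / len(frame) for frame in frames]


def acoustic_model(features: Sequence[float], threshold: float = 0.1) -> List[str]:
    """Acoustic features -> phonetic features (toy labels: 'V' = loud frame, '-' = quiet)."""
    return ["V" if energy > threshold else "-" for energy in features]


def language_model(phones: Sequence[str], lexicon: Dict[str, str]) -> List[str]:
    """Phonetic features -> words: split on quiet frames, look each chunk up in a lexicon."""
    chunks = "".join(phones).split("-")
    return [lexicon[chunk] for chunk in chunks if chunk in lexicon]


def recognize(raw_sound: Sequence[float], lexicon: Dict[str, str]) -> List[str]:
    """Chain the three stages in the order shown on the slide."""
    return language_model(acoustic_model(acoustic_front_end(raw_sound)), lexicon)


# Hypothetical lexicon: one loud frame means "yes", two in a row mean "hello".
TOY_LEXICON = {"V": "yes", "VV": "hello"}
signal = [0.0] * 4 + [0.5] * 4 + [0.0] * 4 + [0.5] * 8
print(recognize(signal, TOY_LEXICON))
```

Only the last function knows anything about what users might say; swapping in a different lexicon changes the recognizable vocabulary without touching the first two stages.
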
Speech: Some background
The first two stages are standard. The last is not, and has a big impact on performance.
The last box encodes knowledge of what users might say, either as a grammar, or as a statistical language model (LM).
Grammars are suitable for small recognition tasks with well-known command languages.
Raw sound -> [Acoustic front end] -> Acoustic features -> [Acoustic model] -> Phonetic features -> [Language/phonetic model] -> Words

Speech UIs
- Most implement a finite-state machine.
- At each state, the system can recognize various speech segments to take you to the next state(s).
- A segment may be anything from a single word to a complete utterance.
- The system can also make utterances of its own at various states.
- You can specify them using regular expressions, or using VoiceXML.

Speech on phones
Speech recognition is faster and more accurate if you limit the vocabulary to a few dozen words.
Small-vocabulary speech recognition has been common on phones for the last few years:
- Call a number
- Call a name (from your contacts)
What about large vocabulary, continuous speech?

This year's Smartphone (free with service contract)
- 150-200 MHz ARM processor
- 32 MB RAM
- 2 GB flash (not included)
A Windows-98 PC that boots quickly!
Plus:
- Camera
- AGPS (Qualcomm/Snaptrack)
- DSP cores, OpenGL GPU
- EV-DO (300 kb/s), Bluetooth
Processing power: 200 MIPS

Speech on phones
This is just the right power for high-performance speech recognition.
Large-vocabulary speech recognition (not continuous) appeared on phones last year: Samsung P207.
LVCSR (Large-Vocabulary Continuous Speech Recognition) should be available this year.

Speech in the home
Good speech recognition used to require careful microphone placement and a worn headset.

Speech in the home
New microphones: array mics with built-in DSPs allow recognition at greater range (several feet).
Users don't have to wear microphones any more to use speech.

Speech in the home
Apart from CPU and memory (which are shrinking), speech recognition requires only a microphone and perhaps a speaker. It is power and size efficient.
In a few years, it will probably be possible to build speech recognition into Bluetooth microphones, or other small devices.
Compare with other interfaces.

Ten Guidelines for Speech Interfaces
1. You can't design what you can't define
2. Use user-centered design techniques
3. Use the right technology, and use technology right
4. Leverage the language instinct
5. Establish success criteria and test against them
6. Branding in VUI is more than just a pretty voice
7. How you say it is as important as what you say
8. Don't block the exit
9. Take care with error handling
10. Establish a change process

1. You can't design what you can't define
- Consider the task(s) that your users want to do, i.e. start with standard task analysis.
- What conceptual model do they have (use contextual inquiry)?
- What language do they use to refer to it?
- Use recordings during contextual inquiry/task analysis.

2. Use user-centered design techniques
- Great to see this advice in a trade publication. You know a lot about this:
- Study the real use context: especially important for mobile devices, medical, home etc.
- Perform needs analysis: what kinds of service might the system provide, and how valuable are they?
- Develop personae to guide your design.
- Once again, study users' conceptual models.

3. Use the right technology, and use technology right
- In a speech interface, you have a choice between synthesized and recorded speech for output.
- In designing the recognizer, language models will generally give better results for routing a broad range of user questions.
- Using technology right: speech recognizers are fussy animals. They use many parameters to trade off performance and accuracy. You have to experiment with these in order to understand them.

4. Leverage the Language Instinct
Make a voice UI resemble natural speech:
- Use familiar phrasing
- Don't mimic written language
- Use conversational style (pronouns, acknowledgements, transition words)
- Use realistic prosody (pitch etc.) in TTS
- Enable callers to speak over and interrupt the TTS system

5. Establish Success Criteria and Test Against Them
- Standard tests: recognition accuracy, speed, CPU
- Dialog traversal tests: capture many conversations and plot the paths through your dialog hierarchy that users took.
- Usability testing
- Early rapid prototyping: WOZ (Wizard of Oz) testing
- Define “call success” in a sensible way, and track it!

6. Branding is more than a pretty voice
- Users make strong attributions about a human speaker (personality, education, demographics). They do the same with speech interfaces (whether you intend it or not).
- Design of a voice UI is as significant as design of an attractive web site. A “robot” voice UI is like a 12-point text-only web site.
- The voice interface's brand perception is a combination of prosody and language, just like a real speaker's. Design both explicitly.

7. How you say it is as important as what you say
Mostly about speech constructed from recorded voice.
- For “natural” speech, you need to think about the context of each word in real speech.
- Pronunciation actually changes when words are connected together (this is co-articulation).
- Ideally, you would include appropriate context information in each recording (e.g. the number “one” followed by a “t” consonant).

8. Don't block the exit
- Make sure users can exit the automated system and reach a live person.
- If you make it hard, they will get there anyway, and be angry when they do.
- Providing feedback can help (e.g. “The estimated time to reach a representative is ... Do you wish to return to the automated system?”).
- Make sure you transfer user data from the automated system to the service person's console; it looks really bad if you don't.

9. Take Care with Error Handling
- Most speech dialog systems have internal state (in a state machine) that the user can't see except through what the system says.
- You must treat errors (e.g. unrecognized utterances) very carefully. If you leave the current state, make sure users can understand the state you've gone into.
- Large changes (e.g. backtracking up to the initial state) are extremely frustrating for users.
- If you backtrack, take small steps, only as much as needed.

10. Establish a Change Process
- Speech UIs are very complex, and very sensitive to some small changes (esp. in the recognizer).
- Make sure you manage changes to the system, especially low-level changes. They should be discouraged once the system is deployed.
- Establish “regression tests”: representative speech segments that the system should always process successfully, and check them.
- Always keep several working generations of the system.

The State of the Art
A few services represent the state of the art:
- United Airlines flight information at 1-800-864-8331, then 1
- Fedex package rates: 1-800-463-3339
- Tellme: 1-800-555-1212, or 411 from Cingular wireless or Verizon landlines
- Wildfire: speech phone services, voice dialing, messages, etc.: 1-800-WILDFIRE
- Also: Schwab, Wachovia, E-trade, B-of-A, Fidelity

Conversational Speech
- CLERK: Make of car?
- DRIVER: Uh, Mercedes.
- CLERK: Model year?
- DRIVER: It's a 1970.
- CLERK: Color?
Compare with this version:
- CLERK: What's the make of your car?
- DRIVER: Uh, Mercedes.
- CLERK: OK. And the model year?
- DRIVER: It's a 1970.
- CLERK: Got it. What's the color?

Conversational Speech
- The second version is both more polite and more usable.
- System status (i.e. that it understood the user's responses) is always clear.
- Design of speech “character” should include normal human styling (politeness) but not excessive anthropomorphism.
- In particular, the system should never suggest capabilities it does not have.

Conversational Speech Example
System: This is the delivery tracking center. Tell me your four-digit delivery number or enter it on the keypad.
Caller: 4-8-3-3
System: 4-8-3-3. Is that right?
Caller: Yes.
System: OK, hold on... (logs into system) What's your status? You can say arrived, departed or delayed.
Caller: I'll be delayed two days. There's a big storm.
System: Oh, sorry to hear that! Let me confirm. I have delivery number 4-8-3-3 delayed for 48 hours due to weather. Is that right?
Caller: Yes it is.
System: Great. Hold on... OK. It's in the system. Hopefully you'll be on your way soon. I'll talk to you when you arrive. Drive safely.

Conversational Speech
- Very good usability is possible through clever design.
- It does not all depend on raw recognizer accuracy.
- Careful design includes appropriate personality, giving enough flexibility to the user, and responding to errors carefully.

What's happening now
- Over the last half-dozen years, speech interfaces have gotten a lot better.
- Most of the improvement seems to be due to improvements in method, i.e. iterative design, and heuristic guidelines like the ones just presented.
- The field is a lot more interdisciplinary than it used to be, including speech engineers, UI designers and linguists.

The Future: Context-Awareness
- Speech interfaces are rather limited today because they either rely on tightly constrained utterances, or on coarse language models.
- In many cases, especially for mobile phones, the context of use (time, location, meta-data on the phone) puts a lot of constraint on what users might do.
- Current research is using context data to improve recognition all the way down. Instead of general language models in the recognizer, you can “push down” context information into it. The recognizer can still recognize anything, but it will do better with more likely utterances.

Summary
- Speech seems like a very good option for future computing environments.
- Small devices can support speech interfaces, and microphone technology is getting better.
- Speech UI design requires many of the same principles as general UI design, especially:
  * Visibility of system status
  * User control and freedom
  * Helping users recognize and recover from errors
- Application of these principles leads to highly usable designs.
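
The three principles in the summary map directly onto the finite-state view of a speech UI from the Speech UIs slide. The sketch below is a minimal illustration under stated assumptions: the dialog states, prompts, and the `run_dialog` helper are hypothetical (loosely modeled on the delivery-tracking example), and the "recognizer" is replaced by a list of already-recognized words.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class State:
    prompt: str                                   # what the system says on entering the state
    transitions: Dict[str, str] = field(default_factory=dict)  # recognized word -> next state


# Hypothetical dialog; state names and wording are illustrative only.
DIALOG: Dict[str, State] = {
    "ask_status": State(
        "What's your status? You can say arrived, departed or delayed.",
        {"arrived": "confirm", "departed": "confirm", "delayed": "confirm"},
    ),
    "confirm": State("Is that right?", {"yes": "done", "no": "ask_status"}),
    "done": State("OK. It's in the system."),
    "operator": State("Transferring you to a representative."),
}


def run_dialog(utterances: List[str], start: str = "ask_status") -> List[str]:
    """Feed recognized utterances through the state machine; return what the system says."""
    state = start
    spoken = [DIALOG[state].prompt]
    for heard in utterances:
        if heard == "operator":
            # User control and freedom: the exit is reachable from every state.
            state = "operator"
        elif heard in DIALOG[state].transitions:
            state = DIALOG[state].transitions[heard]
        else:
            # Error recovery: stay in the current state instead of backtracking.
            spoken.append("Sorry, I didn't catch that.")
        # Visibility of system status: always say where the dialog is now.
        spoken.append(DIALOG[state].prompt)
        if not DIALOG[state].transitions:  # terminal state reached
            break
    return spoken


for line in run_dialog(["mumble", "delayed", "yes"]):
    print(line)
```

An unrecognized utterance repeats the current prompt rather than resetting the dialog, which is the smallest possible backtracking step; saying "no" at the confirmation steps back one state only.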

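The “push down” idea from the Context-Awareness slide can be illustrated with a simple rescoring step. The slides do not say how the research systems do this, so the following is an assumption: each candidate's acoustic likelihood is combined with a context-dependent prior in log space, and candidates missing from the context get a small floor probability rather than zero. The function name, the floor value, and the example numbers are all hypothetical.

```python
import math
from typing import Dict


def rescore(acoustic: Dict[str, float],
            context_prior: Dict[str, float],
            floor: float = 1e-3) -> str:
    """Return the hypothesis with the best log(acoustic likelihood) + log(context prior).

    Hypotheses absent from context_prior get `floor` instead of zero, so the
    recognizer can still output anything; context only shifts the odds.
    """
    def score(hypothesis: str) -> float:
        prior = context_prior.get(hypothesis, floor)
        return math.log(acoustic[hypothesis]) + math.log(prior)

    return max(acoustic, key=score)


# Made-up acoustic likelihoods for three confusable utterances.
acoustic_scores = {"call Tom": 0.40, "call Mom": 0.35, "coal bomb": 0.25}

# Hypothetical context: the phone's contact list contains "Mom" but not "Tom".
contacts_prior = {"call Mom": 0.5}

print(rescore(acoustic_scores, {}))              # no context available
print(rescore(acoustic_scores, contacts_prior))  # context pushed down
```

With no context the acoustically strongest candidate wins; with the contact-list prior the slightly weaker but far more plausible candidate overtakes it. Because of the floor, an out-of-context utterance with overwhelming acoustic evidence can still be selected.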