Usability: Speech recognition vs. keyboard
We are seeing more and more speech recognition being implemented, along with requests for libraries that do good speech recognition. What's the rationale (in terms of usability) behind it versus a keyboard or keypad? What reasons would you have to invest in this development?
For example, let's take call centers. A few years ago, almost every call center used an IVR that prompted for a key press in its menus. Now, we're seeing more and more menus that prompt for a spoken keyword and/or a keypad press: "please say invoice or press 1 to see your invoice". Or we are seeing the same thing in companies' phone directories: "please say the name of the person you are trying to reach" ... "Franck Loyd" ... "Did you say Jack Freud? Please say yes if you want to reach this person or say no to try again".
I guess it's a plus when you're in your car and not holding your phone, but is it worth the additional waiting time? Longer interaction for all the choices, longer prompt time while the system tries to work out whether something was said, and so on? Also, reliability is definitely better than it was, but sometimes it feels more like a toy someone decided to plug into the system so it can feel futuristic.
Any experience designing IVR or software that used (or chose not to) speech recognition?
Thanks!
Comments (5)
Usability is a very broad term. If I were to attempt to enter my address with a touch pad, it wouldn't be considered very usable. Some argue that using a speech engine with an overall success rate of 70-80% isn't very usable either. As indicated in other posts, hands-free input can be much easier for those on a mobile phone. However, using words instead of numeric input can actually be less intuitive than a touch-tone phone if the topic is somewhat foreign to the caller. A caller hearing terms and phrases that aren't very familiar can't remember them over the 10-30 seconds of the prompt, but they can hover a finger over the best-sounding choice or remember the order of the choices.
This is an odd question. Usually the decision to use speech or not in an IVR environment is not driven from the development view of the world. Unless you have a specific requirement that really requires speech, you are almost always reducing overall success rates. Speech is usually a factor of corporate image ... or having the latest technological toy.
Speech recognition latencies aren't very high these days when using modern ASRs. In most cases, input is processed in parallel with the speech, and the time between the end of speech and the recognition result is 0.5 to 1 s. Be aware that many IVRs then need to perform data look-ups after some inputs, and this can make the system appear slower. Normal inputs pushing beyond 1 s are usually the sign of an under-powered deployment.
It may not have been under-powered when originally implemented, but through tuning efforts you make a lot of performance-versus-accuracy decisions. To get that next 0.1%, resources can be pushed beyond what they should be at peak.
In general, yes. On the reliability note, you really need to look at the overall numbers to get a sense of the system. It is a battle of statistics where the individual isn't very important (unless they hold the title of VP or above). Through optimization of the input (shifting prompting), resource usage and other speech reco tuning parameters, you attempt to maximize accuracy. For basic natural language responses, you can get into the upper 90s. However, your overall success rate is much lower. Imagine 5 prompts all at 98% (in reality, you tend to have a bunch at 99 and then a few in the mid 90s or slightly below): .98 * .98 * .98 * .98 * .98 ≈ 90%. That means 1 out of 10 calls failing. That is before caller confusion and business rules. DTMF input is usually very near 100%, even after several inputs.
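The compounding above is easy to check with a few lines of Python. This is only a sketch: it assumes each prompt succeeds or fails independently of the others, and the "mixed" list of rates is a made-up illustration of "a bunch at 99 and a few in the mid 90s", not measured data.

```python
from math import prod


def overall_success(per_prompt_rates):
    """Chance that every prompt in a call is recognized correctly,
    assuming the prompts succeed or fail independently."""
    return prod(per_prompt_rates)


# Five prompts, all at 98%, as in the example above.
speech_uniform = overall_success([0.98] * 5)

# Hypothetical mix: three prompts at 99%, two in the mid 90s.
speech_mixed = overall_success([0.99, 0.99, 0.99, 0.95, 0.94])

print(f"five prompts at 98%: {speech_uniform:.1%}")
print(f"hypothetical mixed rates: {speech_mixed:.1%}")
```

Under the same independence assumption, any input method whose per-prompt rate is very close to 100% stays close to 100% after five prompts, which is the point being made about DTMF.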
Pros:
Cons:
I think that speech recognition, like any method of input, has its pros and cons.
Pros
Cons
In some cases a company is required to handle rotary phones. It might turn out to be more cost-effective to just set up the recognition system instead of both.
Voice recognition has a lot more overhead than touch tones. If you want the best results, you need to constantly tweak the app and train the system on unrecognized word pronunciations. You also need to be very particular about how you prompt the user with voice recognition, or you may get unexpected responses.
Overall, touch tone is a lot easier, as there is only a limited set of possible options at any given time.
If your app is straightforward enough, voice rec may only complicate it. Press 2 for some other language..
Speech recognition is definitely the wave of the future when combined with touchscreen technology. As an example, I use tazti speech recognition. It's available in XP and Vista versions. Since Microsoft's touchscreen "Surface" platform runs on Vista, I'm sure tazti will work with the touchscreen technology. When I tried tazti speech recognition, the built-in commands worked great. It also lets me create my own speech commands, and those work great too. Voice searching Google, Yahoo, Wikipedia, YouTube and many other search engines works great. It has many other features as well, but it doesn't have dictation. I found that I eliminated 70% or more of my internet-generated clicks.... maybe more. NOTE: Tazti is a free download from their website.
Long story short: calls with the ability to provide speech recognition (especially post-customer service phone surveys) should start the questions by asking the user if they want to answer non-verbal prompts with speech recognition enabled or not.
Even so many years after this was posted, I feel compelled to write here that despite improvements to speech recognition for automated menus on calls, I still wish companies would offer it as a choice first, such as "to reply to the following number-based questions with your voice, say yes or press 1, no or press 2." Maybe that's not the best way to ask, but I'm not paid to create surveys.
This would allow for alternatives for people who want/need speech options, but also flexibility (and likely less backend) for people who don't want to deal with speaking, or whose environment would affect this.
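To make the idea concrete, here is a rough Python sketch of what "ask first, then honor the choice" could look like. Everything in it is hypothetical: `CallerInput`, `wants_speech` and `read_answer` are names invented for illustration and do not come from any real IVR platform, and a real system would get these events from its telephony and ASR layers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallerInput:
    """One caller event (hypothetical shape): a keypad digit,
    a recognized utterance, or neither (silence/noise)."""
    dtmf: Optional[str] = None
    utterance: Optional[str] = None


def wants_speech(first: CallerInput) -> bool:
    """Opt-in question: 'say yes or press 1, no or press 2'.
    Anything other than an explicit yes/1 keeps the call keypad-only."""
    if first.dtmf is not None:
        return first.dtmf == "1"
    return (first.utterance or "").strip().lower() == "yes"


def read_answer(event: CallerInput, speech_enabled: bool) -> Optional[str]:
    """Return the caller's answer, or None if there is nothing to act on."""
    if event.dtmf is not None:
        return event.dtmf  # keypad input is accepted in either mode
    if speech_enabled and event.utterance:
        return event.utterance.strip().lower()
    return None  # speech is off: ignore audio and keep waiting for a key
```

With speech switched off, a sneeze or a noisy room never reaches the recognizer's answer path, so it cannot trigger a re-prompt; the caller who opted in still gets the spoken option.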
There are too many times I've been in a loud space, or have a fussy baby, or just want to multitask with a task that is noisy but I still have my hands available to type numbers. Noise in the environment then interferes with my ability to be heard well. The technology on many of these calls is not advanced enough to be noise-cancelling.
There are also times I think a second before I type a 1-number answer, or am listening to a long prompt ("On a scale of 1 to 10, where 1..."). In the process, a noise I make (a rustling sleeve, a sneeze, throat clearing), or noise around me, triggers the response, "I didn't get that", and repeats the question.
Not the most academic response, but I hope a phone "automated call" company does find this comment and use it in their planning for when to use speech recognition and when to provide a user with options first.