NSSpeechRecognizer with wildcard command elements

Posted 2024-12-27 03:56:24 | 271 characters | 3 views | 0 comments

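The NSSpeechRecognizer documentation states that complex, multi-step actions can be performed with a single spoken command, for example:

"Schedule a meeting with Adam and John tomorrow at ten."

I am able to execute simple, pre-programmed commands, but I have no idea how to interpret a phrase like the one above using this class. It seems as though

"schedule * with * *"

should be a command. Any idea whether such a thing is possible? Or are we just supposed to pass an infinite number of possible commands to the recognizer?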

且行且努力 2025-01-03 03:56:24

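From the NSSpeechRecognizer documentation, it does not appear to support complex phrases like the example you gave. To extract semantics from a phrase like that, you would need a recognizer that supports multi-slot grammars, as most IVR systems that support the VoiceXML standard do. As far as I can tell, this speech recognition API only accepts simple commands passed in as an array and does not let you specify complex grammar rules. With this type of system, you have to implement what is called a directed dialog, which might go something like this: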

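C: What would you like to do?

U: Schedule a meeting.

C: Tell me the first person you would like to attend.

U: Adam.

C: Tell me the next person who will attend, or say "done" if the attendee list is complete.

U: John.

C: Tell me the next person who will attend, or say "done" if the attendee list is complete.

U: Done.

C: What day is the meeting?

U: Tomorrow.

C: What time is the meeting?

U: Ten o'clock.

C: Thank you. Your meeting has been scheduled.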

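With a directed dialog, you can restrict the expected commands/utterances at each step to a much smaller, well-defined list. Your list of possible names could still get very large, though, unless you draw them from the user's contact list.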
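To make the directed-dialog idea concrete, here is a minimal sketch of it as a small state machine. It is written in Python purely to illustrate the logic and is not tied to any speech API. The class name `MeetingDialog`, the state names, and the sample day/time vocabularies are all assumptions made up for this example. In a real macOS app, the list returned by `expected_commands()` is what you would assign to the recognizer's `commands` property at each step, and `handle()` would be called from the `speechRecognizer(_:didRecognizeCommand:)` delegate callback.

```python
class MeetingDialog:
    """Hypothetical directed dialog that fills one slot per recognized phrase."""

    # Sample vocabularies; a real app would offer more options.
    DAYS = ["today", "tomorrow"]
    TIMES = ["nine o'clock", "ten o'clock", "eleven o'clock"]

    def __init__(self, known_names):
        self.known_names = list(known_names)  # e.g. taken from the contact list
        self.attendees = []
        self.day = None
        self.time = None
        self.state = "action"

    def expected_commands(self):
        """Phrases the recognizer should listen for in the current state."""
        remaining = [n for n in self.known_names if n not in self.attendees]
        return {
            "action": ["schedule a meeting"],
            "first_attendee": remaining,
            "next_attendee": remaining + ["done"],
            "day": self.DAYS,
            "time": self.TIMES,
            "finished": [],
        }[self.state]

    def handle(self, phrase):
        """Advance the dialog; return False if the phrase was not expected."""
        if phrase not in self.expected_commands():
            return False
        if self.state == "action":
            self.state = "first_attendee"
        elif self.state in ("first_attendee", "next_attendee"):
            if phrase == "done":
                self.state = "day"
            else:
                self.attendees.append(phrase)
                self.state = "next_attendee"
        elif self.state == "day":
            self.day = phrase
            self.state = "time"
        elif self.state == "time":
            self.time = phrase
            self.state = "finished"
        return True


dialog = MeetingDialog(["Adam", "John", "Mary"])
for spoken in ["schedule a meeting", "Adam", "John", "done",
               "tomorrow", "ten o'clock"]:
    dialog.handle(spoken)
print(dialog.attendees, dialog.day, dialog.time)
```

The point of the design is that the active vocabulary at any moment is only the handful of phrases valid for the current question, which is what keeps a command-list recognizer workable.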

晨曦÷微暖 2025-01-03 03:56:24

My interpretation of the docs is that you would need to accumulate and operate on any compound state on your own. You provide NSSpeechRecognizer with a set of discrete words/phrases that it should recognize as 'commands', and it reports to you when it has recognized them.

For the example you've given, I think you'll run into problems when you get to the "Adam and John" part -- it's not an arbitrary dictation engine. But, for fun, let's try to imagine how we might do this:

You might tell it you want to recognize the following phrases as 'commands':

"schedule a"
"meeting" (and perhaps "appointment", "playdate", etc)
"with"
"Adam and John"
"tomorrow" (and probably other related things like "today", "two days from now", all the days of the week, etc)
"ten o'clock"

As words/phrases are recognized, you could create a stack of semantically related words/phrases based on previously recognized words/phrases. So, for instance, it recognizes the "schedule a" phrase, and you know that there should be more info coming to fill out the semantic context, so you push that phrase onto the stack. Next, it recognizes "meeting". Your app says 'sure, a meeting is something that can be scheduled' and pushes it onto the stack as well. If the next word it recognized wasn't germane to the previously-recognized "schedule a" command, then it would clear the stack. If, at any point, the elements on the stack satisfy some pre-defined criteria for a fully formed expression of semantic intent, then your app can take the appropriate action based on that intent. There's obviously a temporal element to this as well. If the next thing required to establish semantic context doesn't arrive in a reasonable amount of time, the semantic context stack should get cleared.
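The accumulation scheme described above can be sketched roughly as follows. This is only an illustration in Python of the stack-plus-timeout logic, not an NSSpeechRecognizer integration. The class name `IntentAccumulator`, the slot names, the fixed slot order, and the 3-second timeout are assumptions chosen for the example. Recognized phrases would be passed to `feed()` from wherever your app receives them (for NSSpeechRecognizer, the `speechRecognizer(_:didRecognizeCommand:)` delegate method).

```python
import time


class IntentAccumulator:
    """Hypothetical stack of recognized phrases that builds up one intent."""

    # Slots in the order they are expected, with the phrases allowed in each.
    GRAMMAR = [
        ("verb", {"schedule a"}),
        ("event", {"meeting", "appointment", "playdate"}),
        ("with", {"with"}),
        ("people", {"Adam and John"}),
        ("day", {"today", "tomorrow"}),
        ("time", {"ten o'clock"}),
    ]

    def __init__(self, timeout=3.0, clock=time.monotonic):
        self.timeout = timeout    # seconds allowed between phrases
        self.clock = clock        # injectable so the logic can be tested
        self.stack = []
        self.last_time = None

    def commands(self):
        """Flat list of every phrase to register with the recognizer."""
        return sorted(p for _, phrases in self.GRAMMAR for p in phrases)

    def feed(self, phrase):
        """Push a recognized phrase; return the finished intent or None."""
        now = self.clock()
        if self.last_time is not None and now - self.last_time > self.timeout:
            self.stack.clear()    # context went stale
        self.last_time = now

        slot, allowed = self.GRAMMAR[len(self.stack)]
        if phrase not in allowed:
            # Not germane to what came before: drop the context, but let
            # the phrase start a new command if it is a valid first slot.
            self.stack.clear()
            slot, allowed = self.GRAMMAR[0]
            if phrase not in allowed:
                return None

        self.stack.append((slot, phrase))
        if len(self.stack) == len(self.GRAMMAR):
            intent = dict(self.stack)
            self.stack.clear()
            return intent
        return None
```

Feeding it "schedule a", "meeting", "with", "Adam and John", "tomorrow", "ten o'clock" in quick succession returns a dictionary with all six slots filled on the last call. An out-of-order phrase or a long pause resets the stack.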

A similar system, conceptually, is the iOS/macOS touch/trackpad gesture recognition system. When a tap touch happens, the OS has to recognize the single tap, and acknowledge the possibility that that is the entire user intent, but it also has to manage the possibility that it might receive another tap very shortly, turning the single tap into a double tap. It has to accumulate this state over time, and infer the user intent by looking at the combination of discrete events.

You're not going to get such functionality from NSSpeechRecognizer for free, and since it's not a dictation engine, you also won't get arbitrary 'tokens' from it (like "Adam and John", assuming you're not registering some giant list of names all as potential commands). Even so, that doesn't mean it couldn't be leveraged to do some pretty neat stuff using a mechanism like I described. It's just that you're gonna have to write it yourself.
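To give a feel for how quickly that "giant list of names" grows, here is a small Python sketch that enumerates every spoken form of one or two attendees joined by "and". The function name `attendee_phrases` and the sample contact list are made up for this illustration. With n names and at most two attendees, the count is n + n(n-1) = n², so 100 contacts already means 10,000 phrases for the attendee slot alone.

```python
from itertools import permutations


def attendee_phrases(names, max_people=2):
    """Every ordered way to say 1..max_people distinct names joined by 'and'."""
    phrases = []
    for k in range(1, max_people + 1):
        for combo in permutations(names, k):
            phrases.append(" and ".join(combo))
    return phrases


contacts = ["Adam", "John", "Mary"]  # hypothetical contact list
phrases = attendee_phrases(contacts)
print(len(phrases))                  # 3 single names + 6 ordered pairs
print("Adam and John" in phrases)
```

Allowing three or more attendees per phrase makes the list grow even faster, which is why accumulating one name at a time is the more practical route.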

Good luck!
