Using SAPI to issue commands distinct from normal speech

Posted 2024-12-13 07:35:40

I'm working on a personal project involving microphones in my apartment that I can issue verbal commands to. To accomplish this, I've been using the Microsoft Speech API, specifically SpeechRecognitionEngine from System.Speech.Recognition in C#. I construct a grammar as follows:

// validCommands is a Choices object containing all valid command strings
// recognizer is a SpeechRecognitionEngine
GrammarBuilder builder = new GrammarBuilder(recognitionSystemName);
builder.Append(validCommands);
recognizer.SetInputToDefaultAudioDevice();
recognizer.LoadGrammar(new Grammar(builder));
recognizer.RecognizeAsync(RecognizeMode.Multiple);

// etc ...

This seems to work pretty well for the case when I actually give it a command. It hasn't misidentified one of my commands yet. Unfortunately, it also tends to pick up random talking as commands! I've tried to ameliorate this by prefacing the command Choices object with a "name" (recognitionSystemName), which I address the system as. Oddly, this doesn't seem to help. I am restricting it to a set of predetermined command phrases, so I would have thought that it would be able to detect if speech wasn't any of the strings. My best guess is that it's assuming that all sound is a command and picking the best match from the command set. Any advice on improving this system so that it no longer triggers off of conversation not directed at it would be very helpful.

Edit: I've moved the name recognizer to a separate SpeechRecognitionEngine, but the accuracy is awful. Here's a bit of test code I wrote to examine the accuracy:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Speech.Recognition;

namespace RecognitionAccuracyTest
{
    class RecognitionAccuracyTest
    {
        static int recogcount;
        [STAThread]
        static void Main()
        {
            recogcount = 0;
            System.Console.WriteLine("Beginning speech recognition accuracy test.");

            SpeechRecognitionEngine recognizer;
            recognizer = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.LoadGrammar(new Grammar(new GrammarBuilder("Octavian")));
            recognizer.SpeechHypothesized += new EventHandler<SpeechHypothesizedEventArgs>(recognizer_SpeechHypothesized);
            recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
            recognizer.RecognizeAsync(RecognizeMode.Multiple);

            // Block until Enter is pressed instead of busy-waiting on the CPU.
            System.Console.ReadLine();
        }

        static void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            System.Console.WriteLine("Recognized @ " + e.Result.Confidence);
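            try
            {
                if (e.Result.Audio != null)
                {
                    // Save the audio that triggered this recognition so it can be reviewed later.
                    using (System.IO.FileStream stream = new System.IO.FileStream("audio" + ++recogcount + ".wav", System.IO.FileMode.Create))
                    {
                        e.Result.Audio.WriteToWaveStream(stream);
                    }
                }
            }
            catch (Exception ex)
            {
                // Report the failure instead of silently swallowing it.
                System.Console.WriteLine("Could not save audio: " + ex.Message);
            }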
        }

        static void recognizer_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            System.Console.WriteLine("Hypothesized @ " + e.Result.Confidence);
        }
    }
}

If the name is "Octavian", it recognizes stuff like "Octopus", "Octagon", "Volkswagen", and "Wow, really?". I can clearly hear the difference in the associated audio clips. Any ideas on making this not awful would be great.

暮年 2024-12-20 07:35:40

Let me make sure I understand: you want a phrase that sets commands to the system apart, like "butler" or "Siri". So you'll say "Butler, turn on TV". You can build this into your grammar.

Here is an example of a simple grammar that requires an opening phrase before it recognizes a command. It uses semantic results to help you understand what was said. In this case, the user must say "Open", "Please open", or "Can you open", followed by the name of a browser.

    private Grammar CreateTestGrammar()
    {
        // item
        Choices item = new Choices();
        SemanticResultValue itemSRV;
        itemSRV = new SemanticResultValue("I E", "explorer");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("explorer", "explorer");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("firefox", "firefox");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("mozilla", "firefox");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("chrome", "chrome");
        item.Add(itemSRV);
        itemSRV = new SemanticResultValue("google chrome", "chrome");
        item.Add(itemSRV);
        SemanticResultKey itemSemKey = new SemanticResultKey("item", item);

        //build the permutations of choices...
        GrammarBuilder gb = new GrammarBuilder();
        gb.Append(itemSemKey);


        //now build the complete pattern...
        GrammarBuilder itemRequest = new GrammarBuilder();
        //preamble: "Can you open", "Open", or "Please open"
        itemRequest.Append(new Choices("Can you open", "Open", "Please open"));

        itemRequest.Append(gb);

        Grammar TestGrammar = new Grammar(itemRequest);
        return TestGrammar;
    }

You can then process the speech with something like:

RecognitionResult result = myRecognizer.Recognize();

and check for semantic results like:

if (result != null && result.Semantics.ContainsKey("item")) // Recognize() can return null when nothing was recognized
{
   string s = (string)result.Semantics["item"].Value;
}

夜空下最亮的亮点 2024-12-20 07:35:40

I have the same problem.
I'm using the Microsoft Speech Platform, so things like accuracy may differ slightly.

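I use Claire as the wake-up command, but it does recognize other words as Claire too. The problem is that the engine hears you speak and searches for the closest match.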

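I haven't found a really good solution.
You could try filtering the recognized speech using the Confidence field, but with the recognizer engine I chose it isn't very reliable.
I simply put every word I want to recognize into one big SRGS.xml and set the repeat value to 0-. I only accept recognized sentences in which Claire is the first word.
This isn't the solution I wanted, since it doesn't work as well as I had hoped, but it is still a small improvement.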
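
The two filters described above (a confidence cutoff and "the wake word must come first") could be combined in a small helper like the sketch below. The class name CommandFilter, the method name ShouldAccept, and the idea of passing the threshold as a parameter are assumptions for illustration only; they are not part of System.Speech or the Microsoft Speech Platform, and a suitable threshold would have to be found by experiment.

```csharp
using System;

static class CommandFilter
{
    // Hypothetical helper: returns true only when the recognized text is
    // confident enough AND begins with the wake word as a whole word.
    public static bool ShouldAccept(string text, float confidence,
                                    string wakeWord, float minConfidence)
    {
        // Reject empty results and anything below the confidence cutoff.
        if (string.IsNullOrWhiteSpace(text) || confidence < minConfidence)
        {
            return false;
        }

        string trimmed = text.TrimStart();

        // The wake word must be the first thing in the sentence.
        if (!trimmed.StartsWith(wakeWord, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Require a word boundary so a longer word that merely starts with
        // the wake word (e.g. "Clairemont") is not accepted.
        return trimmed.Length == wakeWord.Length
            || !char.IsLetterOrDigit(trimmed[wakeWord.Length]);
    }
}
```

Inside a SpeechRecognized handler this might be called as CommandFilter.ShouldAccept(e.Result.Text, e.Result.Confidence, "Claire", 0.8f), where 0.8f is an arbitrary starting value rather than a recommended one.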

I'm currently busy with this and will post more information as I make progress.

Edit 1:
In response to what Dims said:
It is possible to add a "garbage" rule to an SRGS grammar. You might want to look into that.
http://www.w3.org/TR/speech-grammar/
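
As a rough sketch of what a garbage rule might look like when the grammar is built from C# rather than written as an XML file, the code below uses the SrgsGrammar classes from System.Speech and the predefined SrgsRuleRef.Garbage reference. The class name WakeWordGrammar, the rule id "command", and the overall shape (optional garbage, then the wake word, then one command) are assumptions for illustration; with the Microsoft Speech Platform the namespace would be Microsoft.Speech.Recognition.SrgsGrammar instead, and whether the garbage rule actually reduces false triggers would need to be tested with the engine in use.

```csharp
using System.Speech.Recognition.SrgsGrammar;

static class WakeWordGrammar
{
    // Hypothetical builder for: [GARBAGE] <wakeWord> <one of the commands>
    public static SrgsDocument Build(string wakeWord, params string[] commands)
    {
        SrgsRule root = new SrgsRule("command");
        root.Scope = SrgsRuleScope.Public;

        // Optional garbage (repeat 0-1) so that unrelated speech before the
        // wake word can be absorbed instead of being forced onto a command.
        root.Add(new SrgsItem(0, 1, SrgsRuleRef.Garbage));

        // The wake word itself, followed by exactly one known command.
        root.Add(new SrgsItem(wakeWord));
        root.Add(new SrgsOneOf(commands));

        SrgsDocument document = new SrgsDocument();
        document.Rules.Add(root);
        document.Root = root;
        return document;
    }
}
```

The resulting document could then be loaded with something like recognizer.LoadGrammar(new Grammar(WakeWordGrammar.Build("Claire", "lights on", "lights off"))), where the command strings are placeholders.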