Best way to censor words - C# 4.0

Posted on 2024-12-11 11:16:22

For my custom-made chat screen I am using the code below to check for censored words, but I wonder whether its performance can be improved. Thank you.

    if (srMessageTemp.IndexOf(" censored1 ") != -1)
        return;
    if (srMessageTemp.IndexOf(" censored2 ") != -1)
        return;
    if (srMessageTemp.IndexOf(" censored3 ") != -1)
        return;

This is C# 4.0. The actual list is much longer, but I am not posting it here because it would be removed.

Comments (6)

峩卟喜欢 2024-12-18 11:16:22

You can simplify it. Here listOfCensoredWords contains all the censored words:

    if (listOfCensoredWords.Any(item => srMessageTemp.Contains(item)))
        return;
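
For context, the one-liner above can be made self-contained and case-insensitive. The sketch below is only an illustration: the class name SimpleCensor, the method ContainsCensoredWord, and the three-word list are assumptions, not part of the original answer. It still scans the message once per censored word, so it is shorter than the chain of if statements but not asymptotically faster.

```csharp
using System;
using System.Linq;

// Hypothetical helper; names and word list are placeholders for illustration.
public static class SimpleCensor
{
    // Assumed word list; a real one would be loaded once at startup.
    private static readonly string[] CensoredWords = { "censored1", "censored2", "censored3" };

    // Returns true if any censored word occurs anywhere in the message, ignoring case.
    public static bool ContainsCensoredWord(string message)
    {
        return CensoredWords.Any(word =>
            message.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

A caller would then write if (SimpleCensor.ContainsCensoredWord(srMessageTemp)) return; in place of the original checks.
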
儭儭莪哋寶赑 2024-12-18 11:16:22

If you want it to be really fast, you can use an Aho-Corasick automaton. This is how antivirus software checks for thousands of virus signatures at once. However, I don't know of a ready-made implementation to point you to, so it will take much more work than simple, slower methods such as regular expressions.

See the theory here: http://en.wikipedia.org/wiki/Aho-Corasick
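
To give a rough idea of the work involved, here is a minimal sketch of an Aho-Corasick matcher. The class name AhoCorasickMatcher and the method ContainsAny are hypothetical, and the code is an illustration under simple assumptions (per-character invariant lower-casing, substring rather than whole-word matching), not a tuned implementation. It builds a trie of the patterns, adds failure links with a breadth-first pass, and then scans the text once regardless of how many patterns there are.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class for illustration; not a library API.
public sealed class AhoCorasickMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;     // longest proper suffix that is also a trie path
        public bool IsMatch;  // a pattern ends here or somewhere along the failure chain
    }

    private readonly Node _root = new Node();

    public AhoCorasickMatcher(IEnumerable<string> patterns)
    {
        // Step 1: build a trie of all patterns, lower-cased per character.
        foreach (string pattern in patterns)
        {
            if (string.IsNullOrEmpty(pattern)) continue;
            Node node = _root;
            foreach (char raw in pattern)
            {
                char c = char.ToLowerInvariant(raw);
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // Step 2: breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node current = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in current.Next)
            {
                // Follow failure links until a node with a matching edge (or the root).
                Node fallback = current.Fail;
                while (fallback != _root && !fallback.Next.ContainsKey(edge.Key))
                    fallback = fallback.Fail;

                Node target;
                edge.Value.Fail = fallback.Next.TryGetValue(edge.Key, out target) ? target : _root;
                // Inherit matches that end at a suffix of this node's path.
                edge.Value.IsMatch |= edge.Value.Fail.IsMatch;
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Single pass over the text; returns true as soon as any pattern is found.
    public bool ContainsAny(string text)
    {
        Node node = _root;
        foreach (char raw in text)
        {
            char c = char.ToLowerInvariant(raw);
            while (node != _root && !node.Next.ContainsKey(c))
                node = node.Fail;

            Node next;
            if (node.Next.TryGetValue(c, out next))
                node = next;
            if (node.IsMatch) return true;
        }
        return false;
    }
}
```

The matcher would be built once from the censored list and reused for every message, for example var matcher = new AhoCorasickMatcher(listOfCensoredWords); followed by if (matcher.ContainsAny(srMessageTemp)) return;. Like the original IndexOf checks, this finds patterns anywhere in the text, including inside longer words.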

铁轨上的流浪者 2024-12-18 11:16:22

First, I hope you aren't really "tokenizing" the words as written. Just because someone doesn't put a space before a bad word, it doesn't make the word any less bad :-) Example: ,badword,

I would use a Regex here :-) I'm not sure whether a Regex or a hand-written parser would be faster, but a Regex is at least a good starting point. As others wrote, you begin by splitting the text into words and then check each word against a HashSet<string>.

I'm adding a second version of the code, based on ArraySegment<char>. I'll discuss it below.

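using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

class Program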
{
    class ArraySegmentComparer : IEqualityComparer<ArraySegment<char>>
    {
        public bool Equals(ArraySegment<char> x, ArraySegment<char> y)
        {
            if (x.Count != y.Count)
            {
                return false;
            }

            int end = x.Offset + x.Count;

            for (int i = x.Offset, j = y.Offset; i < end; i++, j++)
            {
                if (!x.Array[i].ToString().Equals(y.Array[j].ToString(), StringComparison.InvariantCultureIgnoreCase))
                {
                    return false;
                }
            }

            return true;
        }

        public int GetHashCode(ArraySegment<char> obj)
        {
            unchecked
            {
                int hash = 17;

                int end = obj.Offset + obj.Count;

                int i;

                for (i = obj.Offset; i < end; i++)
                {
                    hash *= 23;
                    hash += Char.ToUpperInvariant(obj.Array[i]);
                }

                return hash;
            }
        }
    }

    static void Main()
    {
        var rx = new Regex(@"\b\w+\b", RegexOptions.Compiled);

        var sampleText = @"For my custom made chat screen i am using the code below for checking censored words. But i wonder can this code performance improved. Thank you.

if (srMessageTemp.IndexOf("" censored1 "") != -1)
return;
if (srMessageTemp.IndexOf("" censored2 "") != -1)
return;
if (srMessageTemp.IndexOf("" censored3 "") != -1)
return;
C# 4.0 . actually list is a lot more long but i don't put here as it goes away.

And now some accented letters àèéìòù and now some letters with unicode combinable diacritics àèéìòù";

        //sampleText += sampleText;
        //sampleText += sampleText;
        //sampleText += sampleText;
        //sampleText += sampleText;
        //sampleText += sampleText;
        //sampleText += sampleText;
        //sampleText += sampleText;

        HashSet<string> prohibitedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase) { "For", "custom", "combinable", "away" };

        Stopwatch sw1 = Stopwatch.StartNew();

        var words = rx.Matches(sampleText);

        foreach (Match word in words)
        {
            string str = word.Value;

            if (prohibitedWords.Contains(str))
            {
                Console.Write(str);
                Console.Write(" ");
            }
            else
            {
                //Console.WriteLine(word);
            }
        }

        sw1.Stop();

        Console.WriteLine();
        Console.WriteLine();

        HashSet<ArraySegment<char>> prohibitedWords2 = new HashSet<ArraySegment<char>>(
            prohibitedWords.Select(p => new ArraySegment<char>(p.ToCharArray())),
            new ArraySegmentComparer());

        var sampleText2 = sampleText.ToCharArray();

        Stopwatch sw2 = Stopwatch.StartNew();

        int startWord = -1;

        for (int i = 0; i < sampleText2.Length; i++)
        {
            if (Char.IsLetter(sampleText2[i]) || Char.IsDigit(sampleText2[i]))
            {
                if (startWord == -1)
                {
                    startWord = i;
                }
            }
            else
            {
                if (startWord != -1)
                {
                    int length = i - startWord;

                    if (length != 0)
                    {
                        var wordSegment = new ArraySegment<char>(sampleText2, startWord, length);

                        if (prohibitedWords2.Contains(wordSegment))
                        {
                            Console.Write(sampleText2, startWord, length);
                            Console.Write(" ");
                        }
                        else
                        {
                            //Console.WriteLine(sampleText2, startWord, length);
                        }
                    }

                    startWord = -1;
                }
            }
        }

        if (startWord != -1)
        {
            int length = sampleText2.Length - startWord;

            if (length != 0)
            {
                var wordSegment = new ArraySegment<char>(sampleText2, startWord, length);

                if (prohibitedWords2.Contains(wordSegment))
                {
                    Console.Write(sampleText2, startWord, length);
                    Console.Write(" ");
                }
                else
                {
                    //Console.WriteLine(sampleText2, startWord, length);
                }
            }
        }

        sw2.Stop();

        Console.WriteLine();
        Console.WriteLine();

        Console.WriteLine(sw1.ElapsedTicks);
        Console.WriteLine(sw2.ElapsedTicks);
    }
}

Note that you could go faster by parsing "in place" in the original string. What this means: if you split the "document" into words and put each word in a string, you are creating n strings, one for each word in the document. But what if you skipped this step and operated directly on the document, keeping only the current index and the length of the current word? Then it would be faster! You would, of course, need a special comparer for the HashSet<>.

But wait! C# has something similar: it's called ArraySegment. Your document would be a char[] instead of a string, and each word would be an ArraySegment<char>. Clearly this is much more complex! You can't simply use Regexes; you have to build a parser "by hand" (but I think converting the \b\w+\b expression is quite easy). Creating the comparer for the HashSet is also a little complex (hint: you would use a HashSet<ArraySegment<char>>, and each word to be censored would be an ArraySegment "pointing" to the char[] of that word, with a size equal to char[].Length, like var word = new ArraySegment<char>("tobecensored".ToCharArray());).

After some simple benchmarks, I can see that an unoptimized version of the program using ArraySegment<char> is about as fast as the Regex version for shorter texts. This is probably because, if a word is 4-6 chars long, copying it around is about as "slow" as copying around an ArraySegment<char> (an ArraySegment<char> is 12 bytes, and a word of 6 characters is 12 bytes; both carry a little overhead on top, but in the end the numbers are comparable). For longer texts (try uncommenting the //sampleText += sampleText; lines) it becomes a little faster (10%) in Release -> Start Without Debugging (CTRL-F5).

Note that comparing strings character by character is wrong. You should always use the methods provided by the string class (or by the OS). They know how to handle "strange" cases much better than you do (and in Unicode there isn't any "normal" case :-) ).
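
One concrete "strange" case is the accented text in the sample above: the same visible letter can be stored either as a single precomposed code point or as a base letter plus a combining mark. The sketch below is only an illustration (the class UnicodeCompareDemo and its members are made-up names); it shows that a raw char-by-char comparison sees two different strings, while normalizing both sides with the string class's own Normalize method first makes them compare equal.

```csharp
using System;
using System.Text;

// Hypothetical demo class; names are placeholders for illustration.
public static class UnicodeCompareDemo
{
    // "é" as one precomposed code point (U+00E9).
    public const string Precomposed = "\u00E9";
    // "é" as 'e' followed by a combining acute accent (U+0301).
    public const string Decomposed = "e\u0301";

    // Char-by-char (ordinal) comparison: the code units differ, so this is false
    // for the two constants above even though they render identically.
    public static bool OrdinalEqual(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    // Normalize both strings to form C, then compare ordinally.
    public static bool NormalizedEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }
}
```

Culture-aware comparisons such as StringComparer.InvariantCultureIgnoreCase, used for the HashSet<string> above, are generally expected to treat such canonically equivalent forms as equal as well, which is one reason to prefer them over a hand-rolled loop.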

怪异←思 2024-12-18 11:16:22

You can use LINQ for this, but it's not required if you use a List to hold your censored values. The solution below uses the built-in list functions and makes the search case-insensitive, provided the entries in the list are lower-case.

    using System.Collections.Generic;

    class Program
    {
        // Entries must be lower-case, because the lookup lower-cases its input.
        private static List<string> _censoredWords = new List<string>()
        {
            "badwordone1",
            "badwordone2",
            "badwordone3",
            "badwordone4",
        };

        static void Main(string[] args)
        {
            string badword1 = "BadWordOne2";
            bool censored = ShouldCensorWord(badword1); // true
        }

        private static bool ShouldCensorWord(string word)
        {
            // ToLowerInvariant avoids culture-specific casing surprises.
            return _censoredWords.Contains(word.ToLowerInvariant());
        }
    }
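
Since the question is about performance, it may be worth noting that List<string>.Contains walks the list entry by entry. A possible variation, sketched below under the assumption that whole words are being looked up, swaps the list for a HashSet<string> with a case-insensitive comparer, so each lookup is a hash probe and no lower-casing is needed. The class name CensorSet is made up for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variation of the answer above; names are placeholders.
public static class CensorSet
{
    // The comparer makes lookups case-insensitive, so entries may use any casing.
    private static readonly HashSet<string> CensoredWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "badwordone1",
            "badwordone2",
            "badwordone3",
            "badwordone4",
        };

    // Returns true if the given word is in the censored set.
    public static bool ShouldCensorWord(string word)
    {
        return CensoredWords.Contains(word);
    }
}
```

Calling CensorSet.ShouldCensorWord("BadWordOne2") then works the same way as in the list-based version.
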
养猫人 2024-12-18 11:16:22

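What do you think about this:

    string[] censoredWords = new[] { " censored1 ", " censored2 ", " censored3 " };

    // Requires using System.Linq. Any(...) checks whether the message contains one of
    // the entries; censoredWords.Contains(srMessageTemp) would only match a message
    // that is exactly equal to an entry.
    if (censoredWords.Any(word => srMessageTemp.Contains(word)))
        return;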