Best approach for word censoring - C# 4.0
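For my custom chat screen I use the code below to check for censored words. I would like to know whether the performance of this code can be improved. Thanks.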
if (srMessageTemp.IndexOf(" censored1 ") != -1)
return;
if (srMessageTemp.IndexOf(" censored2 ") != -1)
return;
if (srMessageTemp.IndexOf(" censored3 ") != -1)
return;
C# 4.0. The actual list is much longer, but I am not putting it here because it would be removed.
I would use LINQ or a regular expression for this:
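The code that followed this answer is not preserved here. A minimal sketch of the regular-expression route might look like the following; the class name, method name, and word list are placeholders rather than the answerer's code. It builds one alternation pattern with word boundaries, so a word is caught even when punctuation rather than a space surrounds it.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexCensor
{
    // Placeholder list; the real list from the question is much longer.
    private static readonly string[] CensoredWords = { "censored1", "censored2", "censored3" };

    // Built once: \b(?:censored1|censored2|censored3)\b
    // Regex.Escape guards against words that contain regex metacharacters.
    private static readonly Regex Pattern = new Regex(
        @"\b(?:" + string.Join("|", CensoredWords.Select(w => Regex.Escape(w)).ToArray()) + @")\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static bool ContainsCensoredWord(string message)
    {
        return Pattern.IsMatch(message);
    }
}
```

The check in the question would then shrink to `if (RegexCensor.ContainsCensoredWord(srMessageTemp)) return;`.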
You can simplify it. Here listOfCencoredWords will contain all the censored words:
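The original snippet is missing, so this is only a guess at its shape: the chain of `if` statements collapses into a single LINQ `Any` call over the list. The wrapper class and method are hypothetical, and the list contents are placeholders padded with spaces as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListCensor
{
    // Identifier kept from the answer; the entries are placeholders.
    private static readonly List<string> listOfCencoredWords =
        new List<string> { " censored1 ", " censored2 ", " censored3 " };

    public static bool ContainsCensoredWord(string srMessageTemp)
    {
        // Any stops at the first hit, like the early returns in the question.
        // StringComparison.Ordinal is an assumption: IndexOf(string) without
        // it performs a culture-sensitive search, which is usually slower.
        return listOfCencoredWords.Any(
            word => srMessageTemp.IndexOf(word, StringComparison.Ordinal) != -1);
    }
}
```

This is mainly a readability gain; it still scans the message once per word, so it should not be expected to run much faster than the original.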
If you want to make it really fast, you can use an Aho-Corasick automaton. This is how antivirus software checks for thousands of viruses at once. But I don't know where you can get a ready-made implementation, so it will require much more work from you than simple, slow methods like regular expressions.
See the theory here: http://en.wikipedia.org/wiki/Aho-Corasick
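For illustration, a compact and unoptimized sketch of such an automaton could look like this. It is not taken from any library; the class and member names are made up, and it only answers "does the text contain any pattern", which is all the question needs. Matching is exact and case-sensitive.

```csharp
using System;
using System.Collections.Generic;

public sealed class AhoCorasick
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;     // longest proper suffix that is also a trie prefix
        public bool IsMatch;  // a pattern ends here, directly or via Fail links
    }

    private readonly Node root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // Step 1: insert every pattern into a trie.
        foreach (string pattern in patterns)
        {
            if (string.IsNullOrEmpty(pattern)) continue;
            Node node = root;
            foreach (char c in pattern)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next.Add(c, child);
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // Step 2: breadth-first pass that sets the failure links.
        // root.Fail stays null and acts as the stop marker.
        var queue = new Queue<Node>();
        queue.Enqueue(root);
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node child = edge.Value;
                Node f = node.Fail;
                while (f != null && !f.Next.ContainsKey(edge.Key)) f = f.Fail;
                child.Fail = (f == null) ? root : f.Next[edge.Key];
                child.IsMatch |= child.Fail.IsMatch;
                queue.Enqueue(child);
            }
        }
    }

    // Single pass over the text, regardless of how many patterns there are.
    public bool ContainsAny(string text)
    {
        Node node = root;
        foreach (char c in text)
        {
            Node next = null;
            while (node != null && !node.Next.TryGetValue(c, out next)) node = node.Fail;
            node = (node == null) ? root : next;
            if (node.IsMatch) return true;
        }
        return false;
    }
}
```

Build it once, for example `var filter = new AhoCorasick(new[] { " censored1 ", " censored2 " });`, and call `filter.ContainsAny(srMessageTemp)` for each message. Whether this beats a compiled regex for a given list size is something to measure rather than assume.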
First, I hope you aren't really "tokenizing" the words as written. You know, just because someone doesn't put a space before a bad word, it doesn't make the word less bad :-) Example:
,badword,
I'll say that I would use a Regex here :-) I'm not sure whether a Regex or a hand-written parser would be faster, but at least a Regex would be a good starting point. As others wrote, you begin by splitting the text into words and then checking a HashSet<string>.
I'm adding a second version of the code, based on ArraySegment<char>. I'll speak about this later.
I'll note that you could go faster by doing the parsing "in" the original string. What does this mean? If you subdivide the "document" into words and each word is put in a string, clearly you are creating n strings, one for each word of your document. But what if you skipped this step and operated directly on the document, simply keeping the current index and the length of the current word? Then it would be faster! Clearly you would then need to create a special comparer for the HashSet<>.
But wait! C# has something similar... It's called ArraySegment. So your document would be a char[] instead of a string, and each word would be an ArraySegment<char>. Clearly this is much more complex! You can't simply use Regexes; you have to build a parser "by hand" (but I think converting the \b\w+\b expression would be quite easy). And creating a comparer for the HashSet<> would be a little complex (hint: you would use HashSet<ArraySegment<char>>, and the words to be censored would be ArraySegments "pointing" to the char[] of a word, with size equal to the char[].Length, like var word = new ArraySegment<char>("tobecensored".ToCharArray());)
After some simple benchmarks, I can see that an unoptimized version of the program using ArraySegment<char> is as fast as the Regex version for shorter texts. This is probably because, if a word is 4-6 chars long, copying it around is as "slow" as copying around an ArraySegment<char> (an ArraySegment<char> is 12 bytes, a word of 6 characters is 12 bytes. On top of both of these we have to add a little overhead... But in the end the numbers are comparable). But for longer texts (try uncommenting the //sampleText += sampleText; line) it becomes a little faster (10%) in Release -> Start Without Debugging (CTRL-F5).
I'll note that comparing strings character by character is wrong. You should always use the methods given to you by the string class (or by the OS). They know how to handle "strange" cases much better than you (and in Unicode there isn't any "normal" case :-) )
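The answer's own code is not preserved, so the following is only a sketch of the ArraySegment idea under stated assumptions. The type and method names are invented, the word scanner merely approximates \b\w+\b (letters, digits and underscore), and the comparer is ordinal and case-sensitive, which is exactly the character-by-character comparison the answer warns about for general Unicode text.

```csharp
using System;
using System.Collections.Generic;

// Compares segments by the characters they cover, not by array identity.
public sealed class CharSegmentComparer : IEqualityComparer<ArraySegment<char>>
{
    public bool Equals(ArraySegment<char> x, ArraySegment<char> y)
    {
        if (x.Count != y.Count) return false;
        for (int i = 0; i < x.Count; i++)
            if (x.Array[x.Offset + i] != y.Array[y.Offset + i]) return false;
        return true;
    }

    public int GetHashCode(ArraySegment<char> s)
    {
        unchecked
        {
            int hash = 17;
            for (int i = 0; i < s.Count; i++)
                hash = hash * 31 + s.Array[s.Offset + i];
            return hash;
        }
    }
}

public static class SegmentCensor
{
    public static HashSet<ArraySegment<char>> BuildSet(IEnumerable<string> words)
    {
        var set = new HashSet<ArraySegment<char>>(new CharSegmentComparer());
        foreach (string word in words)
            set.Add(new ArraySegment<char>(word.ToCharArray()));
        return set;
    }

    // Walks the document in place; no string is allocated per word.
    public static bool ContainsCensoredWord(char[] document, HashSet<ArraySegment<char>> censored)
    {
        int i = 0;
        while (i < document.Length)
        {
            while (i < document.Length && !IsWordChar(document[i])) i++; // skip separators
            int start = i;
            while (i < document.Length && IsWordChar(document[i])) i++;  // consume one word
            if (i > start && censored.Contains(new ArraySegment<char>(document, start, i - start)))
                return true;
        }
        return false;
    }

    // Rough stand-in for \w; the real regex class covers more categories.
    private static bool IsWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }
}
```

With a set built by `SegmentCensor.BuildSet(new[] { "badword" })`, the document `",badword,".ToCharArray()` is flagged even though no spaces surround the word.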
You can use LINQ for this, but it's not required if you use a list to hold your censored values. The solution below uses the built-in list functions and allows you to do your searches case-insensitively.
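The solution itself did not survive on this page. One way it might have looked, using only `List<T>.Exists` and a case-insensitive `IndexOf`, is sketched here; the class name and word list are placeholders.

```csharp
using System;
using System.Collections.Generic;

public static class CaseInsensitiveCensor
{
    // Placeholder list; substitute the real censored values.
    private static readonly List<string> CensoredWords =
        new List<string> { "censored1", "censored2", "censored3" };

    public static bool ContainsCensoredWord(string message)
    {
        // List<T>.Exists is a method on List itself, so no LINQ is involved.
        // OrdinalIgnoreCase makes "CENSORED1" and "censored1" equivalent.
        return CensoredWords.Exists(
            word => message.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

Because this is a plain substring search, it also flags words that merely contain a censored value.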
What do you think about this: