快速文本预处理
在我的项目中,我通常处理文本。我发现预处理可能非常慢。所以我想问你是否知道如何优化我的代码。流程是这样的:
获取HTML页面-> (纯文本 -> 词干提取 -> 删除停用词) ->进一步的文本处理
括号中是预处理步骤。应用程序运行时间约为 10.265 秒,但预处理需要 9.18 秒!这是预处理 50 个 HTML 页面(不包括下载)的时间。
我使用 HtmlAgilityPack 库将 HTML 转换为纯文本。这速度相当快。转换1个文档需要2.5ms,还算可以。
问题稍后才出现。截取一份文档最多需要 120 毫秒。不幸的是,这些 HTML 页面是波兰语的。不存在用 C# 编写的波兰语词干分析器。我知道只有 2 个可以免费使用的用 Java 编写的:stempel 和 morfologic。我在IKVM软件的帮助下将stempel.jar预编译为stempel.dll。所以没有什么可做的了。
消除停用词也需要花费大量时间(1 个文档约 70 毫秒)。它是这样完成的:
result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
string stopword = stopwords.Current.ToString();
result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");
}
return result;
首先我删除所有数字、特殊字符、1 个和 2 个字母的单词。然后在循环中我删除停用词。大约有270个停用词。
有可能让它更快吗?
编辑:
我想要做的是删除所有不超过 2 个字母的单词。所以我想取出所有特殊字符(包括'.',',','?','!'等)数字,停止词。我只需要可用于数据挖掘的纯单词。
In my project I work with text in general. I found that preprocessing can be very slow. So I would like to ask you if you know how to optimize my code. The flow is like this:
get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing
In brackets there are preprocessing steps. The application runs in about 10.265s, but preprocessing takes 9.18s! This is time for preprocessing 50 HTML pages (excluding downloading).
I use HtmlAgilityPack library to convert HTML into plain text. This is quite fast. It takes 2.5ms to convert 1 document, so it's relatively OK.
Problem comes later. Stemming one document takes up to 120ms. Unfortunately those HTML pages are in Polish. There no exists stemmer for Polish language written in C#. I know only 2 free to use written in Java: stempel and morfologic. I precompiled stempel.jar to stempel.dll with help of IKVM software. So there is nothing more to do.
Eliminating stop words takes also a lot of time (~70ms for 1 doc). It is done like this:
result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
string stopword = stopwords.Current.ToString();
result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");
}
return result;
First i remove all numbers, special characters, 1- and 2-letter words. Then in loop I remove stop words. There are about 270 stop words.
Is it possible to make it faster?
Edit:
What I want to do is remove everything which is not a word longer than 2 letters. So I want to get out all special chars (including '.', ',', '?', '!', etc.) numbers, stop words. I need only pure words that I can use for Data Mining.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
迭代地替换单词将成为实现中的最大瓶颈。在每次迭代中,您必须扫描整个字符串以查找停用词,然后替换操作必须分配一个新字符串并填充它替换后的文本。那不会很快。
一种更有效的方法是对字符串进行标记并以流式传输方式执行替换。将输入划分为各个单词,并用适当的空格或分隔符分隔。您可以逐步执行此操作,因此不需要分配任何额外的内存来执行此操作。对于每个单词(标记),您现在可以在停用词的哈希集中执行查找 - 如果找到匹配项,您将在将最终文本流式传输到单独的
StringBuilder
时替换它。如果令牌不是停用词,则只需将其不加修改地从StringBuilder
中流出即可。这种方法应该具有 O(n) 性能,因为它只扫描字符串一次并使用HashSet
执行停用词查找。以下是我希望表现更好的一种方法。虽然它不是完全流式传输(它使用 String.Split() 分配了一组附加字符串),但它在一次传递中完成了所有处理。优化代码以避免分配额外的字符串可能不会提供太大的改进,因为您仍然需要提取子字符串来执行与停用词的比较。
下面的代码返回一个单词列表,其中排除所有停用词以及结果中两个字母或更短的单词。它还对停用词使用不区分大小写的比较。
Iteratively replacing words is going to be the biggest bottleneck in your implementation. On each iteration, you have to scan the entire string for the stopword, then the replace operation has to allocate a new string and populate it with the text post-replacement. That's not going to be fast.
A much more efficient approach is to tokenize the string and perform replacement in a streaming fashion. Divide the input into individual words separated by whatever whitespace or separator characters are appropriate. You can do this incrementally so you don't need to allocate any additional memory to do so. For each word (token), you can now perform a lookup in a hashset of stopwords - if you find match, you will replace it as you stream out the final text to a separate
StringBuilder
. If the token is not a stopword, just stream it out theStringBuilder
unmodified. This approach should have O(n) performance, as it only scans the string once and uses aHashSet
to perform stopword lookup.Below is one approach that I would expect to perform better. While it isn't fully streaming (it uses
String.Split()
which allocated an array of additional strings), it does all of the processing in a single pass. Refining the code to avoid allocating additional string is probably not going to provide much of an improvement, since you still need to extract out substrings to perform comparisons to your stopwords.Code below returns a list of words that excludes all stopwords and words two letters or shorter form the result. It uses case-insensitive comparison on the stopwords as well.
与其在循环中使用正则表达式替换,为什么不动态构造一个与任何一个停用词匹配的怪物匹配正则表达式,然后运行一个替换,将其替换为任何内容?如果您的停用词是“what”、“ok”和“yeah”,则类似于
"\b(what|ok|yeah)\b"
。这看起来可能会更有效率。Instead of having the regex replace in the loop, why not dynamically construct a monster matching regex that matches any one of your stopwords, and then run one replace, replacing it with nothing? Something like
"\b(what|ok|yeah)\b"
if your stopwords are "what", "ok", and "yeah". That seems like it would probably be more efficient.好吧,我知道 SO 不是一个纯粹的论坛,也许我不应该回答我自己的问题,但我想分享我的结果。
最后,感谢你们,我成功地优化了我的文本预处理。首先,我从我的问题中简化了长表达(遵循乔什·凯利的回答):
它与第一个表达相同,但非常简单。然后再次按照 Josh Kelley 的建议,我将这个正则表达式放入汇编中。我在此处找到了将表达式编译为程序集的绝佳示例。我这样做是因为这个正则表达式被使用了很多很多次。在听了几篇有关编译正则表达式的文章后,这是我的决定。我在消除停用词后删除了最后一个表达式(这没有真正的意义)。
因此 12KiB 文本文件的执行时间约为 15 毫秒。这只是为了上面提到的表达。
最后一步是停用词。我决定对 3 个不同的选项进行测试(执行时间针对相同的 12KiB 文本文件)。
一个大的正则表达式,
包含所有停用词并编译成程序集(mquander 的建议)。这里没有什么需要澄清的。
String.Replace()
人们说这比 Regex 更快。因此,对于每个停用词,我使用了 string.Replace() 方法。结果需要进行许多循环:
LINQ
LBushkin 提出的 方法。没什么可说的了。
我只能说哇。只需比较第一个和最后一个的执行时间即可!非常感谢LBushkin!
OK, I know that SO is not a pure forum and maybe I shouldn't answer my own question but I'd like to share with my results.
Finally, thanks to you guys, I managed to get better optimization of my text preprocessing. First of all I made simpler that long expression from my question (following Josh Kelley's answer):
It does the same as first one but is very simple. Then following Josh Kelley's suggestion again I put this regex into assembly. Great example of compiling expressions into assembly I found here. I did that, because this regex is used many, many times. After lecture of few articles about compiled regex, that was my decision. I removed the last expression after eliminating stop words (no real sense with that).
So the execution time on 12KiB text file was ~15ms. This is only for expression mentioned above.
Last step were stop words. I decided to make a test for 3 different options (Execution times are for the same 12KiB text file).
One big Regular Expression
with all stop words and compiled into assembly (mquander's suggestion). Nothing to clear here.
String.Replace()
People say that this can be faster than Regex. So for each stop word I used
string.Replace()
method. Many loops to take with result:LINQ
method presented by LBushkin. Nothing to say more.
I can only say wow. Just compare execution times of first one with the last one! Big thanks LBushkin!
加速您的正则表达式
您的正则表达式可能需要一些工作。
例如,这一行:
使用括号捕获停用词以供以后使用,然后就永远不会使用它。也许 .NET 正则表达式引擎足够聪明,可以在这种情况下跳过捕获,也许不会。
这个正则表达式太复杂了:
"([-]|[.]|[-.]|[0-9])?"
与"([-.0- 9])?”
。 (除非您尝试将“-.”匹配为您的可能性之一?我假设现在不需要。)如果您不需要捕获(并且在您的示例中不需要),那么它与 <代码>“[-.0-9]?”。"[-.0-9]?"
在"[0-9]*"
之前有点多余。您可以进一步将其简化为“[-.]?[0-9]*”
。"([.]|[,])*"
与"[,.]*"
相同。最后,测试是否编译你的正则表达式< /a> 可以产生更好的性能。
减少正则表达式和字符串操作
构造一堆字符串,组成一堆正则表达式对象,然后丢弃它们,就像您在这个循环中所做的那样,可能不是很快:
尝试 pre - 将停用词处理为一组正则表达式对象或创建一个巨大的预编译正则表达式(正如其他人所建议的那样)。
重构您的算法
您似乎只对处理词干、非停用词、文本感兴趣,而不是标点符号、数字等。
为此,您的算法采用以下方法:
我开始在这里写另一种方法,但 LBushkin 抢先了我。按他说的做。请记住,作为一般规则,更改算法通常会比改进正则表达式使用等微观优化带来更大的改进。
Speed up your regexes
Your regexes could use some work.
For example, this line:
uses parentheses to capture the stopword for later use, then it never uses it. Maybe the .NET regex engine is smart enough to skip capturing in this case, maybe not.
This regex is way too complicated:
"([-]|[.]|[-.]|[0-9])?"
is identical to"([-.0-9])?"
. (Unless you're trying to match "-." as one of your possibilities? I'll assume not for now.) If you don't need capturing (and you don't in your example), then it's identical to"[-.0-9]?"
."[-.0-9]?"
is a bit redundant coming before"[0-9]*"
. You can further simplify it to"[-.]?[0-9]*"
."([.]|[,])*"
is identical to"[,.]*"
.Finally, test if compiling your regexes can yield better performance.
Cut down on Regex and string manipulation
Constructing a bunch of strings, making up a bunch of Regex objects, and then discarding them, as you're doing in this loop, is probably not very fast:
Try pre-processing stopwords into an array of Regex objects or creating one monster pre-compiled Regex (as others have suggested).
Restructure your algorithm
It looks like you're only interested in processing stemmed, non-stopword, text, and not punctuation, numbers, etc.
To do that, your algorithm takes the following approach:
I started to write another approach here, but LBushkin beat me to it. Do what he says. Keep in mind, as a general rule, that changing your algorithm usually gives bigger improvements than micro-optimizations like improving your regex usage.
您可能会遇到Schlemiel the Painter 问题。在 C#(和其他语言)中,当您追加或连接字符串时,您实际上是在创建一个全新的字符串。在循环中执行此操作通常会导致大量内存分配,否则这些内存分配是可以避免的。
You might be running into the Schlemiel the Painter problem. In C# (and other languages) when you append or concatenate strings, you are actually creating an entirely new string. Doing this in a loop often causes a LOT of memory allocation which can otherwise be avoided.
我同意 mquander 的观点,这里有更多信息。
每次使用正则表达式时,C# 都会创建一个表来匹配文本。如果您只调用 regex 函数几次,这很好,但您在这里所做的是创建大约 270 个新表,并为每个 html 文档丢弃它们。
我想尝试的只是创建一个 Regex 对象,然后使用 |运算符来匹配所有不同的停用词和第一个过滤器。之后,您应该使用正则表达式编译来汇编,以便让 JIT 生成机器代码。
http://en.csharp-online.net/CSharp_Regular_Expression_Recipes%E2%80% 94Compiling_Regular_Expressions
你应该会看到一个显着的加速
I agree with mquander, and here's a bit more info.
Every time you use an regex, C# will create a table to match the text. This is fine and dandy if you only call the regex function a couple of times, but what you are doing here is creating around 270 new tables and trashing them for each html document.
What I would try is just creating one Regex object, and use the | operator to match all the different stop words and the first filter. After that, you should use regex compilation to assembly in order to let the JIT generate machine code.
http://en.csharp-online.net/CSharp_Regular_Expression_Recipes%E2%80%94Compiling_Regular_Expressions
You should see a dramatic speedup with just this