Fast text preprocessing

Posted 2024-09-12 05:00:29

In my project I generally work with text, and I found that preprocessing can be very slow. So I would like to ask whether you know how to optimize my code. The flow is like this:

get HTML page -> (To plain text -> stemming -> remove stop words) -> further text processing

The steps in parentheses are the preprocessing. The application runs in about 10.265 s, but preprocessing takes 9.18 s! That is the time to preprocess 50 HTML pages (excluding downloading).

I use the HtmlAgilityPack library to convert HTML into plain text. This is quite fast: it takes 2.5 ms to convert one document, so it's relatively OK.

The problem comes later. Stemming one document takes up to 120 ms. Unfortunately, those HTML pages are in Polish, and no stemmer for Polish exists in C#. I know of only 2 free ones written in Java: stempel and morfologic. I precompiled stempel.jar to stempel.dll with the help of IKVM, so there is nothing more to be done about that.

Eliminating stop words also takes a lot of time (~70 ms per document). It is done like this:


result = Regex.Replace(text.ToLower(), @"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])", " ");
while (stopwords.MoveNext())
{
   string stopword = stopwords.Current.ToString();                
   result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");                               
}
return result;

First I remove all numbers, special characters, and 1- and 2-letter words. Then, in a loop, I remove the stop words. There are about 270 stop words.

Is it possible to make it faster?

Edit:

What I want to do is remove everything that is not a word longer than 2 letters. So I want to strip out all special characters (including '.', ',', '?', '!', etc.), numbers, and stop words. I need only plain words that I can use for data mining.

Comments (6)

二智少女 2024-09-19 05:00:30

Iteratively replacing words is going to be the biggest bottleneck in your implementation. On each iteration, you have to scan the entire string for the stopword, then the replace operation has to allocate a new string and populate it with the text post-replacement. That's not going to be fast.

A much more efficient approach is to tokenize the string and perform the replacement in a streaming fashion. Divide the input into individual words separated by whatever whitespace or separator characters are appropriate. You can do this incrementally, so you don't need to allocate any additional memory to do so. For each word (token), you can now perform a lookup in a hash set of stopwords - if you find a match, you replace it as you stream the final text out to a separate StringBuilder. If the token is not a stopword, just stream it out to the StringBuilder unmodified. This approach should have O(n) performance, as it only scans the string once and uses a HashSet to perform the stopword lookup.

Below is one approach that I would expect to perform better. While it isn't fully streaming (it uses String.Split(), which allocates an array of additional strings), it does all of the processing in a single pass. Refining the code to avoid allocating the additional strings is probably not going to provide much of an improvement, since you still need to extract substrings to compare against your stopwords.

The code below returns the list of words, excluding all stopwords and any word two letters or shorter from the result. It also uses case-insensitive comparison for the stopwords.

public IEnumerable<string> SplitIntoWords( string input,
                                           IEnumerable<string> stopwords )
{
    // use case-insensitive comparison when matching stopwords
    var comparer = StringComparer.InvariantCultureIgnoreCase;
    var stopwordsSet = new HashSet<string>( stopwords, comparer );
    var splitOn = new char[] { ' ', '\t', '\r', '\n' };

    // if your splitting is more complicated, you could use RegEx instead...
    // if this becomes a bottleneck, you could use loop over the string using
    // string.IndexOf() - but you would still need to allocate an extra string
    // to perform comparison, so it's unclear if that would be better or not
    var words = input.Split( splitOn, StringSplitOptions.RemoveEmptyEntries );

    // return all words longer than 2 letters that are not stopwords...
    return words.Where( w => !stopwordsSet.Contains( w ) && w.Length > 2 );
}
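
If you do need the rebuilt text as a single string (rather than a word list), the fully streaming variant described above might look roughly like the sketch below. This is only an illustration - the method name, the separator set, and the trailing-space handling are placeholder choices, not part of the measured code.

// Rough sketch: walk the input once with IndexOfAny, look each token up in a HashSet,
// and append the kept tokens to a StringBuilder instead of materializing a word array.
// Uses System, System.Collections.Generic and System.Text.
public string RemoveStopwordsStreaming( string input, IEnumerable<string> stopwords )
{
    var stopwordsSet = new HashSet<string>( stopwords, StringComparer.InvariantCultureIgnoreCase );
    var separators = new char[] { ' ', '\t', '\r', '\n' };
    var output = new StringBuilder( input.Length );

    int pos = 0;
    while ( pos < input.Length )
    {
        // find the end of the current token
        int next = input.IndexOfAny( separators, pos );
        if ( next < 0 ) next = input.Length;

        int length = next - pos;
        if ( length > 2 )  // skips 1- and 2-letter words and empty runs of separators
        {
            // Substring still allocates, but only once per candidate token
            string token = input.Substring( pos, length );
            if ( !stopwordsSet.Contains( token ) )
            {
                output.Append( token ).Append( ' ' );
            }
        }
        pos = next + 1;  // step past the separator
    }
    return output.ToString();
}
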
霞映澄塘 2024-09-19 05:00:30

Instead of having the regex replace in the loop, why not dynamically construct a monster matching regex that matches any one of your stopwords, and then run one replace, replacing it with nothing? Something like "\b(what|ok|yeah)\b" if your stopwords are "what", "ok", and "yeah". That seems like it would probably be more efficient.
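
For illustration, building that pattern dynamically might look roughly like this (assuming stopwords is a string collection and text holds the document; Regex.Escape guards against metacharacters in the stopwords, and LINQ is used for brevity):

// Sketch: build one alternation from the stopword list and run a single replace.
string pattern = @"\b(" + string.Join( "|", stopwords.Select( Regex.Escape ).ToArray() ) + @")\b";
var stopwordRegex = new Regex( pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled );

// one pass over the text instead of ~270
string cleaned = stopwordRegex.Replace( text, " " );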

攒眉千度 2024-09-19 05:00:30

OK, I know that SO is not a pure forum and maybe I shouldn't answer my own question, but I'd like to share my results.

Finally, thanks to you guys, I managed to optimize my text preprocessing further. First of all, I simplified that long expression from my question (following Josh Kelley's answer):

[0-9]|[^\w]|(\b\w{1,2}\b)

It does the same as the first one but is much simpler. Then, again following Josh Kelley's suggestion, I compiled this regex into an assembly. I found a great example of compiling expressions into an assembly here. I did that because this regex is used many, many times, and after reading a few articles about compiled regexes, that was my decision. I also removed the last expression, which ran after eliminating the stop words (it made no real sense).

The execution time on a 12 KiB text file was then ~15 ms. This is only for the expression mentioned above.
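
For illustration, reusing a precompiled filter looks roughly like this sketch (it uses RegexOptions.Compiled for brevity; I actually compiled the regex into a separate assembly, as described in the linked example, and the method name here is just a placeholder):

// Illustration only: compile the simplified filter once and reuse it for every document.
static readonly Regex Filter = new Regex( @"[0-9]|[^\w]|(\b\w{1,2}\b)", RegexOptions.Compiled );

string Preprocess( string text )
{
    return Filter.Replace( text.ToLower(), " " );
}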

The last step was the stop words. I decided to test 3 different options (execution times are for the same 12 KiB text file).

One big Regular Expression

with all stop words, compiled into an assembly (mquander's suggestion). Nothing more to explain here.

  • Execution time: ~215ms

String.Replace()

People say this can be faster than Regex, so for each stop word I used the string.Replace() method. That means many passes over the result:

  • Execution time: ~65ms

LINQ

method presented by LBushkin. Nothing more to say.

  • Execution time: ~2.5ms

I can only say wow. Just compare the execution time of the first option with the last one! Big thanks, LBushkin!

绝對不後悔。 2024-09-19 05:00:30

Speed up your regexes

Your regexes could use some work.

For example, this line:

result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");

uses parentheses to capture the stopword for later use, then never uses the capture. Maybe the .NET regex engine is smart enough to skip capturing in this case, maybe not.

This regex is way too complicated:

"(([-]|[.]|[-.]|[0-9])?[0-9]*([.]|[,])*[0-9]+)|(\b\w{1,2}\b)|([^\w])"
  • "([-]|[.]|[-.]|[0-9])?" is identical to "([-.0-9])?". (Unless you're trying to match "-." as one of your possibilities? I'll assume not for now.) If you don't need capturing (and you don't in your example), then it's identical to "[-.0-9]?".
  • "[-.0-9]?" is a bit redundant coming before "[0-9]*". You can further simplify it to "[-.]?[0-9]*".
  • Similarly, if you don't need capturing, then "([.]|[,])*" is identical to "[,.]*".

Finally, test if compiling your regexes can yield better performance.

Cut down on Regex and string manipulation

Constructing a bunch of strings, making up a bunch of Regex objects, and then discarding them, as you're doing in this loop, is probably not very fast:

result = Regex.Replace(result, "(\\b"+stopword+"\\b)", " ");  

Try pre-processing stopwords into an array of Regex objects or creating one monster pre-compiled Regex (as others have suggested).

Restructure your algorithm

It looks like you're only interested in processing stemmed, non-stopword text, and not punctuation, numbers, etc.

To do that, your algorithm takes the following approach:

  • Stem all text (including stopwords?).
  • Use regexes (not necessarily the fastest approach) to replace non-words with whitespace, which requires continually rebuilding the string.
  • Use regexes (again, not necessarily the fastest approach) to replace stopwords with whitespace, one stopword at a time.

I started to write another approach here, but LBushkin beat me to it. Do what he says. Keep in mind, as a general rule, that changing your algorithm usually gives bigger improvements than micro-optimizations like improving your regex usage.

韵柒 2024-09-19 05:00:30

You might be running into the Schlemiel the Painter problem. In C# (and other languages) when you append or concatenate strings, you are actually creating an entirely new string. Doing this in a loop often causes a LOT of memory allocation which can otherwise be avoided.
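
A small hypothetical illustration of the difference (the words collection below is assumed, not from the original code):

// Hypothetical illustration: += copies the whole accumulated string on every iteration,
// so the total amount of copying grows quadratically with the output length...
string slow = "";
foreach ( string word in words )
    slow += word + " ";

// ...whereas StringBuilder appends into an internal buffer and materializes the string once.
var sb = new StringBuilder();
foreach ( string word in words )
    sb.Append( word ).Append( ' ' );
string fast = sb.ToString();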

怪异←思 2024-09-19 05:00:30

I agree with mquander, and here's a bit more info.
Every time you use a regex, C# will create a table to match the text. This is fine and dandy if you only call the regex function a couple of times, but what you are doing here is creating around 270 new tables and throwing them away for each HTML document.

What I would try is creating just one Regex object, using the | operator to match all the different stop words as well as the first filter. After that, you should compile the regex to an assembly so that the JIT can generate machine code.

http://en.csharp-online.net/CSharp_Regular_Expression_Recipes%E2%80%94Compiling_Regular_Expressions

You should see a dramatic speedup from this alone.
