How do I remove stop words from a large text file?

Posted 2024-09-30 04:43:43

I have a billion word corpus which I have collected in a scalar. I have a .regex file that contains all the stop words that I want to eliminate from my data (text).

I don't know how to use this .regex file, so I have made an array and stored all the stop words of the .regex file in my stop word array.

To remove the stop words I do something like this:

grep { $scalarText =~ s/\b\Q$_\E\b/ /g } @stopList;

This takes a long time to execute. How can I use the .regex file in my Perl script to remove the stop words? Or is there any faster way to remove the stop words?

Comments (3)

岁月静好 2024-10-07 04:43:43

Yes, I imagine what you're doing there is extremely slow, and for a couple of reasons. I think you need to process your stopword regex before you build up your billion-word string from the corpus.

I have no idea what a .regex file is, but I'm going to presume it contains a legal Perl regular expression, something that you can compile using no more than:

$stopword_string = `cat foo.regex`;
$stopword_rx     = qr/$stopword_string/;

That probably presumes that there's a (?x) at the start.

But if your stopword file is a list of lines, you will need to do something more like this:

chomp(@stopwords = `cat foo.regex`);

# if each stopword is an independent regex:
$stopword_string = join "|" => @stopwords;

# else if each stopword is a literal
$stopword_string = join "|" => map {quotemeta} @stopwords;

# now compile it (maybe add some qr//OPTS)
$stopword_rx     = qr/\b(?:$stopword_string)\b/;
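
To illustrate why the literal case wants quotemeta, here is a minimal sketch. The stop list and the test string "eggs" are made up for the demonstration, and the word-boundary anchors are left out on purpose because two of the entries end in non-word characters.

```perl
use strict;
use warnings;

# Hypothetical stop list; "c++" and "e.g." contain regex metacharacters.
my @stopwords = ('the', 'c++', 'e.g.');

# Joined as-is, each entry is interpreted as a regex of its own.
my $raw_alt = join '|', @stopwords;

# Escaped first, each entry only matches itself literally.
my $literal_alt = join '|', map { quotemeta } @stopwords;

# Unescaped, "e.g." means: e, any character, g, any character.
my $raw_matches_eggs     = 'eggs' =~ /\A(?:$raw_alt)\z/     ? 1 : 0;
my $literal_matches_eggs = 'eggs' =~ /\A(?:$literal_alt)\z/ ? 1 : 0;

# Removing the literal entries from a sample string.
my $sample = 'the c++ book, e.g. this one';
(my $cleaned = $sample) =~ s/(?:$literal_alt)//g;

print "raw matches eggs: $raw_matches_eggs\n";
print "literal matches eggs: $literal_matches_eggs\n";
print "cleaned: [$cleaned]\n";
```

The unescaped version compiles without complaint, so this kind of mistake tends to show up as silently wrong matches, not as an error.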

WARNING

Be very careful with \b: it's only going to do what you think it does above if the first character of the first word and the last character of the last word are both alphanumunders (\w characters, that is, alphanumerics or underscore). Otherwise, it will be asserting something you probably don't mean. If that could be a possibility, you will need to be more specific. The leading \b would need to become (?:(?<=\A)|(?<=\s)), and the trailing \b would need to become (?=\s|\z). That's what most people think \b means, but it really doesn't.
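
The difference can be seen with a small sketch. The stop word "c++" and the sample text are hypothetical, chosen only because the word ends in a non-\w character.

```perl
use strict;
use warnings;

# Hypothetical stop word whose last character is not a \w character.
my $word = quotemeta 'c++';

# Version with \b on both sides.
my $with_b = qr/\b$word\b/;

# Version with the whitespace-based assertions described above.
my $with_spaces = qr/(?:(?<=\A)|(?<=\s))$word(?=\s|\z)/;

my $text = 'learn c++ today';

# \b fails after the final "+": "+" and " " are both non-word characters,
# so there is no word boundary between them.
my $b_hit     = $text =~ $with_b      ? 1 : 0;
my $space_hit = $text =~ $with_spaces ? 1 : 0;

print "with \\b: $b_hit, with whitespace assertions: $space_hit\n";
```

Here the \b version finds nothing, while the whitespace-based version matches the word as intended.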

Having done that, you should apply the stopword regex to the corpus as you're reading it in. The best approach is never to put the stuff into your string in the first place, so you don't have to take it out again later.

So instead of doing

$corpus_text = `cat some-giant-file`;
$corpus_text =~ s/$stopword_rx//g;

do this:

my $corpus_path = "/some/path/goes/here";
open(my $corpus_fh, "< :encoding(UTF-8)", $corpus_path)
    || die "$0: couldn't open $corpus_path: $!";

my $corpus_text = q##;

while (<$corpus_fh>) {
    chomp;  # or not
    s/$stopword_rx//g;      # strip the stopwords from this line only
    $corpus_text .= $_;     # keep what is left of the line
}

close($corpus_fh)
    || die "$0: couldn't close $corpus_path: $!";

That will be much faster than putting stuff in there that you just have to weed out again later.

My use of cat above is just a shortcut. I don't expect you to actually call a program, least of all cat, just to read in a single file, unprocessed and unmolested. ☺
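
Without cat, the whole thing might look roughly like the sketch below. It assumes the stopword file holds one literal word per line; the two temporary files and their contents are stand-ins so the example runs on its own, and real paths would replace them.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in files (hypothetical contents) so the sketch is self-contained.
my ($tmp_stop_fh, $stop_path) = tempfile(UNLINK => 1);
print {$tmp_stop_fh} "the\nof\nand\n";
close($tmp_stop_fh) || die "$0: couldn't close $stop_path: $!";

my ($tmp_text_fh, $corpus_path) = tempfile(UNLINK => 1);
print {$tmp_text_fh} "the history of Perl\ncamels and llamas\n";
close($tmp_text_fh) || die "$0: couldn't close $corpus_path: $!";

# Read the stop list directly instead of shelling out to cat.
open(my $list_fh, "<:encoding(UTF-8)", $stop_path)
    || die "$0: couldn't open $stop_path: $!";
chomp(my @stopwords = <$list_fh>);
close($list_fh) || die "$0: couldn't close $stop_path: $!";

# Skip blank lines; an empty alternative would match everywhere.
@stopwords = grep { length } @stopwords;

# Compile one regex for the whole list, once.
my $alternation = join "|", map { quotemeta } @stopwords;
my $stopword_rx = qr/\b(?:$alternation)\b/;

# Clean each line as it is read, so stopwords never enter the big string.
open(my $corpus_fh, "<:encoding(UTF-8)", $corpus_path)
    || die "$0: couldn't open $corpus_path: $!";
my $corpus_text = q##;
while (my $line = <$corpus_fh>) {
    $line =~ s/$stopword_rx//g;
    $corpus_text .= $line;
}
close($corpus_fh) || die "$0: couldn't close $corpus_path: $!";

print $corpus_text;
```

The removed words leave their surrounding spaces behind, so a later whitespace squeeze may be wanted depending on what the corpus is used for.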

梦开始←不甜 2024-10-07 04:43:43

You may want to use Regexp::Assemble to compile a list of Perl regexes into one regex.
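
A minimal sketch of that idea, assuming Regexp::Assemble is installed from CPAN (it is not a core module) and that the stop words are literals. The stop list and sample text are made up for illustration.

```perl
use strict;
use warnings;
use Regexp::Assemble;   # CPAN module, not shipped with Perl

# Hypothetical stop list; "the" and "then" share a prefix.
my @stopList = ('the', 'then', 'a');

# Add each word as a literal pattern and let the module merge them.
my $ra = Regexp::Assemble->new;
$ra->add(quotemeta $_) for @stopList;

# re() returns the single assembled pattern as a compiled regex.
my $assembled   = $ra->re;
my $stopword_rx = qr/\b(?:$assembled)\b/;

my $text = 'the cat and then a dog';
(my $cleaned = $text) =~ s/$stopword_rx//g;

print "[$cleaned]\n";
```

Whether the assembled pattern is noticeably faster than a plain alternation probably depends on the Perl version and the word list, so it is worth timing both on real data.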

萌能量女王 2024-10-07 04:43:43

I found a faster way to do it. Saves me around 4 seconds.

my $qrstring = '\b(' . (join '|', @stopList) . ')\b';
$scalarText =~ s/$qrstring/ /g;

where @stopList is the array of all my stop words
and $scalarText is my whole text.

Can anyone please tell me a faster way if you know any?
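
One commonly used alternative, which is not what any of the posts above do, is to skip the regex and look each token up in a hash. The sketch below assumes whitespace-separated tokens and a case-insensitive match; the stop list and sample line are hypothetical, and punctuation attached to a word (such as "the,") would not be recognised without extra tokenising.

```perl
use strict;
use warnings;

# Hypothetical stop list, loaded into a hash for constant-time lookups.
my @stopList = ('the', 'of', 'and');
my %is_stop;
$is_stop{lc $_} = 1 for @stopList;

my $line = 'The history of Perl and camels';

# Split on whitespace, drop tokens found in the hash, rejoin the rest.
my $kept = join ' ', grep { !$is_stop{lc $_} } split ' ', $line;

print "$kept\n";
```

Applied line by line while reading the file, this keeps the cost per word independent of how long the stop list is.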
