解决执行 Perl 脚本时出现内存不足错误

发布于 2024-12-15 15:47:33 字数 1665 浏览 0 评论 0原文

我正在尝试根据英语维基百科转储中找到的前 100K 个单词构建一个 n-gram 语言模型。我已经使用 Java 编写的修改后的 XML 解析器提取了纯文本，但需要将其转换为词汇文件。

为了做到这一点，我找到了一个 perl 脚本，据说可以完成这项工作，但缺乏如何执行的说明。不用说，我是 Perl 的新手，这是我第一次遇到使用它的需要。

当我运行此脚本时，在两台具有 4GB RAM 并运行 Ubuntu 10.04 和 10.10 的独立双核计算机上的 7.2GB 文本文件上使用此脚本时，出现内存不足错误。

当我联系作者时，他说这个脚本在 4GB RAM 的 MacBook Pro 上运行良好，当使用 perl 5.12 在 6.6GB 文本文件上执行时，总内存使用量约为 78 MB。作者还说该脚本逐行读取输入文件并在内存中创建一个哈希图。

脚本是：

#! /usr/bin/perl

use FindBin;
use lib "$FindBin::Bin";

use strict;
require 'english-utils.pl';

## Create a list of words and their frequencies from an input corpus document
## (format: plain text, words separated by spaces, no sentence separators)

## TODO should words with hyphens be expanded? (e.g. three-dimensional)

my %dict;
my $min_len = 3;
my $min_freq = 1;

while (<>) {

    chomp($_);
    my @words = split(" ", $_);

    foreach my $word (@words) {

        # Check validity against regexp and acceptable use of apostrophe

        if ((length($word) >= $min_len) && ($word =~ /^[A-Z][A-Z\'-]+$/) 
        && (index($word,"'") < 0 || allow_apostrophe($word))) {
            $dict{$word}++;
        }
    }

}

# Output words which occur with the $min_freq or more often

foreach my $dictword (keys %dict) {
    if ( $dict{$dictword} >= $min_freq ) {
        print $dictword . "\t" . $dict{$dictword} . "\n";
    }
}

我通过 mkvocab.pl corpus.txt 从命令行执行此脚本

包含的额外脚本只是一个正则表达式脚本，用于测试撇号的位置以及它们是否符合英语语法规则。

我认为内存泄漏是由于版本不同造成的，因为我的机器上安装的是 5.10。所以我升级到5.14，但错误仍然存在。根据 free -m，我的系统上大约有 1.5GB 可用内存。

由于我完全不熟悉语言的语法和结构，您能否指出问题所在以及问题存在的原因以及如何解决它。

原文

I'm attempting to build a n-gram language model based on the top 100K words found in the english language wikipedia dump. I've already extracted out the plain text with a modified XML parser written in Java, but need to convert it to a vocab file.

In order to do this, I found a perl script that is said to do the job, but lacks instructions on how to execute. Needless to say, I'm a complete newbie to Perl and this is the first time I've encountered a need for its usage.

When I run this script, I'm getting an Out of Memory Error when using this on a 7.2GB text file on two separate dual core machines with 4GB RAM and runnung Ubuntu 10.04 and 10.10.

When I contacted the author, he said this script ran fine on a MacBook Pro with 4GB RAM, and the total in-memory usage was about 78 MB when executed on a 6.6GB text file with perl 5.12. The author also said that the script reads the input file line by line and creates a hashmap in memory.

The script is:

#! /usr/bin/perl

use FindBin;
use lib "$FindBin::Bin";

use strict;
require 'english-utils.pl';

## Create a list of words and their frequencies from an input corpus document
## (format: plain text, words separated by spaces, no sentence separators)

## TODO should words with hyphens be expanded? (e.g. three-dimensional)

my %dict;
my $min_len = 3;
my $min_freq = 1;

while (<>) {

    chomp($_);
    my @words = split(" ", $_);

    foreach my $word (@words) {

        # Check validity against regexp and acceptable use of apostrophe

        if ((length($word) >= $min_len) && ($word =~ /^[A-Z][A-Z\'-]+$/) 
        && (index($word,"'") < 0 || allow_apostrophe($word))) {
            $dict{$word}++;
        }
    }

}

# Output words which occur with the $min_freq or more often

foreach my $dictword (keys %dict) {
    if ( $dict{$dictword} >= $min_freq ) {
        print $dictword . "\t" . $dict{$dictword} . "\n";
    }
}

I'm executing this script from the command line via mkvocab.pl corpus.txt

The included extra script is simply a regex script to test the placement of apostrophe's and whether they match English grammar rules.

I thought the memory leak was due to the different versions, as 5.10 was installed on my machine. So I upgraded to 5.14, but the error still persists. According to free -m, I have approximately 1.5GB free memory on my system.

As I am completely unfamiliar with the syntax and structure of language, can you point out the problem areas along with why the issue exists and how to fix it.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

夜吻♂芭芘 2024-12-22 15:47:34

如果单词中有一些重复，例如 the 出现 17,000 次等，则可以将 7.2Gb 文件加载到哈希中。不过，这似乎相当多。

您的脚本假定文件中的行适当长。如果您的文件不包含换行符，您将把整个文件加载到 $_ 中的内存中，然后使用 split 将内存负载加倍，然后添加相当多的内容更多地进入你的哈希。这会给任何系统带来压力。

一种想法可能是使用空格 " " 作为输入记录分隔符。它的作用与您已经使用 split 所做的差不多，只是它会保留其他空白字符，并且不会漂亮地修剪多余的空白。例如：

$/ = " ";
while (<>) {
    for my $word ( split ) {  # avoid e.g. "foo\nbar" being considered one word
        if (
              (length($word) >= $min_len) &&
              ($word =~ /^[A-Z][A-Z\'-]+$/) &&
              (index($word,"'") < 0 || allow_apostrophe($word))
        ) {
            $dict{$word}++;
        }
    }
}

假设单词之间有空格（而不是制表符或换行符），这将允许以一口大小的块读取非常长的行。

Loading a 7,2Gb file into a hash could be possible if there is some repetition in the words, e.g. the occurs 17,000 times, etc. It seems to be rather a lot, though.

Your script assumes that the lines in the file are appropriately long. If your file does not contain line breaks, you will load the whole file into memory in $_, then double that memory load with split, and then add quite a whole lot more into your hash. Which would strain any system.

One idea may be to use space " " as your input record separator. It will do approximately what you are already doing with split, except that it will leave other whitespace characters alone, and will not trim excess whitespace as prettily. For example:

$/ = " ";
while (<>) {
    for my $word ( split ) {  # avoid e.g. "foo\nbar" being considered one word
        if (
              (length($word) >= $min_len) &&
              ($word =~ /^[A-Z][A-Z\'-]+$/) &&
              (index($word,"'") < 0 || allow_apostrophe($word))
        ) {
            $dict{$word}++;
        }
    }
}

This will allow even very long lines to be read in bite size chunks, assuming you do have spaces between the words (and not tabs or newlines).

回复收藏 0 原文