How do I strip the header/footer from Project Gutenberg texts?

Posted 2024-08-02 04:03:44 · 189 words · 19 views · 0 comments

I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus in a language learning project, but I can't come up with a reliable, unsupervised approach. The best heuristic I have so far is stripping the first 28 lines and the last 398, which worked for a large number of the texts. The boilerplate is very similar across many texts, but with slight differences in each case and a few different templates. Any suggestions for how to strip it automatically, and for how to verify that a text has been stripped accurately, would be very useful.

Comments (4)

桃气十足 2024-08-09 04:03:44

I've also wanted a tool to strip Project Gutenberg headers and footers for years for playing with natural language processing without contaminating the analysis with boilerplate mixed in with the etxt. After reading this question I finally pulled my finger out and wrote a Perl filter which you can pipe through into any other tool.

It's made as a state machine using per-line regexes. It's written to be easy to understand since speed is not an issue with the typical size of etexts. So far it works on the couple dozen etexts I have here but in the wild there are sure to be many more variations which need to be added. Hopefully the code is clear enough that anybody can add to it:


#!/usr/bin/perl

# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

use strict;
use warnings;

my $debug = 0;

my $state = 'beginning';
my $print = 0;
my $printed = 0;

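# while (<>) tests for definedness, so a final line consisting of "0"
# with no trailing newline is not mistaken for end of input
while (<>) {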

    # strip UTF-8 BOM
    if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
        $_ = substr($_, 3);
    }

    if ($state eq 'beginning') {
        if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
            $state = 'normal pg header';
            $debug && print "state: beginning -> normal pg header\n";
            $print = 0;
        } elsif (/^$/) {
            $state = 'beginning blanks';
            $debug && print "state: beginning -> beginning blanks\n";
        } else {
            die "unrecognized beginning: $_";
        }
    } elsif ($state eq 'normal pg header') {
        if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
            $state = 'end of normal header';
            $debug && print "state: normal pg header -> end of normal pg header\n";
        } else {
            # body of normal pg header
        }
    } elsif ($state eq 'end of normal header') {
        if (/^(Produced by|Transcribed from)/) {
            $state = 'post header';
            $debug && print "state: end of normal pg header -> post header\n";
        } elsif (/^$/) {
            # blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: end of normal header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'post header') {
        if (/^$/) {
            $state = 'blanks after post header';
            $debug && print "state: post header -> blanks after post header\n";
        } else {
            # multiline Produced / Transcribed
        }
    } elsif ($state eq 'blanks after post header') {
        if (/^$/) {
            # more blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: blanks after post header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'beginning blanks') {
        if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
            $state = 'header include';
            $debug && print "state: beginning blanks -> header include\n";
        } elsif (/^Title: /) {
            $state = 'aus header';
            $debug && print "state: beginning blanks -> aus header\n";
        } elsif (/^$/) {
            # more blanks
        } else {
            die "unexpected stuff after beginning blanks: $_";
        }
    } elsif ($state eq 'header include') {
        if (/^$/) {
            # blanks after header include
        } else {
            $state = 'aus header';
            $debug && print "state: header include -> aus header\n";
        }
    } elsif ($state eq 'aus header') {
        if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        } elsif (/^A Project Gutenberg of Australia eBook$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        }
    } elsif ($state eq 'end of aus header') {
        if (/^((Title|Author): .*)?$/) {
            # title, author, or blank line
        } else {
            $state = 'etext body';
            $debug && print "state: end of aus header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'etext body') {
        # here's the stuff
        if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        }
    } elsif ($state eq 'footer') {
        # nothing more of interest
    } else {
        die "unknown state '$state'";
    }

    if ($print) {
        print;
        ++$printed;
    } else {
        $debug && print "## $_";
    }
}

萌能量女王 2024-08-09 04:03:44

You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can think of only two approaches, neither of them perfect.

1) Set up a script in, say, Perl, to tackle the most common patterns (e.g., look for the phrase "produced by", keep going down to the next blank line and cut there) but put in lots of assertions about what's expected (e.g. the next text should be the title or author). That way when the pattern fails, you'll know it. The first time a pattern fails, do it by hand. The second time, modify the script.

2) Try Amazon's Mechanical Turk.
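
Approach 1 could be sketched roughly as follows, in Python rather than Perl. This is only an illustration under stated assumptions: the names `cut_after_credits` and `PatternMismatch` and the `search_limit` cap are made up for this example, and the only pattern handled is the "Produced by" credit paragraph. Each expectation that fails raises an error instead of silently producing a badly stripped text.

```python
import re


class PatternMismatch(Exception):
    """Raised when a text does not fit the expected header layout."""


def cut_after_credits(lines, search_limit=100):
    """Return the lines that follow the 'Produced by' credit paragraph.

    search_limit is an arbitrary, illustrative cap on how far down the
    file the credit line is allowed to appear.
    """
    # Look for the credit line near the top of the file.
    start = next(
        (i for i, line in enumerate(lines[:search_limit])
         if re.match(r"produced by", line.strip(), re.IGNORECASE)),
        None,
    )
    if start is None:
        raise PatternMismatch(f"no 'Produced by' line in the first {search_limit} lines")

    # Keep going down to the next blank line; the credit may span several lines.
    end = start
    while end < len(lines) and lines[end].strip():
        end += 1
    if end == len(lines):
        raise PatternMismatch("credit paragraph runs to the end of the file")

    body = lines[end + 1:]

    # Assertions about what is expected next: there must be some text left,
    # and its first non-blank line should not still be boilerplate.
    first = next((line for line in body if line.strip()), None)
    if first is None:
        raise PatternMismatch("nothing left after the credit paragraph")
    if "project gutenberg" in first.lower():
        raise PatternMismatch(f"body still starts with boilerplate: {first!r}")
    return body
```

A batch job could catch `PatternMismatch`, set the offending file aside for manual handling, and extend the function once the same failure shows up a second time.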

萌逼全场 2024-08-09 04:03:44

The gutenbergr package in R seems to do an ok job of removing headers, including junk after the 'official' end of the header.

First you'll need to install R/RStudio, then

install.packages('gutenbergr')
library(gutenbergr)
t <- gutenberg_download('25519')  # give it the id number of the text

The strip_headers argument is TRUE by default.
You will probably also want to remove illustrations:

library(data.table)
t <- as.data.table(t)  # I hate tibbles -- datatables are easier to work with
head(t)  # get the column names

# filter out lines that are illustrations and joins all lines with a space
# the \\[ searches for the [ character, the \\ are used to 'escape' the special [ character
# the !like() means find rows where the text column is not like the search string
no_il <- t[!like(text, '\\[Illustration'), 'text']
# collapse the text into a single character string
t_cln <- do.call(paste, c(no_il, collapse = ' '))

There's also the gutenberg package in Python which is now archived and hard to install, as well as the gutenberg_cleaner Python package which doesn't seem to work that well.

婴鹅 2024-08-09 04:03:44

I am also trying to figure out a way to clean Project Gutenberg text files for text analysis purposes, but I use Julia and I am probably just reinventing the wheel. So I wonder whether the ideas/rules for cleaning Project Gutenberg files can be summarized so that anyone can implement them in any language; I found various nice programs, but no general solution on the internet.
So far, I have found that the end of every text file seems to be well marked by a standard line similar to
"*** END OF THE PROJECT GUTENBERG EBOOK . . .".
However, finding the start of the actual text is harder: there seems to be no equally standard marker (in some cases there is no "*** ..." marker line at all).
Metadata such as title and author, on the other hand, is written in a standard way, for example "Title: ...". So I am trying to exploit that information.
One possibility is to find the last line where the title appears (within the first few dozen lines) and treat everything after it as the "real text"... I will try to keep this answer updated.
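
As a language-neutral summary of these rules, here is a minimal Python sketch. It makes several assumptions: the marker regexes only cover wordings commonly seen in these files, `head_window` is an arbitrary cutoff, and the fallback uses the last metadata line of any kind (not only "Title:"), which is a small variation on the idea above. The names `strip_boilerplate` and `looks_clean` are made up for this example.

```python
import re

# Marker shapes are assumptions; real files show more variation than this.
START_RE = re.compile(r"^\*\*\* ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK", re.IGNORECASE)
END_RE = re.compile(r"^(\*\*\* ?)?END OF (THE |THIS )?PROJECT GUTENBERG", re.IGNORECASE)
META_RE = re.compile(r"^(Title|Author|Release Date|Language):", re.IGNORECASE)


def strip_boilerplate(text, head_window=60):
    """Return the text between the header and the footer.

    head_window is an illustrative limit on how many leading lines are
    searched for metadata when no start marker is present.
    """
    lines = text.splitlines()

    # Rule 1: the body ends at the first end-of-ebook marker, if there is one.
    end = next((i for i, line in enumerate(lines) if END_RE.match(line)), len(lines))

    # Rule 2: the body starts after the start marker, if there is one.
    start = next((i + 1 for i, line in enumerate(lines[:end]) if START_RE.match(line)), None)

    # Rule 3 (fallback): otherwise start after the last metadata line near the top.
    if start is None:
        window = lines[:min(head_window, end)]
        meta = [i for i, line in enumerate(window) if META_RE.match(line)]
        start = meta[-1] + 1 if meta else 0

    return "\n".join(lines[start:end]).strip("\n")


def looks_clean(body):
    """Crude check: a stripped body should no longer mention the project."""
    return "project gutenberg" not in body.lower()
```

`looks_clean` is a rough verification step and can produce false alarms, for instance when a transcriber's note inside the body mentions Project Gutenberg, so a failure is best treated as a flag for manual review rather than as proof of an error.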
