Perl 脚本中的大小写敏感性 - 如何使其不敏感?

发布于 2024-08-24 03:25:59 字数 1009 浏览 14 评论 0原文

我如何更改以下马尔可夫脚本以将大写和小写单词视为相同?

整个想法是帮助提高马尔可夫文本生成器的输出质量。

就目前情况而言,如果您在其中插入 99 个小写句子和 1 个大写句子 - 您几乎总是会在输出中找到大写句子的非标记化版本。

# Copyright (C) 1999 Lucent Technologies
# Excerpted from 'The Practice of Programming'
# by Brian W. Kernighan and Rob Pike

# markov.pl: markov chain algorithm for 2-word prefixes

$MAXGEN = 10000;
$NONWORD = "\n";
$w1 = $w2 = $NONWORD;                    # initial state
while (<>)
{                                        # read each line of input
    foreach (split)
    {
      push(@{$statetab{$w1}{$w2}}, $_);
      ($w1, $w2) = ($w2, $_);        # multiple assignment
    }
}

push(@{$statetab{$w1}{$w2}}, $NONWORD);  # add tail
$w1 = $w2 = $NONWORD;

for ($i = 0; $i < $MAXGEN; $i++) 
{
    $suf = $statetab{$w1}{$w2};      # array reference
    $r = int(rand @$suf);            # @$suf is number of elems
    exit if (($t = $suf->[$r]) eq $NONWORD);
    print "$t\n";
    ($w1, $w2) = ($w2, $t);          # advance chain
}

How would I change the following markov script to treat capitalized and lowercase words as the same?

The entire idea is to help increase the quality of output of my markov text generator.

As it stands, if you plug 99 lowercase sentences into it and 1 capitalized sentence - you almost always find a non-markovized version of the capitalized sentence in the output.

# Copyright (C) 1999 Lucent Technologies
# Excerpted from 'The Practice of Programming'
# by Brian W. Kernighan and Rob Pike

# markov.pl: markov chain algorithm for 2-word prefixes

$MAXGEN = 10000;
$NONWORD = "\n";
$w1 = $w2 = $NONWORD;                    # initial state
while (<>)
{                                        # read each line of input
    foreach (split)
    {
      push(@{$statetab{$w1}{$w2}}, $_);
      ($w1, $w2) = ($w2, $_);        # multiple assignment
    }
}

push(@{$statetab{$w1}{$w2}}, $NONWORD);  # add tail
$w1 = $w2 = $NONWORD;

for ($i = 0; $i < $MAXGEN; $i++) 
{
    $suf = $statetab{$w1}{$w2};      # array reference
    $r = int(rand @$suf);            # @$suf is number of elems
    exit if (($t = $suf->[$r]) eq $NONWORD);
    print "$t\n";
    ($w1, $w2) = ($w2, $t);          # advance chain
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

夏尔 2024-08-31 03:25:59

Nathan Fellman 和 mobrule 都建议采用一种常见做法:标准化

在进行作为程序或子例程的主要目标的实际计算之前,处理数据通常更简单,使其符合内容和结构的预期规范。

马尔可夫链程序很有趣,所以我决定尝试一下。

这是一个允许您控制马尔可夫链中的层数的版本。通过更改$DEPTH,您可以调整模拟的顺序。

我将代码分解为可重用的子例程。您可以通过更改规范化例程来修改规范化规则。您还可以根据一组定义的值生成一条链。

生成多层状态表的代码是最有趣的部分。我本来可以使用 Data::Diver,但我想自己解决。

单词标准化代码确实应该允许标准化器返回要处理的单词列表,而不仅仅是单个单词 - 但我不想修复它现在可以返回单词列表。诸如序列化处理后的语料库之类的其他事情会很好,并且使用 Getopt::Long 进行命令行开关仍有待完成。我只做了有趣的部分。

在不使用对象的情况下编写此内容对我来说是一个挑战 - 这确实感觉像是制作马尔可夫生成器对象的好地方。我喜欢物体。但是,我决定保持代码的程序性,以便保留原始代码的精神。

玩得开心。

#!/usr/bin/perl
use strict;
use warnings;

use IO::Handle;

use constant NONWORD => "-";
my $MAXGEN = 10000;
my $DEPTH  = 2;

my %state_table;

process_corpus( \*ARGV, $DEPTH, \%state_table );
generate_markov_chain( \%state_table, $MAXGEN );


sub process_corpus {
    my $fh    = shift;
    my $depth = shift;
    my $state_table = shift || {};;

    my @history = (NONWORD) x $depth;


    while( my $raw_line = $fh->getline ) {

        my $line = normalize_line($raw_line);
        next unless defined $line;

        my @words = map normalize_word($_), split /\s+/, $line;
        for my $word ( @words ) {

            next unless defined $word; 

            add_word_to_table( $state_table, \@history, $word );
            push  @history, $word;
            shift @history;
        }

    }

    add_word_to_table( $state_table, \@history, NONWORD );

    return $state_table;
}

# This was the trickiest to write.
# $node has to be a reference to the slot so that 
# autovivified items will be retained in the $table.
sub add_word_to_table {
    my $table   = shift;
    my $history = shift;
    my $word    = shift;

    my $node = \$table;

    for( @$history ) {
        $node = \${$node}->{$_};
    }

    push @$node, $word;

    return 1;
}

# Replace this with anything.
# Return undef to skip a word
sub normalize_word {
    my $word = shift;
    $word =~ s/[^A-Z]//g;
    return length $word ? $word : ();
}

# Replace this with anything.
# Return undef to skip a line
sub normalize_line {
    return uc shift;
}


sub generate_markov_chain {
    my $table   = shift;
    my $length  = shift;
    my $history = shift || [];

    my $node = $table;

    unless( @$history ) {

        while( 
            ref $node eq ref {}
                and
            exists $node->{NONWORD()} 
        ) {
            $node = $node->{NONWORD()};
            push @$history, NONWORD;
        }

    }

    for (my $i = 0; $i < $MAXGEN; $i++) {

        my $word = get_word( $table, $history );

        last if $word eq NONWORD;
        print "$word\n";

        push @$history, $word;
        shift @$history;
    }

    return $history;
}


sub get_word {
    my $table   = shift;
    my $history = shift;

    for my $step ( @$history ) {
        $table = $table->{$step};
    }

    my $word = $table->[ int rand @$table ];
    return $word;
}

更新:
我修复了上面的代码以处理从 normalize_word() 例程返回的多个单词。

要保持大小写完整并将标点符号视为单词,请替换 normalize_line()normalize_word()

sub normalize_line {
    return shift;
}

sub normalize_word {
    my $word = shift;

    # Sanitize words to only include letters and ?,.! marks 
    $word =~ s/[^A-Z?.,!]//gi;

    # Break the word into multiple words as needed.
    my @words = split /([.?,!])/, $word;

    # return all non-zero length words. 
    return grep length, @words;
}

另一个潜在的大问题是我使用了 - 作为非字字符。如果要包含连字符作为标点符号,则需要更改第 8 行的 NONWORD 常量定义。只需选择永远不能是单词的内容即可。

Nathan Fellman and mobrule are both suggesting a common practice: Normalization.

It's often simpler to process data so that it conforms to expected norms of content and structure, before doing the actual computation that is the main goal of the program or subroutine.

The Markov chain program was interesting, so I decided to play with it.

Here's a version that allows you to control the number of layers in the Markov chain. By changing $DEPTH you can adjust the order of the simulation.

I broke the code into reusable subroutines. You can modify the normalization rules by changing the normalization routines. You can also generate a chain based on a defined set of values.

The code to generate the multi-layer state table was the most interesting bit. I could have used Data::Diver, but I wanted to work it out myself.

The word normalization code really should allow the normalizer to return a list of words to process, rather than just a single word--but I don't feel like fixing it now can return a list of words.. Other things like serializing your processed corpus would be good, and using Getopt::Long for command line switches remain to do. I only did the fun bits.

It was a bit of a challenge for me to write this without using objects--this really felt like a good place to make a Markov generator object. I like objects. But, I decided to keep the code procedural so it would retain the spirit of the original.

Have fun.

#!/usr/bin/perl
use strict;
use warnings;

use IO::Handle;

use constant NONWORD => "-";
my $MAXGEN = 10000;
my $DEPTH  = 2;

my %state_table;

process_corpus( \*ARGV, $DEPTH, \%state_table );
generate_markov_chain( \%state_table, $MAXGEN );


sub process_corpus {
    my $fh    = shift;
    my $depth = shift;
    my $state_table = shift || {};;

    my @history = (NONWORD) x $depth;


    while( my $raw_line = $fh->getline ) {

        my $line = normalize_line($raw_line);
        next unless defined $line;

        my @words = map normalize_word($_), split /\s+/, $line;
        for my $word ( @words ) {

            next unless defined $word; 

            add_word_to_table( $state_table, \@history, $word );
            push  @history, $word;
            shift @history;
        }

    }

    add_word_to_table( $state_table, \@history, NONWORD );

    return $state_table;
}

# This was the trickiest to write.
# $node has to be a reference to the slot so that 
# autovivified items will be retained in the $table.
sub add_word_to_table {
    my $table   = shift;
    my $history = shift;
    my $word    = shift;

    my $node = \$table;

    for( @$history ) {
        $node = \${$node}->{$_};
    }

    push @$node, $word;

    return 1;
}

# Replace this with anything.
# Return undef to skip a word
sub normalize_word {
    my $word = shift;
    $word =~ s/[^A-Z]//g;
    return length $word ? $word : ();
}

# Replace this with anything.
# Return undef to skip a line
sub normalize_line {
    return uc shift;
}


sub generate_markov_chain {
    my $table   = shift;
    my $length  = shift;
    my $history = shift || [];

    my $node = $table;

    unless( @$history ) {

        while( 
            ref $node eq ref {}
                and
            exists $node->{NONWORD()} 
        ) {
            $node = $node->{NONWORD()};
            push @$history, NONWORD;
        }

    }

    for (my $i = 0; $i < $MAXGEN; $i++) {

        my $word = get_word( $table, $history );

        last if $word eq NONWORD;
        print "$word\n";

        push @$history, $word;
        shift @$history;
    }

    return $history;
}


sub get_word {
    my $table   = shift;
    my $history = shift;

    for my $step ( @$history ) {
        $table = $table->{$step};
    }

    my $word = $table->[ int rand @$table ];
    return $word;
}

Update:
I fixed the above code to handle multiple words coming back from the normalize_word() routine.

To leave case intact and treat punctuation symbols as words, replace normalize_line() and normalize_word():

sub normalize_line {
    return shift;
}

sub normalize_word {
    my $word = shift;

    # Sanitize words to only include letters and ?,.! marks 
    $word =~ s/[^A-Z?.,!]//gi;

    # Break the word into multiple words as needed.
    my @words = split /([.?,!])/, $word;

    # return all non-zero length words. 
    return grep length, @words;
}

The other big lurking gotcha is that I used - as the NONWORD character. If you want to include a hyphen as a punctuation symbol, you will need to change the NONWORD constant definition at line 8. Just choose something that can never be a word.

云胡 2024-08-31 03:25:59

在处理之前将所有输入转换为小写?

请参阅lc 函数

Convert all your input to lowercase before processing it?

See the lc function.

岛歌少女 2024-08-31 03:25:59

我认为最好的选择是输入单词后立即将其小写(或大写):

while (<>)
{                                        # read each line of input
    lc; # convert $_ to lowercase
    # etc.
}

I think the best bet would be to lowercase (or uppercase) the words as soon as they're input:

while (<>)
{                                        # read each line of input
    lc; # convert $_ to lowercase
    # etc.
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文