How do I count the characters, words, and lines in a file using Perl?

Posted on 2024-07-17 09:42:29

What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?

Comments (10)

月隐月明月朦胧 2024-07-24 09:42:29

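Here's the Perl code. Counting words can be somewhat subjective, but here a word is any run of characters that isn't whitespace.

use strict;
use warnings;

# Three-argument open with a lexical filehandle
open(my $fh, '<', 'file.txt') or die "Could not open file: $!";

my ($lines, $words, $chars) = (0, 0, 0);

while (my $line = <$fh>) {
    $lines++;
    $chars += length($line);    # includes the trailing newline
    # Before Perl 5.12, split in scalar context also overwrote @_ and warned.
    # A line with leading whitespace yields an empty first field.
    $words += scalar(split(/\s+/, $line));
}
close($fh);

print("lines=$lines words=$words chars=$chars\n");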
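The loop above counts an extra word when a line starts with whitespace, because split(/\s+/) then returns an empty leading field. A minimal sketch of one way around that, assuming the same definition of a word: split on the special pattern ' ' (which skips leading whitespace) and count the fields through an array. The helper name count_text and the sample string are made up for illustration, and the text is read through an in-memory filehandle only so that the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (lines, words, chars) for a string.
sub count_text {
    my ($text) = @_;
    my ($lines, $words, $chars) = (0, 0, 0);

    # Read the string line by line through an in-memory filehandle.
    open(my $fh, '<', \$text) or die "Could not open in-memory handle: $!";
    while (my $line = <$fh>) {
        $lines++;
        $chars += length($line);    # includes the trailing newline

        # split ' ' skips leading whitespace, so no empty leading
        # field is produced.
        my @fields = split ' ', $line;
        $words += @fields;          # an array in numeric context is its size
    }
    close($fh);

    return ($lines, $words, $chars);
}

my ($lines, $words, $chars) = count_text("  The  quick  brown fox\nsecond line\n");
print "lines=$lines words=$words chars=$chars\n";
```

Counting through an array rather than calling split in scalar context also avoids the "implicit split to @_" warning on Perl versions older than 5.12.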

执笔绘流年 2024-07-24 09:42:29

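A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or even better \W+) as the delimiter. Consider the string "The  quick  brown fox" (with two spaces after "The" and after "quick", in case that isn't obvious). Using a single whitespace character as the delimiter gives a word count of six, not four. So, try:

use strict;
use warnings;

# Three-argument open with a lexical filehandle
open(my $fh, '<', 'file.txt') or die "Could not open file: $!";

my ($lines, $words, $chars) = (0, 0, 0);

while (my $line = <$fh>) {
    $lines++;
    $chars += length($line);    # includes the trailing newline
    # Before Perl 5.12, split in scalar context also overwrote @_ and warned.
    $words += scalar(split(/\W+/, $line));
}
close($fh);

print("lines=$lines words=$words chars=$chars\n");

Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words. Be aware that it also splits on apostrophes and hyphens, so "don't" counts as two words.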
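To make the difference between the delimiters concrete, here is a small sketch that counts the fields split() returns for each pattern. The helper name count_fields and the sample strings are illustrative assumptions, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: number of fields split() produces for a delimiter.
sub count_fields {
    my ($delimiter, $string) = @_;
    my @fields = split $delimiter, $string;
    return scalar @fields;
}

my $spaced = "The  quick  brown fox";    # two spaces after "The" and "quick"
for my $delimiter (qr/ /, qr/\s+/, qr/\W+/) {
    printf "%-12s %d fields\n", $delimiter, count_fields($delimiter, $spaced);
}

# \W+ treats a lone dash as a separator, while \s+ counts it as a word.
printf "%d vs %d\n",
    count_fields(qr/\s+/, "stop - look"),
    count_fields(qr/\W+/, "stop - look");
```

The single-space pattern reports the empty strings between adjacent spaces as fields, which is where the count of six comes from.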

も让我眼熟你 2024-07-24 09:42:29

The Word Count tool counts characters, words and lines in text files

漫雪独思 2024-07-24 09:42:29

Here. Try this Unicode-savvy version of the wc program.

  • It skips non-file arguments (pipes, directories, sockets, etc).

  • It assumes UTF-8 text.

  • It counts any Unicode whitespace as a word separator.

  • It also accepts alternate encodings if there is a .ENCODING at the end of the filename, like foo.cp1252, foo.latin1, foo.utf16, etc.

  • It also works with files that have been compressed in a variety of formats.

  • It gives counts of Paragraphs, Lines, Words, Graphemes, Characters, and Bytes.

  • It understands all Unicode linebreak sequences.

  • It warns about corrupted textfiles with linebreak errors.

Here’s an example of running it:

   Paras    Lines    Words   Graphs    Chars    Bytes File
       2     2270    82249   504169   504333   528663 /tmp/ap
       1     2404    11163    63164    63164    66336 /tmp/b3
    uwc: missing linebreak at end of corrupted textfiile /tmp/bad
      1*       2*        4       19       19       19 /tmp/bad
       1       14       52      273      273      293 /tmp/es
      57      383     1369    11997    11997    12001 /tmp/funny
       1   657068  3175429 31205970 31209138 32633834 /tmp/lw
       1        1        4       27       27       27 /tmp/nf.cp1252
       1        1        4       27       27       34 /tmp/nf.euc-jp
       1        1        4       27       27       27 /tmp/nf.latin1
       1        1        4       27       27       27 /tmp/nf.macroman
       1        1        4       27       27       54 /tmp/nf.ucs2
       1        1        4       27       27       56 /tmp/nf.utf16
       1        1        4       27       27       54 /tmp/nf.utf16be
       1        1        4       27       27       54 /tmp/nf.utf16le
       1        1        4       27       27      112 /tmp/nf.utf32
       1        1        4       27       27      108 /tmp/nf.utf32be
       1        1        4       27       27      108 /tmp/nf.utf32le
       1        1        4       27       27       39 /tmp/nf.utf7
       1        1        4       27       27       31 /tmp/nf.utf8
       1    26906   101528   635841   636026   661202 /tmp/o2
     131      346     1370     9590     9590     4486 /tmp/perl5122delta.pod.gz
     291      814     3941    25318    25318     9878 /tmp/perl51310delta.pod.bz2
       1     2551     5345   132655   132655   133178 /tmp/tailsort-pl.utf8
       1       89      334     1784     1784     2094 /tmp/til
       1        4       18       88       88      106 /tmp/w
     276     1736     5773    53782    53782    53804 /tmp/www

Here ya go:

#!/usr/bin/env perl 
#########################################################################
# uniwc - improved version of wc that works correctly with Unicode
#
# Tom Christiansen <[email protected]>
# Mon Feb 28 15:59:01 MST 2011
#########################################################################

use 5.10.0;

use strict;
use warnings FATAL => "all";
use sigtrap qw[ die untrapped normal-signals ];

use Carp;

$SIG{__WARN__}  = sub {
    confess("FATALIZED WARNING: @_")  unless $^S;
};

$SIG{__DIE__}  = sub {
    confess("UNCAUGHT EXCEPTION: @_")  unless $^S;
};

$| = 1;

my $Errors = 0;
my $Headers = 0;

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

process_input(\&countem);

sub countem { 
    local $_ = shift;    # lexical "my $_" was removed in Perl 5.24
    my $file = shift;

    my (
        @paras, @lines, @words,
        $paracount, $linecount, $wordcount, 
        $grafcount, $charcount, $bytecount,
    );

    if ($charcount = length($_)) {
        $wordcount = eval { @words = split m{ \p{Space}+  }x }; 
        yuck "error splitting words: $@" if $@;

        $linecount = eval { @lines = split m{ \R     }x }; 
        yuck "error splitting lines: $@" if $@;

        $grafcount = 0;
        $grafcount++ while /\X/g;

        $paracount = eval { @paras = split m{ \R{2,} }x }; 
        yuck "error splitting paras: $@" if $@;

        if ($linecount && !/\R\z/) {
            yuck("missing linebreak at end of corrupted textfiile $file");
            $linecount .= "*";
            $paracount .= "*";
        } 
    }

    $bytecount = tell;
    if (-e $file) {
        $bytecount = -s $file;
        if ($bytecount != -s $file) {
            yuck "filesize of $file differs from bytecount\n";
            $Errors++;
        }
    } 
    my $mask = "%8s " x 6 . "%s\n";
    printf  $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;

    printf $mask => map( { show_undef($_) } 
                                $paracount, $linecount, 
                                $wordcount, $grafcount, 
                                $charcount, $bytecount,
                       ), $file;
} 

sub show_undef {
    my $value = shift;
    return defined($value)
             ? $value
             : "undef";
} 

END { 
    close(STDOUT) || die "$0: can't close STDOUT: $!";
    exit($Errors != 0);
}

sub process_input {

    my $function = shift();

    my $enc;

    if (@ARGV == 0 && -t) {
        warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:

    for my $file (@ARGV) {
        # don't let magic open make an output handle

        next if -e $file && ! -f _;

        my $quasi_filename = fix_extension($file);

        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! ☺
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        $function->($whole_file, $file);

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }

    } # foreach file

}

sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        binmode($handle, ":$enc_name");
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }

    yuck("set_encoding: $@");

    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       =>  "zcat",
        z       => "gzcat",            # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            return "$prog $path |";
        } 
    } 

    return $path;

}
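The Graphs, Chars, and Bytes columns differ because one user-visible character can span several code points, and one code point can span several bytes. A minimal sketch of the three measurements on a single decoded string, using the same \X idiom as the script above; the helper name text_sizes and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns (graphemes, chars, bytes) for a decoded string.
sub text_sizes {
    my ($text) = @_;
    my $graphemes = () = $text =~ /\X/g;              # extended grapheme clusters
    my $chars     = length $text;                     # code points
    my $bytes     = length encode('UTF-8', $text);    # size once encoded as UTF-8
    return ($graphemes, $chars, $bytes);
}

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT: the final "e" and
# the accent form one grapheme made of two code points.
my ($graphemes, $chars, $bytes) = text_sizes("cafe\x{301}");
print "graphemes=$graphemes chars=$chars bytes=$bytes\n";
```

The list assignment to () forces the match into list context, so the outer scalar assignment receives the number of matches rather than a true/false value.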

瞳孔里扚悲伤 2024-07-24 09:42:29

I stumbled upon this while googling for a character count solution.
Admittedly, I know next to nothing about Perl, so some of this may be off base, but here are my tweaks to newt's solution.

First, there is a built-in line count variable ($.) anyway, so I just used that. This is probably a bit more efficient, I guess.
As it stands, the character count includes newline characters, which is probably not what you want, so I chomped $_.
Perl also complained about the way the split() is done (implicit split, see: Why does Perl complain "Use of implicit split to @_ is deprecated"?), so I tweaked that.
My input files are UTF-8, so I opened them as such. That probably helps get the correct character count if the input file contains non-ASCII characters.

Here's the code:

open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";

寂寞陪衬 2024-07-24 09:42:29

There is the Perl Power Tools project whose goal is to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix. Yes, they did wc. The implementation is overkill, but it is POSIX compliant.

It gets a little ridiculous when you look at the GNU compliant implementation of true.

梦归所梦 2024-07-24 09:42:29

Non-serious answer:

system("wc foo");

友欢 2024-07-24 09:42:29

Reading the file in fixed-size chunks may be more efficient than reading line-by-line. The wc binary does this.

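#!/usr/bin/env perl

use strict;
use warnings;

use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;    # skip to the next file
    };

    my ($chars, $words, $lines) = (0, 0, 0);

    # Words are counted as runs of whitespace, so this assumes the text
    # does not start with whitespace and ends with a newline.
    my $new_word;    # true if the previous block ended in whitespace
    while (my $size = sysread($fh, my $buf, BLOCK_SIZE)) {
        $chars += $size;                            # sysread counts bytes
        $words += () = $buf =~ /\s+/g;              # runs of whitespace
        $words-- if $new_word && $buf =~ /\A\s/;    # run continued from the last block
        $lines += () = $buf =~ /\n/g;               # newlines, as wc -l counts them

        $new_word = $buf =~ /\s\z/;
    }
    close $fh;

    print "\t$lines\t$words\t$chars\t$file\n";
}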
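The delicate part of block-wise counting is a word that straddles two blocks. A minimal sketch of one way to handle that, assuming words are runs of non-whitespace: count the runs in each block and subtract one when a block merely continues a word from the previous block. The helper name count_chunks is hypothetical, and the chunks are passed in as plain strings so that the example runs without a file.

```perl
use strict;
use warnings;

# Hypothetical helper: counts (lines, words, chars) over a list of
# chunks, as if each chunk were one block returned by sysread.
sub count_chunks {
    my @chunks = @_;
    my ($lines, $words, $chars) = (0, 0, 0);
    my $in_word = 0;    # true if the previous chunk ended inside a word

    for my $chunk (@chunks) {
        next unless length $chunk;
        $chars += length $chunk;
        $lines += () = $chunk =~ /\n/g;

        # Every run of non-whitespace is a word ...
        $words += () = $chunk =~ /\S+/g;
        # ... unless it merely continues a word from the previous chunk.
        $words-- if $in_word && $chunk =~ /\A\S/;

        $in_word = $chunk =~ /\S\z/ ? 1 : 0;
    }

    return ($lines, $words, $chars);
}

# "quick" is split across the first two chunks.
my ($lines, $words, $chars) = count_chunks("The qui", "ck brown", " fox\n");
print "\t$lines\t$words\t$chars\n";
```

Counting non-whitespace runs instead of whitespace runs means text that starts with whitespace, or that lacks a final newline, is still counted correctly.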

永言不败 2024-07-24 09:42:29

To be able to count CHARS and not bytes, consider this:
(Try it with Chinese or Cyrillic letters and file saved in utf8)

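use utf8;    # only declares that this source file is UTF-8; not needed for reading

my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
  || die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );         # decode the bytes into characters on read
read $fh, my $txt, -s $file;    # the byte size is an upper bound on the character count
close $fh;

print length $txt,$/;           # length in characters
use bytes;                      # from here on, length() counts bytes
print length $txt,$/;           # length in bytes of Perl's internal representation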
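The same comparison can be sketched without the bytes pragma by reading the decoded text for the character count and asking the file system for the byte count. This is only an illustration under stated assumptions: the helper name chars_and_bytes is made up, and the sample file is a temporary file created just for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns (characters, bytes) for a UTF-8 file.
sub chars_and_bytes {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "$path couldn't be opened: $!";
    my $text = do { local $/; <$fh> };    # slurp the whole file, decoded
    close($fh);
    return (length($text), -s $path);
}

# Write a small sample file containing six Cyrillic letters and a newline.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}\n";
close($out);

my ($chars, $bytes) = chars_and_bytes($path);
print "chars=$chars bytes=$bytes\n";
```

Each Cyrillic letter occupies two bytes in UTF-8, so the byte count is larger than the character count for this file.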

薄情伤 2024-07-24 09:42:29

This may be helpful to Perl beginners.
I tried to simulate Microsoft Word's word-count feature and added one more count that wc on Linux does not show.

  • number of lines
  • number of words
  • number of characters with spaces
  • number of characters without spaces (wc does not give this in its output, but Microsoft Word shows it)

Here is the URL: Counting words, characters and lines in a file
