How do I gather character usage statistics for a text file using Unix commands?

Posted on 2024-10-01 18:45:46

I have a text file created with OCR software, about one megabyte in size.
Some uncommon characters appear throughout the document, and most of them are OCR errors.

I would like to find all the characters used in the document so I can easily spot errors (like the uniq command, but for characters rather than lines).

I am on Ubuntu.
What Unix command should I use to display all the characters used in a text file?


Comments (3)

江城子 2024-10-08 18:45:46

This should do what you're looking for:

sed 's/\(.\)/\1\n/g' inputfile | sort | uniq -c

The idea is that sed puts each character of the file on a line by itself; the usual sort | uniq -c sequence then collapses each run of identical characters into a single line and prefixes it with the number of times that character occurred. Note that \n in the replacement text is understood by GNU sed (the default on Ubuntu) but not by every other sed.

Also, you could append | sort -n to the end of the whole sequence to sort the output by how many times each character occurred. Example:

$ echo hello |  sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
  1 
  1 e
  1 h
  1 o
  2 l
以为你会在 2024-10-08 18:45:46

This will do it:

#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <[email protected]>

use open ":utf8";

++$seen{ ord() } for split //;

END {
    for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {
        printf "%04X %d\n", $cp, $seen{$cp};
    }
}

Run on itself, that program produces:

$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9

If you want the literal character and/or name of the character, too, that’s easy to add.
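
As a rough illustration of that extension, here is a minimal Python sketch (not the Perl above; the function name `describe_counts` is made up for this example). It reports the code point, count, literal character, and Unicode name using the standard `unicodedata` module, and it assumes the text has already been decoded into a `str`.

```python
import unicodedata
from collections import Counter


def describe_counts(text):
    """Return 'code point, count, character, name' lines, most frequent first.

    Hypothetical helper for illustration only.
    """
    lines = []
    for ch, n in Counter(text).most_common():
        # Control characters such as "\n" have no name; use a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        # repr() keeps control characters visible instead of printing them raw.
        lines.append(f"{ord(ch):04X} {n:6d} {ch!r} {name}")
    return lines


if __name__ == "__main__":
    for line in describe_counts("héllo wörld\n"):
        print(line)
```

Showing the character with `repr()` is a deliberate choice here: stray control characters from OCR output stay visible in the listing rather than disturbing the terminal.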

If you want something more sophisticated, this program tallies characters by Unicode property (general category). It may be enough for your purposes, and if not, you should be able to adapt it.

#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <[email protected]>

use strict;
use warnings qw<FATAL all>;

use open ":utf8";

my %cats;

our %Prop_Table;
build_prop_table();

if (@ARGV == 0 && -t STDIN) {
    warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING

} 

while (<>) {
    for (split(//)) {

        $cats{Total}++;

        if (/\p{ASCII}/) { $cats{ASCII}++   } 
        else             { $cats{Unicode}++ } 

        my $gcat   = get_general_category($_);
        $cats{$gcat}++;

        my $subcat = get_general_subcategory($_);
        $cats{$subcat}++;

    } 
} 

my $width = length $cats{Total};

my $mask = "%*d %s\n";

for my $cat(qw< Total ASCII Unicode >) { 
    printf $mask, $width => $cats{$cat} || 0, $cat; 
}
print "\n";

my @catnames = qw[
    L Lu Ll Lt Lm Lo
    N Nd Nl No
    S Sm Sc Sk So
    P Pc Pd Ps Pe Pi Pf Po
    M Mn Mc Me
    Z Zs Zl Zp
    C Cc Cf Cs Co Cn
];

#for my $cat (sort keys %cats) {
for my $cat (@catnames) {
    next if length($cat) > 2;
    next unless $cats{$cat};

    my $prop = length($cat) == 1 
                 ? ( " " . q<\p> .   $cat          )
                 : (       q<\p> . "{$cat}" . "\t" )
             ;

    my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});

    printf $mask, $width => $cats{$cat}, $desc;
} 

exit;

sub get_general_category {
    local $_ = shift();   # "my $_" (lexical $_) was removed in Perl 5.24
    return "L" if /\pL/;
    return "S" if /\pS/;
    return "P" if /\pP/;
    return "N" if /\pN/;
    return "C" if /\pC/;
    return "M" if /\pM/;
    return "Z" if /\pZ/;

    die "not reached one: $_";
} 

sub get_general_subcategory {
    local $_ = shift();   # "my $_" (lexical $_) was removed in Perl 5.24

    return "Lu" if /\p{Lu}/;
    return "Ll" if /\p{Ll}/;
    return "Lt" if /\p{Lt}/;
    return "Lm" if /\p{Lm}/;
    return "Lo" if /\p{Lo}/;

    return "Mn" if /\p{Mn}/;
    return "Mc" if /\p{Mc}/;
    return "Me" if /\p{Me}/;

    return "Nd" if /\p{Nd}/;
    return "Nl" if /\p{Nl}/;
    return "No" if /\p{No}/;

    return "Pc" if /\p{Pc}/;
    return "Pd" if /\p{Pd}/;
    return "Ps" if /\p{Ps}/;
    return "Pe" if /\p{Pe}/;
    return "Pi" if /\p{Pi}/;
    return "Pf" if /\p{Pf}/;
    return "Po" if /\p{Po}/;

    return "Sm" if /\p{Sm}/;
    return "Sc" if /\p{Sc}/;
    return "Sk" if /\p{Sk}/;
    return "So" if /\p{So}/;

    return "Zs" if /\p{Zs}/;
    return "Zl" if /\p{Zl}/;
    return "Zp" if /\p{Zp}/;

    return "Cc" if /\p{Cc}/;
    return "Cf" if /\p{Cf}/;
    return "Cs" if /\p{Cs}/;
    return "Co" if /\p{Co}/;
    return "Cn" if /\p{Cn}/;

    die "not reached two: <$_> " . sprintf("U+%vX", $_);

}

sub build_prop_table { 

    for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {

       L           Letter
       Lu          Uppercase_Letter
       Ll          Lowercase_Letter
       Lt          Titlecase_Letter
       Lm          Modifier_Letter
       Lo          Other_Letter

       M           Mark  (combining characters, including diacritics)
       Mn          Nonspacing_Mark
       Mc          Spacing_Mark
       Me          Enclosing_Mark

       N           Number
       Nd          Decimal_Number (also Digit)
       Nl          Letter_Number
       No          Other_Number

       P           Punctuation
       Pc          Connector_Punctuation
       Pd          Dash_Punctuation
       Ps          Open_Punctuation
       Pe          Close_Punctuation
       Pi          Initial_Punctuation (may behave like Ps or Pe depending on usage)
       Pf          Final_Punctuation (may behave like Ps or Pe depending on usage)
       Po          Other_Punctuation

       S           Symbol
       Sm          Math_Symbol
       Sc          Currency_Symbol
       Sk          Modifier_Symbol
       So          Other_Symbol

       Z           Separator
       Zs          Space_Separator
       Zl          Line_Separator
       Zp          Paragraph_Separator

       C           Other (means not L/N/P/S/Z)
       Cc          Control (also Cntrl)
       Cf          Format
       Cs          Surrogate   (not usable)
       Co          Private_Use
       Cn          Unassigned

End_of_Property_List

            my($short_prop, $long_prop) = $line =~ m{ 
                \b 
                 ( \p{Lu}  \p{Ll}   ? ) 
                \s + 
                 ( \p{Lu} [\p{L&}_] + )
                \b
            }x;

            $Prop_Table{$short_prop} = $long_prop;

    }

}

For example:

$ unicats book.txt
2357232 Total
2357199 ASCII
     33 Unicode

1604949  \pL   Letter
  74455 \p{Lu}   Uppercase_Letter
1530485 \p{Ll}   Lowercase_Letter
      9 \p{Lo}   Other_Letter
  10676  \pN   Number
  10676 \p{Nd}   Decimal_Number
  19679  \pS   Symbol
  10705 \p{Sm}   Math_Symbol
   8365 \p{Sc}   Currency_Symbol
    603 \p{Sk}   Modifier_Symbol
      6 \p{So}   Other_Symbol
 111899  \pP   Punctuation
   2996 \p{Pc}   Connector_Punctuation
   6145 \p{Pd}   Dash_Punctuation
  11392 \p{Ps}   Open_Punctuation
  11371 \p{Pe}   Close_Punctuation
  79995 \p{Po}   Other_Punctuation
 548529  \pZ   Separator
 548529 \p{Zs}   Space_Separator
  61500  \pC   Other
  61500 \p{Cc}   Control
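
For comparison, the same kind of tally can be sketched in a few lines of Python with the standard `unicodedata` module. This is only an approximation of what unicats does, not a port of it: the function name `category_counts` is an assumption for this example, and the input is assumed to be an already decoded `str`.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Tally characters by two-letter general category and by major class.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for ch in text:
        cat = unicodedata.category(ch)  # e.g. "Lu", "Nd", "Zs", "Cc"
        counts[cat] += 1                # subcategory, like \p{Lu}
        counts[cat[0]] += 1             # major class, like \pL
    return counts


if __name__ == "__main__":
    sample = "Hello, World 42!\n"
    for cat, n in sorted(category_counts(sample).items()):
        print(f"{n:6d} {cat}")
```

Unusual categories with small counts (for example `So` or `Co` in a file that should be plain prose) are a reasonable place to start looking for OCR errors.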
眸中客 2024-10-08 18:45:46

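As far as *nix commands go, the answer above is good, but it doesn't get usage stats.

However, if you actually want stats on the file (such as the rarest, median, and most used characters), this Python should give you the counts to start from.

def get_char_counts(fname):
    # Map each character in the file to the number of times it occurs.
    usage = {}
    # Python 3; UTF-8 is an assumption here, so adjust the encoding if your file differs.
    with open(fname, encoding="utf-8") as f:
        for c in f.read():
            usage[c] = usage.get(c, 0) + 1
    return usage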
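
The statistics mentioned above still have to be derived from those counts. A minimal sketch of one way to do that follows; the helper name `usage_stats` and the shape of its result are assumptions made for this example, and it expects a non-empty mapping of character to count.

```python
import statistics
from collections import Counter


def usage_stats(counts):
    """Summarize a non-empty {character: count} mapping.

    Hypothetical helper for illustration only.
    """
    # Sort (character, count) pairs from least to most frequent.
    ranked = sorted(counts.items(), key=lambda item: item[1])
    return {
        "rarest": ranked[0],
        "most_used": ranked[-1],
        "median_count": statistics.median(counts.values()),
    }


if __name__ == "__main__":
    print(usage_stats(Counter("aaabbc")))
```

For spotting OCR errors, the low end of the ranking is usually the interesting part, so printing the first few entries of the sorted list may be more useful than a single rarest character.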