
Counting common bytes, words, and double words

Posted 2024-09-02 10:33:19

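I am scanning a large amount of data looking for common trends in it. Each time I encounter a repeat of a unit, I want to increment its count. What is the best data structure or approach for holding this data? I need to be able to search it quickly, and I also need a count for each unit of data.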


感性不性感 2024-09-09 10:33:19

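You didn't specify a language, but a hash (associative array) is your best data structure.

Depending on the language, it may be called a map or hash map (Java has HashMap, Perl has hashes, etc.).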

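A hash/associative array/map data structure consists of a set of key-value pairs, where a value is looked up by its key. In your case, the key would be a string representing a byte, word, or double word (three separate hashes), and the value would be the frequency count.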
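As a rough illustration of the three-hash idea, here is a minimal Python sketch. The function name `count_units`, the `step` parameter, and the unit sizes (2-byte words, 4-byte double words) are assumptions made for the example; the question does not say how units are sized or whether they overlap.

```python
from collections import Counter


def count_units(data: bytes, width: int, step: int = 1) -> Counter:
    """Count every `width`-byte unit found in `data`.

    `step` is an assumption: 1 slides one byte at a time (overlapping
    units); pass step=width to count only aligned, non-overlapping units.
    """
    counts = Counter()
    for i in range(0, len(data) - width + 1, step):
        counts[data[i:i + width]] += 1  # bytes slices are hashable keys
    return counts


data = b"abababcd"
byte_counts = count_units(data, 1)   # single bytes
word_counts = count_units(data, 2)   # 2-byte words (assumed size)
dword_counts = count_units(data, 4)  # 4-byte double words (assumed size)

print(word_counts[b"ab"])            # 3
print(dword_counts.most_common(1))   # [(b'abab', 2)]
```

Both the increment and the lookup are average-case constant time, which is what makes a hash a good fit for "search quickly and count".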


待＂谢繁草 2024-09-09 10:33:19

If you need fast lookups, a dictionary/hash table would be best.


ま昔日黯然 2024-09-09 10:33:19

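As already mentioned, a dictionary/hash table is your best option. But your question is a bit unclear, and I noticed that you mentioned compression in your tags; you might also want to look at Huffman trees.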
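To show how frequency counts feed into a Huffman tree, here is a minimal Python sketch that derives only the code length of each symbol, not the actual bit patterns. The function name `huffman_code_lengths` is hypothetical, and nothing in the question confirms that Huffman coding is the compression actually intended.

```python
import heapq
from collections import Counter


def huffman_code_lengths(counts: Counter) -> dict:
    """Return {symbol: code length in bits} for a Huffman tree built from counts."""
    if not counts:
        return {}
    if len(counts) == 1:
        return {sym: 1 for sym in counts}  # a lone symbol still needs one bit
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


lengths = huffman_code_lengths(Counter("aaaabbc"))
print(lengths["a"], lengths["b"], lengths["c"])  # 1 2 2
```

The most frequent symbol ends up with the shortest code, which is why an accurate frequency table is the starting point for this kind of compression.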


何必那么矫情 2024-09-09 10:33:19

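As others have noted, a hash is an obvious candidate for your data structure.

For development and testing purposes, however, I would want that structure to be richer than a simple tally for each matched item. Rather, I would want to store information that could be used to confirm the correctness of the code.

For starters, that information might contain the line number and some indication of the position where the match occurred. Here is an illustration in Perl: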

use strict;
use warnings;

my %regexes= (
    rep_letter => qr/ ([a-z])         (\1   )+ /x,
    rep_word   => qr/ (\b \w+ \b) \W* (\1\W*)+ /x,
    doub_word  => qr/ (\b \w+   ) \W+  \1      /x,
);

my %ds;

while (my $line = <>){
    for my $r (keys %regexes){
        while ( $line =~ /$regexes{$r}/g ){
            # Data structure:
            #   $ds{REGEX_TYPE}{REPEATED_ITEM} = [
            #       [LINE_NO, pos_VALUE_OF_MATCH],
            #       etc. for each match
            #   ]
            #
            # For example:
            #   $ds{rep_word}{foo} = [
            #       [ 3, 11],
            #       [12, 88],
            #       ...
            #   ]
            push @{$ds{$r}{$1}}, [$., pos($line)];
        }
    }
}
