Counting common bytes, words, and double words

Posted 2024-09-02 10:33:19

I am scanning a large amount of data and looking for common trends in it. Every time I encounter a recurrence of a unit, I want to increment its count. What is the best data structure or way to hold this data? I need to be able to search it quickly and keep a count for each unit of data.

Comments (4)

感性不性感 2024-09-09 10:33:19

You didn't specify a language, but a hash (associative array) is your best data structure.

Depending on the language, it may be called a map or hashmap (Java has HashMap, Perl has hashes, etc.).

A hash/associative array/map data structure consists of a set of key-value pairs, with the values set and retrieved by key. In your case, the key will be a string representing a word, a byte, or a double word (three separate hashmaps) and the value will be the frequency count.
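Since no language was specified, here is a minimal Python sketch of the three-hashmap idea, not a definitive implementation. It assumes a "word" is 2 bytes and a "double word" is 4 bytes, that units are read at aligned offsets from the start of the data, and that trailing bytes too short to form a full unit are ignored. The function name `count_units` is made up for illustration.

```python
from collections import Counter


def count_units(data: bytes):
    """Return (byte_counts, word_counts, dword_counts) for a bytes object.

    Assumptions (not from the question): word = 2 bytes, dword = 4 bytes,
    units are taken at aligned, non-overlapping offsets, and a trailing
    partial unit is dropped.
    """
    # Iterating over bytes yields integers 0..255, one per byte.
    byte_counts = Counter(data)

    # Keys are the raw 2-byte slices, so no byte-order decision is needed.
    word_counts = Counter(
        data[i:i + 2] for i in range(0, len(data) - 1, 2)
    )

    # Same idea with 4-byte slices.
    dword_counts = Counter(
        data[i:i + 4] for i in range(0, len(data) - 3, 4)
    )
    return byte_counts, word_counts, dword_counts


if __name__ == "__main__":
    b, w, d = count_units(b"abababcd")
    print(b.most_common(2))
    print(w.most_common(2))
    print(d.most_common(2))
```

`Counter` is a dictionary subclass, so lookups and increments are average-case constant time, and `most_common(n)` gives the most frequent units directly.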

待"谢繁草 2024-09-09 10:33:19

A dictionary/hash table would be best if you need fast lookup.

ま昔日黯然 2024-09-09 10:33:19

As has been mentioned, dictionaries/hash tables are your best bet. But your question is a little unclear, and I noticed that you mentioned compression in your tags; you may also want to look at Huffman trees.
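To show how the frequency counts connect to Huffman trees, here is a rough Python sketch that turns a table of counts into Huffman code lengths using a min-heap. It only computes how many bits each symbol would get, not the actual bit patterns, and the function name `huffman_code_lengths` is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(counts):
    """Map each symbol to its Huffman code length in bits.

    `counts` is any mapping of symbol -> frequency (for example a Counter).
    """
    if len(counts) == 1:
        # A single symbol still needs one bit to be encodable.
        return {sym: 1 for sym in counts}

    # Heap entries: (weight, tie_breaker, {symbol: depth_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)

    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them moves
        # one level deeper in the tree.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in a.items()}
        merged.update({s: depth + 1 for s, depth in b.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1

    return heap[0][2] if heap else {}


if __name__ == "__main__":
    counts = Counter(b"aaaabbc")  # byte frequencies: a=4, b=2, c=1
    lengths = huffman_code_lengths(counts)
    total_bits = sum(counts[s] * lengths[s] for s in counts)
    print(lengths, total_bits)
```

More frequent symbols end up with shorter codes, which is why a frequency table is the usual first step for this kind of compression.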

何必那么矫情 2024-09-09 10:33:19

As others have noted, a hash is an obvious candidate for your data structure.

For development and testing purposes, however, I would want that structure to be richer than a simple tally for each matched item. Rather, I would want to store information that could be used to confirm the correctness of the code.

For starters, that information might contain the line number and some indication of the position where the match occurred. Here is an illustration in Perl:

use strict;
use warnings;

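# Patterns for the kinds of repetition to look for. The /x flag allows
# whitespace inside each pattern; $1 captures the repeated item.
my %regexes= (
    rep_letter => qr/ ([a-z])         (\1   )+ /x,
    rep_word   => qr/ (\b \w+ \b) \W* (\1\W*)+ /x,
    doub_word  => qr/ (\b \w+   ) \W+  \1      /x,
);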

my %ds;

while (my $line = <>){
    for my $r (keys %regexes){
        while ( $line =~ /$regexes{$r}/g ){
            # Data structure:
            #   $ds{REGEX_TYPE}{REPEATED_ITEM} = [
            #       [LINE_NO, pos_VALUE_OF_MATCH],
            #       etc. for each match
            #   ]
            #
            # For example:
            #   $ds{rep_word}{foo} = [
            #       [ 3, 11],
            #       [12, 88],
            #       ...
            #   ]
            push @{$ds{$r}{$1}}, [$., pos($line)];
        }
    }
}
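
The same "store positions, not just tallies" idea can be applied to the fixed-size units from the question. The following is a small Python sketch under assumptions of my own: units are non-overlapping, aligned slices of `size` bytes, and the helper name `index_units` is made up for illustration.

```python
from collections import defaultdict


def index_units(data: bytes, size: int):
    """Map each `size`-byte unit to the list of offsets where it starts.

    Assumes size >= 1 and aligned, non-overlapping units; a trailing
    partial unit is ignored.
    """
    positions = defaultdict(list)
    for offset in range(0, len(data) - size + 1, size):
        positions[data[offset:offset + size]].append(offset)
    return positions


if __name__ == "__main__":
    positions = index_units(b"abcdabxx", 2)
    # The simple tally is still available as the length of each list.
    counts = {unit: len(offsets) for unit, offsets in positions.items()}
    print(dict(positions))
    print(counts)
```

Keeping the offsets costs more memory than a bare count, so this is mainly useful while developing and checking the scanner.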