频率分析算法

发布于 2024-08-12 20:33:29 字数 190 浏览 9 评论 0原文

我想编写一个java程序来搜索密文并返回密文中字符的频率计数,例如密文: “jshddllpkeldldwgbdpked”的结果如下:

出现 2 个字母:

pk = 2, ke = 2, ld = 2

出现 3 个字母:

pke = 2。

有什么算法可以让我尽可能高效地完成此操作?

I want to write a java program that searches through a cipher text and returns a frequency count of the characters in the cipher, for example the cipher:
"jshddllpkeldldwgbdpked" will have a result like this:

2 letter occurrences:

pk = 2, ke = 2, ld = 2

3 letter occurrences:

pke = 2.

Any algorithm that allows me to do this as efficiently as possible?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

清风疏影 2024-08-19 20:33:29

映射策略是一个很好的策略,但我会选择 HashMap,因为它是计算字符元组。

迭代密文中的字符,您可以保存最后 X 个字符,这将为您提供所有出现的长度为 X+1 的子字符串的映射。

The map strategy is a good one, but I'd go for HashMap<String, Integer>, since it's tuples of characters being counted.

Iterating over the characters in the ciphertext, you can save the last X characters and that will give you a map over all occurrences of substrings of length X+1.

季末如歌 2024-08-19 20:33:29

通常的方法是使用某种地图将角色映射到其计数。例如,您可以使用 HashMap。然后,您可以按字符迭代密文,并将字符放入映射中,计数为 1(如果尚不存在)或增加其计数。

The usual approach would be to use some kind of map to map your characters to their counts. You can use a HashMap<Character, Integer> for example. You can then iterate through your ciphertext, character-wise and either put the character into the map with a count of 1 (if it doesn't yet exist) or increment its count.

心清如水 2024-08-19 20:33:29

您可以将 n-grams 存储在 trie,反转正常顺序,使 n 元语法中的最后一个字符位于 trie 的顶部。特里树中的每个节点都存储一个字符计数。循环字符串,跟踪最后 N 个字符(如 Buhb 建议)。每次通过外循环时,您都会遍历 trie,使用最后 N 个字符来选择路径,从​​最后一个字符开始,到第 Nth 为止。对于您访问的每个节点,递增其计数器。

要打印 n 元语法频率,请对 trie 执行广度优先遍历。

整体表现留作练习。

You could store the n-grams in a trie, reversing the normal order so the last character in the n-gram is at the top of the trie. Each node in the trie stores a character count. Loop over the string, keeping track of the last N characters (as Buhb suggests). Each time through the outer loop, you traverse the trie, using the last N characters to pick the path, starting with the last character and ending with the Nth to last. For each node you visit, incrementing its counter.

To print the n-gram frequencies, perform a breadth-first traversal of the trie.

Overall performance left as an exercise.

千仐 2024-08-19 20:33:29

要么使用一个数组,其中每个可能的值都有一个单元格(如果密文全部是小写字符,则很容易 - 26 - 如果不是,则更难),或者选择一个 Map,在其中传递字符并在任一情况下递增值。阵列速度更快,但灵活性较差。

Either have an array with a cell for each possible value (easy if the cipher text is all lower case characters - 26 - harder if not) or go for a Map where you pass in the character and increment the value in either case. The array is quicker but less flexible.

迟到的我 2024-08-19 20:33:29

如果您需要的序列长度集是固定的,则明显的算法需要线性数量的计数操作(例如,在哈希表中查找计数器并递增它)。

当您说“尽可能高效”时,您是否打算花费大量精力来进行微薄的常数因子改进,绝望地搜索次线性算法,或者您根本不理解算法复杂性类别?

If the set of lengths of sequences you need is fixed, the obvious algorithm takes a linear number of counting operations (say, looking up a counter in a hashtable and incrementing it).

When you say "as efficiently as possible", do you propose to spend a lot of effort for a meagre constant-factor improvement, to search hopelessly for a sublinear algorithm, or do you not understand algorithm complexity classes at all?

2024-08-19 20:33:29

您可以使用哈希或图(感谢outis,我现在知道它的特殊名称,这种图称为“trie”)。哈希会更慢,图会更快。在糟糕的实现中,散列将获得更少的内存,图将占用更多的内存。

您无法使用数组来完成此操作,因为如果最大字符序列长度等于文本长度并且文本足够长,它将获得大量内存。如果你限制它,它将得到类似 ([字母数量]^[最大序列长度])*4 字节,这将是 (52^4)*4 ~= 24Mb< /code> 内存 4 个小/大写字母序列。如果有限的序列长度对你来说是可以的,并且这个内存量是正常的,那么对于<=4个字母的序列来说,算法将非常容易。

You can use hash or graph (Thanks to outis, I know it's special name now, such kind of graphs is called "trie"). Hash will be slower, graph will be faster. Hash will get less memory, graph will take more in bad implementation.

You cannot get it done using array since it will get HUGE amount of memory if your maximum char sequence length is equal to your text length, and text is long enough. If you will limit it it will get smth like ([number of letters]^[max sequence length])*4 bytes which will be (52^4)*4 ~= 24Mb of memory for 4 lower/upper letter sequence. If limited sequence length is ok for you and this memory amount is normal than algorithm will be pretty easy for <=4 letters in sequence.

梦里寻她 2024-08-19 20:33:29

您可以首先寻找最大的可能的可重复序列,然后从那里开始。例如,如果字符串有 10 个字符,则可能出现的最大可重复序列将是 5 个字母长,因此首先查找 5 个字母序列,然后查找 4 个字母,依此类推,直到达到 2 个。这应该会减少程序中的迭代次数。

You could start by looking for the largest possible repeatable sequence first then work your way down from there. For example if the string is 10 characters the largest repeatable sequence that could occur would be 5 letters long so first look for 5 letter sequences then 4 letters and so on till you reach 2. This should reduce the number of iterations in your program.

暖树树初阳… 2024-08-19 20:33:29

我对此没有答案,

但我觉得,这个算法与压缩算法使用字典方法创建压缩文件所使用的算法完全相同。

如果我没记错的话,在这种方法中,字典按以下方式使用:

data:

abccccabaccabcaaaaabcaaabbbbbccccaaabcbbbbabbabab

parse 1 :
钥匙: *
值:abc

新数据:

*cccabacc*aaaa*aaabbbbbccccaa*bbbbabbabab

只是一个有根据的猜测,我认为(此处不确定)标准“zip”文件使用这种方法,
所以我建议你看看这些算法

I dont have an answer in mind for this,

But I feel, this algorithm is the exact same as the algorithm used by compression algorithms to create compressed files with the dictionary approach.

If I am not wrong, in this approach, a dictionary is used in the following manner:

data:

abccccabaccabcaaaaabcaaabbbbbccccaaabcbbbbabbabab

parse 1 :
key: *
value: abc

new data:

*cccabacc*aaaa*aaabbbbbccccaa*bbbbabbabab

Just an educated guess, I think (not sure here) the standard "zip" file uses this approach,
so I suggest you look at these algorithms

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文