How can I find the most common substring in a file?

Posted 2025-01-21 11:05:33

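As a preface: I am trying to create my own compression method. I do not care about speed, so many iterations over large files are acceptable. I would like to know whether there is a way to get the most common substrings of length 2 or more (most likely 3), since anything longer would not be practical. I am also wondering whether this can be done without splitting the data or building a table, just by searching the string. Thanks.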


巾帼英雄 2025-01-28 11:05:33

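You probably want to use something like collections.Counter, which associates each substring with its count. A string of length len(data) has len(data) - n + 1 substrings of length n, so the range has to end there to include the final one, e.g.:

>>> import collections
>>> data = "the quick brown fox jumps over the lazy dog"
>>> # Count every 2-character window, including the last one ("og").
>>> c = collections.Counter(data[i:i+2] for i in range(len(data) - 1))
>>> max(c, key=c.get)
'th'
>>> # Same idea for 3-character windows.
>>> c = collections.Counter(data[i:i+3] for i in range(len(data) - 2))
>>> max(c, key=c.get)
'the'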
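
Since the question is about files and a range of lengths, the same idea can be wrapped up as a rough sketch. The function names `most_common_substrings` and `most_common_in_file`, their parameters, and the choice to read the whole file into memory as bytes are illustrative assumptions, not something the question or the snippet above specifies:

```python
from collections import Counter


def most_common_substrings(data: bytes, min_len: int = 2, max_len: int = 3, top: int = 5):
    """Map each length n in [min_len, max_len] to its `top` most frequent n-byte substrings."""
    results = {}
    for n in range(min_len, max_len + 1):
        # A sequence of length L has L - n + 1 windows of length n.
        # Overlapping occurrences are counted separately.
        counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
        results[n] = counts.most_common(top)
    return results


def most_common_in_file(path: str, **kwargs):
    """Hypothetical helper: read the whole file as bytes and rank its substrings."""
    with open(path, "rb") as f:
        return most_common_substrings(f.read(), **kwargs)


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog"
    for n, ranked in most_common_substrings(sample, top=3).items():
        print(n, ranked)
```

This still builds a table (the Counter), which is usually the simplest way to rank every candidate in one pass. For a compression scheme, keep in mind that overlapping windows are counted individually, so the counts are an upper bound on how many non-overlapping replacements are possible.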
