Python string pattern recognition/compression
I can do basic regex alright, but this is slightly different: I don't know beforehand what the pattern is going to be.
For example, I have a list of similar strings:
lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
In this case the common pattern is two segments of common text, 'sometxt' and 'moretxt', preceded and separated by something else that is variable in length. The common strings and the variable strings can of course occur in any order and any number of times.
What would be a good way to condense/compress the list of strings into their common parts and individual variations? An example output might be:
c = ['sometxt', 'moretxt']
v = [('a','0'), ('b','1'), ('aa','10'), ('zz','999')]
6 Answers
This solution finds the two longest common substrings and uses them to delimit the input strings: find_delimiters and friends find the delimiters, and split_strings splits the strings using the delimiters.
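The answer's code itself isn't preserved on this page, so the following is only a sketch of the approach described: the names find_delimiters and split_strings come from the answer, while the naive longest-common-substring search and the sentinel masking are assumptions.

import re

def longest_common_substring(strings):
    # Naive search: try every substring of the shortest string,
    # longest first, and return the first one found in all strings.
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ''

def find_delimiters(strings):
    # The two delimiters are taken to be the two longest common
    # substrings, returned in order of appearance.
    first = longest_common_substring(strings)
    # Mask out the first delimiter so the second search cannot
    # rediscover it or one of its substrings.
    masked = [s.replace(first, '\x00') for s in strings]
    second = longest_common_substring(masked)
    if strings[0].index(first) < strings[0].index(second):
        return [first, second]
    return [second, first]

def split_strings(strings, delimiters):
    # Split each string on the delimiters, keeping the variable parts.
    pattern = '|'.join(re.escape(d) for d in delimiters)
    return [tuple(part for part in re.split(pattern, s) if part)
            for s in strings]

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
c = find_delimiters(lst)   # ['sometxt', 'moretxt']
v = split_strings(lst, c)  # [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]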
Here is a scary one to get the ball rolling.
I hope its sheer ugliness will inspire better solutions :)
This seems to be an example of the longest common subsequence problem. One way could be to look at how diffs are generated. The Hunt-McIlroy algorithm seems to have been the first, and as such the simplest, especially since it apparently is non-heuristic.
The first link contains a detailed discussion and (pseudo)code examples. Assuming, of course, I'm not completely off the track here.
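Hunt-McIlroy itself is more involved, but as a minimal illustration of the underlying longest-common-subsequence idea, here is a plain dynamic-programming LCS (a generic textbook version, not the answer's algorithm):

def lcs(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming table:
    # table[i][j] is the LCS length of a[:i] and b[:j].
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))

For two of the example strings, lcs('asometxt0moretxt', 'bsometxt1moretxt') returns 'sometxtmoretxt': the common skeleton with the variable parts dropped. The standard library's difflib.SequenceMatcher (see its get_matching_blocks() method) provides a similar decomposition without hand-rolling the table.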
This looks much like the LZW algorithm for data (text) compression. There should be Python implementations out there which you may be able to adapt to your needs.
I assume you have no a priori knowledge of these substrings that repeat often.
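For reference, a textbook LZW encoder looks roughly like this (a generic sketch, not tuned to the question):

def lzw_compress(text):
    # Start with one code per single character (here: byte values).
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w, codes = '', []
    for ch in text:
        wc = w + ch
        if wc in dictionary:
            w = wc  # keep extending the current match
        else:
            codes.append(dictionary[w])   # emit the longest known prefix
            dictionary[wc] = next_code    # learn the new sequence
            next_code += 1
            w = ch
    if w:
        codes.append(dictionary[w])
    return codes

The resemblance to the question is that repeated segments such as 'sometxt' become single dictionary entries after their first occurrence, although LZW yields integer codes rather than the common/variable split asked for.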
I guess you should start by identifying substrings (patterns) that frequently occur in the strings. Since naively counting substrings in a set of strings is rather computationally expensive, you'll need to come up with something smart.
I've done substring counting on a large amount of data using generalized suffix trees (example here). Once you know the most frequent substrings/patterns in the data, you can take it from there.
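A generalized suffix tree is beyond a quick sketch, but the naive counting the answer warns about looks like this, which is fine for a handful of short strings like the example (the min_len cutoff is an arbitrary choice):

from collections import Counter

def substring_counts(strings, min_len=3):
    # Brute force: enumerate every substring of every string.
    # Quadratic per string, hence the suggestion to use a
    # generalized suffix tree on large data instead.
    counts = Counter()
    for s in strings:
        seen = set()  # count each distinct substring once per string
        for i in range(len(s)):
            for j in range(i + min_len, len(s) + 1):
                seen.add(s[i:j])
        counts.update(seen)
    return counts

On the example list, 'sometxt' and 'moretxt' are the longest substrings whose count equals the number of input strings, which is exactly the "take it from there" starting point.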
How about subbing out the known text, and then splitting?
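The answer's code isn't shown; assuming the common text really is known up front, a sub-then-split sketch might be (the sentinel character is an assumption):

import re

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
known = ['sometxt', 'moretxt']
pattern = '|'.join(re.escape(k) for k in known)
# Sub out every known segment with a sentinel, then split on it.
v = [tuple(re.sub(pattern, '\x00', s).strip('\x00').split('\x00')) for s in lst]
# v == [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]

The obvious caveat is that this presumes the common parts are already known, which is exactly what the question says is missing.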