SequenceMatcher for multiple inputs, not just two?

Posted 2024-08-27 13:56:41

I'm wondering about the best way to approach this particular problem and whether any libraries already handle it (Python preferably, but I can be flexible if need be).

I have a file with a string on each line. I would like to find the longest common patterns and their locations in each line. I know that I can use SequenceMatcher to compare lines one and two, one and three, and so on, and then correlate the results, but is there something that already does this?

Ideally these matches could appear anywhere on each line, but for starters I would be fine with them existing at the same offset in each line and going from there. Something like a compression library with a good API for accessing its string table might be ideal, but I have not found anything so far that fits that description.

For instance with these lines:

\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b
\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed
\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e

I would want to see that offsets 0-1 and 10-12 match in all lines at the same position, and that line1[4,5] matches line2[5,6], which matches line3[7,8].

Thanks,

琉璃繁缕 2024-09-03 13:56:41

If all you want is to find common substrings that are at the same offset in each line, all you need is something like this:

# s1, s2, s3 are the three lines to compare
matches = []
# zip() returns an iterator in Python 3, so build a list to allow len() below
zipped_strings = list(zip(s1, s2, s3))
startpos = -1
for i, (c1, c2, c3) in enumerate(zipped_strings):
  # if you're not inside a match,
  #  look for matching characters and save the match start position
  if startpos == -1 and c1 == c2 == c3:
    startpos = i
  # if you are inside a match,
  #  look for non-matching characters, save the match to matches, reset startpos
  elif startpos > -1 and not c1 == c2 == c3:
    matches.append((startpos, i, s1[startpos:i]))
    # matches will contain (startpos, endpos, matchstring) tuples,
    #  where endpos is one past the last matching character
    startpos = -1
# if you're still inside a match when you run out of string, save that match too!
if startpos > -1:
  endpos = len(zipped_strings)
  matches.append((startpos, endpos, s1[startpos:endpos]))
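Since the question is about a file with many lines rather than exactly three, the same-offset scan can be generalized. The following is a minimal sketch, not a definitive implementation: the function name same_offset_matches is made up for illustration, and it assumes the lines are already loaded as bytes (or str) objects. It stops at the length of the shortest line.

```python
def same_offset_matches(lines):
    """Return (start, end, substring) for runs where every line agrees.

    `same_offset_matches` is a hypothetical helper name; `end` is exclusive.
    """
    matches = []
    start = -1
    # zip(*lines) yields one tuple per column and stops at the shortest line.
    columns = list(zip(*lines))
    for i, column in enumerate(columns):
        all_equal = len(set(column)) == 1
        if all_equal and start == -1:
            start = i  # a new run of agreement begins here
        elif not all_equal and start != -1:
            matches.append((start, i, lines[0][start:i]))
            start = -1
    # A run that reaches the end of the shortest line still counts.
    if start != -1:
        matches.append((start, len(columns), lines[0][start:len(columns)]))
    return matches


# The three sample lines from the question, written as bytes literals.
lines = [
    b"\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b",
    b"\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed",
    b"\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e",
]
result = same_offset_matches(lines)
print(result)
```

For the sample lines this reports the runs at offsets 0-1 and 10-12 (as half-open ranges 0:2 and 10:13), which is what the question asks for in the same-offset case.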

To find the longest common pattern regardless of location, SequenceMatcher does sound like the best idea. But instead of comparing string1 to string2, then string1 to string3, and trying to merge the results, get the matching blocks of string1 and string2 (with get_matching_blocks), then compare each of those blocks to string3 to get the matches shared by all three strings.
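A rough sketch of that chaining idea, using difflib from the standard library, might look like the following. The function name common_blocks_3 is invented for illustration, and autojunk=False is my own choice to keep difflib from ignoring very frequent elements in long inputs. Note that SequenceMatcher reports only one alignment per comparison, so a block that occurs several times in the third string is reported once.

```python
from difflib import SequenceMatcher


def common_blocks_3(s1, s2, s3):
    """Return (pos1, pos2, pos3, size) for blocks found in all three inputs.

    `common_blocks_3` is a hypothetical helper name, not part of difflib.
    """
    results = []
    first = SequenceMatcher(None, s1, s2, autojunk=False)
    for a, b, size in first.get_matching_blocks():
        if size == 0:
            continue  # the last block is always a zero-length sentinel
        piece = s1[a:a + size]
        # Compare the block shared by s1 and s2 against the third input.
        second = SequenceMatcher(None, piece, s3, autojunk=False)
        for p, c, n in second.get_matching_blocks():
            if n:
                # p is the offset inside the piece, so shift it back
                # into s1 and s2 coordinates.
                results.append((a + p, b + p, c, n))
    return results


s1 = b"\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b"
s2 = b"\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed"
s3 = b"\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e"
blocks = common_blocks_3(s1, s2, s3)
print(blocks)
```

On the sample lines this finds the shared block at offset 0, the \x28\x28 pair at offsets 4, 5 and 7, and the three-byte block at offset 10. Extending it to more lines would mean repeating the second step for each additional line, and because each step keeps only the blocks SequenceMatcher chose, some shared patterns may be missed.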
