Python string pattern recognition/compression
I can do basic regex alright, but this is slightly different: I don't know beforehand what the pattern is going to be.
For example, I have a list of similar strings:
lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
In this case the common pattern is two segments of common text, 'sometxt' and 'moretxt', preceded and separated by something else that is variable in length. The common strings and the variable strings can of course occur in any order and any number of times.
What would be a good way to condense/compress the list of strings into their common parts and individual variations? An example output might be:
c = ['sometxt', 'moretxt']
v = [('a','0'), ('b','1'), ('aa','10'), ('zz','999')]
6 Answers
This solution finds the two longest common substrings and uses them to delimit the input strings: find_delimiters and friends find the delimiters, and split_strings splits the strings using the delimiters.
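The answer's code itself isn't preserved on this page, so the following is only a sketch of the approach described: the names find_delimiters and split_strings come from the answer, while the naive longest-common-substring search and the sentinel masking are assumptions.

import re

def longest_common_substring(strings):
    # Naive search: try every substring of the shortest string,
    # longest first, and return the first one found in all strings.
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ''

def find_delimiters(strings):
    # The two delimiters are taken to be the two longest common
    # substrings, returned in order of appearance.
    first = longest_common_substring(strings)
    # Mask out the first delimiter so the second search cannot
    # rediscover it or one of its substrings.
    masked = [s.replace(first, '\x00') for s in strings]
    second = longest_common_substring(masked)
    if strings[0].index(first) < strings[0].index(second):
        return [first, second]
    return [second, first]

def split_strings(strings, delimiters):
    # Split each string on the delimiters, keeping the variable parts.
    pattern = '|'.join(re.escape(d) for d in delimiters)
    return [tuple(part for part in re.split(pattern, s) if part)
            for s in strings]

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
c = find_delimiters(lst)   # ['sometxt', 'moretxt']
v = split_strings(lst, c)  # [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]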
Here is a scary one to get the ball rolling.
I hope its sheer ugliness will inspire better solutions :)
This seems to be an example of the longest common subsequence problem. One way could be to look at how diffs are generated. The Hunt-McIlroy algorithm seems to have been the first, and as such the simplest, especially since it apparently is non-heuristic.
The first link contains a detailed discussion and (pseudo)code examples. Assuming, of course, I'm not completely off the track here.
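Hunt-McIlroy itself is more involved, but as a minimal illustration of the underlying longest-common-subsequence idea, here is a plain dynamic-programming LCS (a generic textbook version, not the answer's algorithm):

def lcs(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming table:
    # table[i][j] is the LCS length of a[:i] and b[:j].
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))

For two of the example strings, lcs('asometxt0moretxt', 'bsometxt1moretxt') returns 'sometxtmoretxt': the common skeleton with the variable parts dropped. The standard library's difflib.SequenceMatcher (see its get_matching_blocks() method) provides a similar decomposition without hand-rolling the table.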
This looks much like the LZW algorithm for data (text) compression. There should be Python implementations out there which you may be able to adapt to your needs.
I assume you have no a priori knowledge of these substrings that repeat often.
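For reference, a textbook LZW encoder looks roughly like this (a generic sketch, not tuned to the question):

def lzw_compress(text):
    # Start with one code per single character (here: byte values).
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w, codes = '', []
    for ch in text:
        wc = w + ch
        if wc in dictionary:
            w = wc  # keep extending the current match
        else:
            codes.append(dictionary[w])   # emit the longest known prefix
            dictionary[wc] = next_code    # learn the new sequence
            next_code += 1
            w = ch
    if w:
        codes.append(dictionary[w])
    return codes

The resemblance to the question is that repeated segments such as 'sometxt' become single dictionary entries after their first occurrence, although LZW yields integer codes rather than the common/variable split asked for.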
I guess you should start by identifying substrings (patterns) that frequently occur in the strings. Since naively counting substrings in a set of strings is rather computationally expensive, you'll need to come up with something smart.
I've done substring counting on a large amount of data using generalized suffix trees (example here). Once you know the most frequent substrings/patterns in the data, you can take it from there.
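A generalized suffix tree is beyond a quick sketch, but the naive counting the answer warns about looks like this, which is fine for a handful of short strings like the example (the min_len cutoff is an arbitrary choice):

from collections import Counter

def substring_counts(strings, min_len=3):
    # Brute force: enumerate every substring of every string.
    # Quadratic per string, hence the suggestion to use a
    # generalized suffix tree on large data instead.
    counts = Counter()
    for s in strings:
        seen = set()  # count each distinct substring once per string
        for i in range(len(s)):
            for j in range(i + min_len, len(s) + 1):
                seen.add(s[i:j])
        counts.update(seen)
    return counts

On the example list, 'sometxt' and 'moretxt' are the longest substrings whose count equals the number of input strings, which is exactly the "take it from there" starting point.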
How about subbing out the known text, and then splitting?
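The answer's code isn't shown; assuming the common text really is known up front, a sub-then-split sketch might be (the sentinel character is an assumption):

import re

lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
known = ['sometxt', 'moretxt']
pattern = '|'.join(re.escape(k) for k in known)
# Sub out every known segment with a sentinel, then split on it.
v = [tuple(re.sub(pattern, '\x00', s).strip('\x00').split('\x00')) for s in lst]
# v == [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]

The obvious caveat is that this presumes the common parts are already known, which is exactly what the question says is missing.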