忽略字符串中的大小写、标点符号和空格

发布于 2024-08-19 16:49:24 字数 3318 浏览 7 评论 0原文

忽略字符串中的大小写、标点符号和空格的最有效方法是什么?这些字符串应该被划分为单词而不是字符,应该忽略前面提到的比较细节,并且这些单词字符串的切片应该尽可能高效并考虑到速度。

我本来打算在下面的代码中使用不区分大小写和标点符号的字符串,但是在看到评估 class Slice: def __eq__(self, other): return self.root == other.rootclass Slice: def __eq__(self, other): return self.root == other.root,我决定使用 data = tuple(string.split()) 来代替。对于下面代码中已经表达的计算量大的算法来说,使用对大小写、标点符号和空格不敏感并且处理单词而不是字符的字符串过于昂贵。

class Slice:

    def __init__(self, data, offset, length):
        self.prefix = data[:offset]
        self.root = data[offset:offset+length]
        self.suffix = data[offset+length:]

    def __eq__(self, other):
        return self.root == other.root

    def __len__(self):
        return len(self.root)

################################################################################

class Match:

    def __init__(self, data, key, prefix_tree, suffix_tree):
        self.data = data
        self.key = key
        self.prefix_tree = prefix_tree
        self.suffix_tree = suffix_tree
        self.__value = len(key) + prefix_tree.value() + suffix_tree.value()

    def value(self):
        return self.__value

################################################################################

class Tree(tuple):

    def __new__(cls, nodes):
        tree = super().__new__(cls, nodes)
        tree.__value = max(map(Match.value, tree)) if tree else 0
        return tree

    def value(self):
        return self.__value

    def find(self, value):
        for index, match in enumerate(self):
            if match.value() == value:
                return index
        raise ValueError()

################################################################################

def search(data, key):
    length = 0
    nodes = []
    for d_block in shrink(data, len(key)):
        block_len = len(d_block)
        if length > block_len:
            return Tree(nodes)
        for k_block in slide(key, block_len):
            if d_block == k_block:
                length = block_len
                prefix_tree = search(d_block.prefix, k_block.prefix)
                suffix_tree = search(d_block.suffix, k_block.suffix)
                match = Match(d_block, k_block, prefix_tree, suffix_tree)
                nodes.append(match)
    return Tree(nodes)

def shrink(data, max_len):
    for length in range(min(len(data), max_len), 0, -1):
        for block in slide(data, length):
            yield block

def slide(data, length):
    for offset in range(len(data) - length + 1):
        yield Slice(data, offset, length)

################################################################################

def build_tree(nodes):
    match = nodes[nodes.find(nodes.value())]
    node = match.key
    if match.prefix_tree:
        node.prefix = build_tree(match.prefix_tree)
    if match.suffix_tree:
        node.suffix = build_tree(match.suffix_tree)
    return node

def flatten_tree(node):
    array = [0]
    _flatten(node, array)
    return tuple(array)

def _flatten(node, array):
    if isinstance(node.prefix, Slice):
        _flatten(node.prefix, array)
    else:
        array.append(node.prefix)
    array[0] += 1
    array.append((array[0], node.root))
    if isinstance(node.suffix, Slice):
        _flatten(node.suffix, array)
    else:
        array.append(node.suffix)

What is the most efficient way of ignoring case, punctuation, and whitespace in strings? These strings should be divided into words instead of characters should ignore the aforementioned details on comparisons, and slices of these word-strings should be as efficient as possible with speed in mind.

I was going to use case and punctuation insensitive strings for the following code, but after seeing how long it would take to evaluate class Slice: def __eq__(self, other): return self.root == other.root, I have decided to work with data = tuple(string.split()) instead. Having strings that are insensitive to case, punctuation, and spacing and that work over words instead of characters was too expensive into the computationally expensive algorithms already expressed in the code below.

class Slice:

    def __init__(self, data, offset, length):
        self.prefix = data[:offset]
        self.root = data[offset:offset+length]
        self.suffix = data[offset+length:]

    def __eq__(self, other):
        return self.root == other.root

    def __len__(self):
        return len(self.root)

################################################################################

class Match:

    def __init__(self, data, key, prefix_tree, suffix_tree):
        self.data = data
        self.key = key
        self.prefix_tree = prefix_tree
        self.suffix_tree = suffix_tree
        self.__value = len(key) + prefix_tree.value() + suffix_tree.value()

    def value(self):
        return self.__value

################################################################################

class Tree(tuple):

    def __new__(cls, nodes):
        tree = super().__new__(cls, nodes)
        tree.__value = max(map(Match.value, tree)) if tree else 0
        return tree

    def value(self):
        return self.__value

    def find(self, value):
        for index, match in enumerate(self):
            if match.value() == value:
                return index
        raise ValueError()

################################################################################

def search(data, key):
    length = 0
    nodes = []
    for d_block in shrink(data, len(key)):
        block_len = len(d_block)
        if length > block_len:
            return Tree(nodes)
        for k_block in slide(key, block_len):
            if d_block == k_block:
                length = block_len
                prefix_tree = search(d_block.prefix, k_block.prefix)
                suffix_tree = search(d_block.suffix, k_block.suffix)
                match = Match(d_block, k_block, prefix_tree, suffix_tree)
                nodes.append(match)
    return Tree(nodes)

def shrink(data, max_len):
    for length in range(min(len(data), max_len), 0, -1):
        for block in slide(data, length):
            yield block

def slide(data, length):
    for offset in range(len(data) - length + 1):
        yield Slice(data, offset, length)

################################################################################

def build_tree(nodes):
    match = nodes[nodes.find(nodes.value())]
    node = match.key
    if match.prefix_tree:
        node.prefix = build_tree(match.prefix_tree)
    if match.suffix_tree:
        node.suffix = build_tree(match.suffix_tree)
    return node

def flatten_tree(node):
    array = [0]
    _flatten(node, array)
    return tuple(array)

def _flatten(node, array):
    if isinstance(node.prefix, Slice):
        _flatten(node.prefix, array)
    else:
        array.append(node.prefix)
    array[0] += 1
    array.append((array[0], node.root))
    if isinstance(node.suffix, Slice):
        _flatten(node.suffix, array)
    else:
        array.append(node.suffix)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

掀纱窥君容 2024-08-26 16:49:24

“解决这个问题的最佳方法是什么?”

最好的也是唯一的方法是定义这个对象的“含义”以及该对象的长度“含义”。

该对象似乎是一个单词列表。而已。这似乎是 _string 中的值。

目前尚不清楚 _simple 是什么,只是 _string 中无法访问的已过滤单词子集。

那么长度是多少?单词的长度还是过滤子集中单词的长度?

只有您可以定义此类含义。然后,含义将决定如何实现__len__。在定义含义之前,不可能确定应如何实现任何内容。

"What is the best way to go about fixing this problem?"

The best -- and only -- way is to define what this object "means" and what the length of this object "means".

The object appears to be a list of words. Nothing more. That seems to be the value in _string.

It's not clear what _simple is, other than an inaccessible filtered subset of the words in _string.

So what's the length? The length of the words or the length of the words in the filtered subset?

Only you can define what this class means. The meaning will then determine how to implement __len__. Until you define the meaning, it's impossible to determine how anything should be implemented.

眉黛浅 2024-08-26 16:49:24

如果您希望对 String 实例进行迭代以对其 self.__string 进行迭代,正如您的 __iter__ 方法所指示的那样,长度的唯一明智选择也是返回 self.__string 的长度code>__string -- 如果 len(x)sum(1 for _ in x) 结果,这将是真正特殊的在不同的值。

我必须承认我不明白这个类的目的(特别是为什么你做出了旧式的可怕选择,以及为什么你使用这样一种扭曲的方式来构建__simple) ,但无论如何,内部一致性很重要。因此,要么更改 __iter__,要么使 __len__ 与其逻辑兼容。

你的切片逻辑也完全让我无法理解——为什么你构建切片的 __simple 的方式可能与从切片的 __string 重建切片的方式不同>?例如,如果 self.__string 是 '?Boh!'因此 self.__simple 是 'boh',为什么你想要 self[1:-1] 有一个 __string< /code> 为“Boh”,但 __simple 为“o”,因此与通过切片重新计算得到的 __simple 不兼容、不同且不一致...?

我想这与这个关于长度的问题没有密切关系,但我只是对你所做的这些非常奇特的设计选择感到好奇......

If you want iteration on a String instance to iterate on its self.__string, as your __iter__ method indicates, the only sensible choice for length is also to return the length of __string -- it would be truly peculiar if len(x) and sum(1 for _ in x) resulted in different values.

I have to admit I don't understand the purpose of this class (and in particular why you made the terrible choice of having it old-style, and why you use such a contorted way to build __simple), but internal consistency is important anyway. So, either change __iter__, or make __len__ logically compatible with it.

Your slicing logic also totally escapes me -- why are you building the slice's __simple in a way that's likely to be different from what you'd get by rebuilding it from the slice's __string? E.g., if self.__string is '?Boh!' and therefore self.__simple is 'boh', why would you want self[1:-1] to have a __string of 'Boh' but with a __simple of 'o', so incompatible, different, and inconsistent from the __simple you'd get by recomputing it from the slice...?

I guess that's not germane to this Q about length, but I'm just curious about these many, extremely peculiar design choices that you're making...

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文