Is str.replace(..).replace(..) a standard idiom in Python?

Posted 2024-08-26 10:19:43

For instance, say I wanted a function to escape a string for use in HTML (as in Django's escape filter):

    def escape(string):
        """
        Returns the given string with ampersands, quotes and angle 
        brackets encoded.
        """
        return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

This works, but it gets ugly quickly and appears to have poor algorithmic performance (in this example, the string is repeatedly traversed 5 times). What would be better is something like this:

    def escape(string):
        """
        Returns the given string with ampersands, quotes and angle 
        brackets encoded.
        """
        # Note that ampersands must be escaped first; the rest can be escaped in 
        # any order.
        return replace_multi(string.replace('&', '&amp;'),
                             {'<': '&lt;', '>': '&gt;', 
                              "'": '&#39;', '"': '&quot;'})

Does such a function exist, or is the standard Python idiom to use what I wrote before?
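
For concreteness, one shape a single-pass helper could take is a sketch built on Python 3's str.maketrans/str.translate (an illustration only, assuming Python 3; translate visits each input character exactly once, so the replacement strings are never re-scanned):

    # Sketch: single-pass escaping via str.translate (Python 3).
    # str.maketrans accepts a dict mapping single characters to strings.
    _table = str.maketrans({
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        "'": '&#39;',
        '"': '&quot;',
    })

    def escape(string):
        """Returns the given string with ampersands, quotes and angle
        brackets encoded."""
        return string.translate(_table)

    print(escape('''"Hello"<i> to George's friend&co.'''))
    # &quot;Hello&quot;&lt;i&gt; to George&#39;s friend&amp;co.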

Comments (9)

明明#如月 2024-09-02 10:19:43

Do you have an application that is running too slow and you profiled it to find that a line like this snippet is causing it to be slow? Bottlenecks occur at unexpected places.

The current snippet traverses the string 5 times, doing one thing each time. You are suggesting traversing it once, probably doing five things each time (or at least doing something each time). It isn't clear to me that this will automatically do a better job. Currently the algorithm used is O(n*m) (assuming the length of the string is longer than the stuff in the rules), where n is the length of the string and m is the number of substitution rules. You could, I think, reduce the algorithmic complexity to something like O(n*log(m)), and in the specific case we're in, where the originals are all only one character (but not in the case of multiple calls to replace in general), to O(n), but this doesn't matter since m is 5 while n is unbounded.

If m is held constant, then the complexity of both solutions really goes to O(n). It is not clear to me that it is a worthy task to try to turn five simple passes into one complex one, the actual time of which I cannot guess at the moment. If there were something about it that could make it scale better, I would consider that a much more worthwhile task.

Doing everything in one pass rather than in consecutive passes also demands that questions be answered about how conflicting rules are handled and in what order they apply. The resolution to these questions is clear with a chain of replace calls.
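
For example (a small sketch using the entities from the question), the order of a chain of replace calls is what keeps the '&' produced by earlier passes from being escaped again:

    s = '<b>'

    # Ampersand first: each original character is rewritten once, as intended.
    good = s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
    print(good)  # &lt;b&gt;

    # Ampersand last: the '&' introduced by the earlier passes is escaped again.
    bad = s.replace('<', '&lt;').replace('>', '&gt;').replace('&', '&amp;')
    print(bad)   # &amp;lt;b&amp;gt;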

維他命╮ 2024-09-02 10:19:43

How about we just test various ways of doing this and see which comes out faster (assuming we only care about the fastest way to do it).

def escape1(input):
        return input.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

translation_table = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;',
}

def escape2(input):
        return ''.join(translation_table.get(char, char) for char in input)

import re
_escape3_re = re.compile(r'[&<>\'"]')
def _escape3_repl(x):
    s = x.group(0)
    return translation_table.get(s, s)
def escape3(x):
    return _escape3_re.sub(_escape3_repl, x)

def escape4(x):
    return unicode(x).translate(translation_table)

test_strings = (
    'Nothing in there.',
    '<this is="not" a="tag" />',
    'Something & Something else',
    'This one is pretty long. ' * 50
)

import time

for test_i, test_string in enumerate(test_strings):
    print repr(test_string)
    for func in escape1, escape2, escape3, escape4:
        start_time = time.time()
        for i in xrange(1000):
            x = func(test_string)
        print '\t%s done in %.3fms' % (func.__name__, (time.time() - start_time))
    print

Running this gives you:

'Nothing in there.'
    escape1 done in 0.002ms
    escape2 done in 0.009ms
    escape3 done in 0.001ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

'This one is pretty long. <snip>'
    escape1 done in 0.008ms
    escape2 done in 0.386ms
    escape3 done in 0.011ms
    escape4 done in 0.310ms

Looks like just replacing them one after another goes the fastest.

Edit: Running the tests again with 1000000 iterations gives the following for the first three strings (the fourth would take too long on my machine for me to wait =P):

'Nothing in there.'
    escape1 done in 0.001ms
    escape2 done in 0.008ms
    escape3 done in 0.002ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

The numbers are pretty much the same. In the first case they are actually even more consistent as the direct string replacement is fastest now.
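
The same comparison can also be phrased with the timeit module, which runs the statement in a loop and picks a suitable timer itself; a sketch in the same Python 2 style as the benchmark above, assuming escape1 through escape4 and test_strings are already defined:

    import timeit

    # Hypothetical timeit-based re-run of the loop above; number=1000 mirrors
    # the original iteration count, and the result is the total for 1000 calls.
    for name in ('escape1', 'escape2', 'escape3', 'escape4'):
        seconds = timeit.timeit('%s(test_strings[3])' % name,
                                setup='from __main__ import %s, test_strings' % name,
                                number=1000)
        print '%s: %.3f ms for 1000 calls' % (name, seconds * 1000.0)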

泅人 2024-09-02 10:19:43

I prefer something clean like:

substitutions = [
    ('<', '&lt;'),
    ('>', '&gt;'),
    ...]

for search, replacement in substitutions:
    string = string.replace(search, replacement)
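
Spelled out as a complete function (a sketch; the pair list is the one from the question, with the ampersand first so already-produced entities are not escaped again):

    SUBSTITUTIONS = [
        ('&', '&amp;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ("'", '&#39;'),
        ('"', '&quot;'),
    ]

    def escape(string):
        # Apply each (search, replacement) pair in order with chained replace.
        for search, replacement in SUBSTITUTIONS:
            string = string.replace(search, replacement)
        return string
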
如日中天 2024-09-02 10:19:43

You can use reduce:

reduce(lambda s,r: s.replace(*r),
       [('&', '&amp;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ("'", '&#39;'),
        ('"', '&quot;')],
       string)
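
In Python 3, reduce has moved to functools; a self-contained sketch of the same idea:

    from functools import reduce

    def escape(string):
        # Fold the chained replaces over the pair list; ampersand first.
        return reduce(lambda s, r: s.replace(*r),
                      [('&', '&amp;'),
                       ('<', '&lt;'),
                       ('>', '&gt;'),
                       ("'", '&#39;'),
                       ('"', '&quot;')],
                      string)
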
悲歌长辞 2024-09-02 10:19:43

That's what Django does:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
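
As a side note, Python 3's standard library ships a similar helper, html.escape, which covers the same five characters (the two quote characters only when quote=True, which is the default); one difference is that it emits &#x27; rather than &#39; for single quotes. A quick sketch:

    import html

    print(html.escape('''"Hello"<i> to George's friend&co.'''))
    # &quot;Hello&quot;&lt;i&gt; to George&#x27;s friend&amp;co.
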
-黛色若梦 2024-09-02 10:19:43

In accordance with bebraw's suggestion, here is what I ended up using (in a separate module, of course):

import re

class Subs(object):
    """
    A container holding strings to be searched for and replaced in
    replace_multi().

    Holds little relation to the sandwich.
    """
    def __init__(self, needles_and_replacements):
        """
        Returns a new instance of the Subs class, given a dictionary holding 
        the keys to be searched for and the values to be used as replacements.
        """
        self.lookup = needles_and_replacements
        self.regex = re.compile('|'.join(map(re.escape,
                                             needles_and_replacements)))

def replace_multi(string, subs):
    """
    Replaces given items in string efficiently in a single-pass.

    "string" should be the string to be searched.
    "subs" can be either:
        A.) a dictionary containing as its keys the items to be
            searched for and as its values the items to be replaced.
        or B.) a pre-compiled instance of the Subs class from this module
               (which may have slightly better performance if this is
                called often).
    """
    if not isinstance(subs, Subs): # Assume dictionary if not our class.
        subs = Subs(subs)
    lookup = subs.lookup
    return subs.regex.sub(lambda match: lookup[match.group(0)], string)

Example usage:

def escape(string):
    """
    Returns the given string with ampersands, quotes and angle 
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in 
    # any order.
    escape.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape.subs)

Much better :). Thanks for the help.

Edit

Nevermind, Mike Graham was right. I benchmarked it and the replacement ends up actually being much slower.

Code:

from urllib2 import urlopen
import timeit

def escape1(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

def escape2(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in
    # any order.
    escape2.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape2.subs)

# An example test on the stackoverflow homepage.
request = urlopen('http://stackoverflow.com')
test_string = request.read()
request.close()

test1 = timeit.Timer('escape1(test_string)',
                     setup='from __main__ import escape1, test_string')
test2 = timeit.Timer('escape2(test_string)',
                     setup='from __main__ import escape2, test_string')
print 'multi-pass:', test1.timeit(2000)
print 'single-pass:', test2.timeit(2000)

Output:

multi-pass: 15.9897229671
single-pass: 66.5422530174

So much for that.

玩世 2024-09-02 10:19:43

Apparently it's pretty common to implement that via regex. You can find an example of this at ASPN and here.

吃颗糖壮壮胆 2024-09-02 10:19:43

ok so i sat down and did the math. pls do not get mad at me i answer specifically discussing ΤΖΩΤΖΙΟΥ’s solution, but this would be somewhat hard to shoehorn inside a comment, so let me do it this way. i will, in fact, also air some considerations that are relevant to the OP’s question.

first up, i have been discussing with ΤΖΩΤΖΙΟΥ the elegance, correctness, and viability of his approach. turns out it looks like the proposal, while it does use an (inherently unordered) dictionary as a register to store the substitution pairs, does in fact consistently return correct results, where i had claimed it wouldn’t. this is because the call to itertools.starmap() in line 11, below, gets as its second argument an iterator over pairs of single characters/bytes (more on that later) that looks like [ ( 'h', 'h', ), ( 'e', 'e', ), ( 'l', 'l', ), ... ]. these pairs of characters/bytes is what the first argument, replacer.get, is repeatedly called with. there is not a chance to run into a situation where first '>' is transformed into '>' and then inadvertently again into '&gt;', because each character/byte is considered only once for substitution. so this part is in principle fine and algorithmically correct.

the next question is viability, and that would include a look at performance. if a vital task gets correctly completed in 0.01s using an awkward code but 1s using awesome code, then awkward might be considered preferable in practice (but only if the 1 second loss is in fact intolerable). here is the code i used for testing; it includes a number of different implementations. it is written in python 3.1 so we can use unicode greek letters for identifiers which in itself is awesome (zip in py3k returns the same as itertools.izip in py2):

import itertools                                                                  #01
                                                                                  #02
_replacements = {                                                                 #03
  '&': '&amp;',                                                                   #04
  '<': '&lt;',                                                                    #05
  '>': '&gt;',                                                                    #06
  '"': '&quot;',                                                                  #07
  "'": '&#39;', }                                                                 #08
                                                                                  #09
def escape_ΤΖΩΤΖΙΟΥ( a_string ):                                                  #10
  return ''.join(                                                                 #11
    itertools.starmap(                                                            #12
      _replacements.get,                                                          #13
      zip( a_string, a_string ) ) )                                               #14
                                                                                  #15
def escape_SIMPLE( text ):                                                        #16
  return ''.join( _replacements.get( chr, chr ) for chr in text )                 #17
                                                                                  #18
def escape_SIMPLE_optimized( text ):                                              #19
  get = _replacements.get                                                         #20
  return ''.join( get( chr, chr ) for chr in text )                               #21
                                                                                  #22
def escape_TRADITIONAL( text ):                                                   #23
  return text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')\    #24
    .replace("'", '&#39;').replace('"', '&quot;')                                 #25

these are the timing results:

escaping with SIMPLE            took 5.74664253sec for 100000 items
escaping with SIMPLE_optimized  took 5.11457801sec for 100000 items
escaping TRADITIONAL in-situ    took 0.57543013sec for 100000 items
escaping with TRADITIONAL       took 0.62347413sec for 100000 items
escaping a la ΤΖΩΤΖΙΟΥ          took 2.66592320sec for 100000 items

turns out the original poster’s concern that the ‘traditional’ method gets ‘ugly quickly and appears to have poor algorithmic performance’ seems partially unwarranted when put into this context. it actually performs best; when stashed away into a function call, we do get to see an 8% performance penalty (‘calling methods is expensive’, but in general you should still do it). in comparison, ΤΖΩΤΖΙΟΥ’s implementation takes around 5 times as long as the traditional method, which, given its higher complexity that has to compete with python’s long-honed, optimized string methods, is no surprise.

there is yet another algorithm here, the SIMPLE one. as far as i can see, this very much does exactly what ΤΖΩΤΖΙΟΥ’s method does: it iterates over the characters/bytes in the text and performs a lookup for each, then joins all the characters/bytes together and returns the resulting escaped text. you can see that where one way to do that involves a fairly lengthy and mysterious formulation, the SIMPLE implementation is actually understandable at a glance.

what really trips me up here, though, is how badly the SIMPLE approach performs: it is around 10 times as slow as the traditional one, and also twice as slow as ΤΖΩΤΖΙΟΥ’s method. i am completely at a loss here, maybe someone can come up with an idea why this should be so. it uses only the most basic building blocks of python and works with two implicit iterations, so it avoids building throw-away lists and everything, but it is still slow, and i don’t know why.

let me conclude this code review with a remark on the merit of ΤΖΩΤΖΙΟΥ’s solution. i have made it sufficiently clear i find the code hard to read and too overblown for the task at hand. more critical than that, however, i find the way he treats characters and makes sure that for a given small range of characters they will behave in a byte-like fashion a little irritating. sure it works for the task at hand, but as soon as i iterate e.g. over the bytestring 'ΤΖΩΤΖΙΟΥ' what i do is iterate over adjacent bytes representing single characters. in most situations this is exactly what you should avoid; this is precisely the reason why in py3k ‘strings’ are now the ‘unicode objects’ of old, and the ‘strings’ of old have become ‘bytes’ and ‘bytearray’. if i were to nominate the one feature of py3k that could warrant a possibly expensive migration of code from the 2 series to the 3 series, it would be this single property of py3k. 98% of all my encoding issues have just dissolved ever since, period, and there is no clever hack that could have me seriously doubt my move. said algorithm is not ‘conceptually 8bit-clean and unicode safe’, which to me is a serious shortcoming, given this is 2010.

治碍 2024-09-02 10:19:43

If you work with non-Unicode strings and Python < 3.0, try an alternate translate method:

# Python < 3.0
import itertools

def escape(a_string):
    replacer= dict( (chr(c),chr(c)) for c in xrange(256))
    replacer.update(
        {'&': '&amp;',
         '<': '&lt;',
         '>': '&gt;',
         '"': '&quot;',
         "'": '&#39;'}
    )
    return ''.join(itertools.imap(replacer.__getitem__, a_string))

if __name__ == "__main__":
    print escape('''"Hello"<i> to George's friend&co.''')

$ python so2484156.py 
"Hello"<i> to George's friend&co.

This is closer to a "single scan" of the input string, as per your wish.

EDIT

My intention was to create a unicode.translate equivalent that was not restricted to single-character replacements, so I came up with the answer above; I got a comment by user "flow" that was almost completely out of context, with a single correct point: the code above, as is, is intended to work with byte strings and not unicode strings. There is an obvious update (i.e. unichr() … xrange(sys.maxunicode+1)) which I strongly dislike, so I came up with another function that works on both unicode and byte strings, given that Python guarantees:

all( (chr(i)==unichr(i) and hash(chr(i))==hash(unichr(i)))
    for i in xrange(128)) is True

The new function follows:

def escape(a_string):
    replacer= {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;',
    }
    return ''.join(
        itertools.starmap(
            replacer.get, # .setdefault *might* be faster
            itertools.izip(a_string, a_string)
        )
    )

Notice the use of starmap with a sequence of tuples: for any character not in the replacer dict, return said character.
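
A closely related spelling (a sketch of mine, not from the answer above): passing two iterables to map feeds replacer.get a (character, default) pair per character, so the same trick works without starmap/izip, in Python 3 as well as 2:

    replacer = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;',
    }

    def escape(a_string):
        # map() with two iterables calls replacer.get(char, char) per character,
        # falling back to the character itself when there is no replacement.
        return ''.join(map(replacer.get, a_string, a_string))

    print(escape('''"Hello"<i> to George's friend&co.'''))
    # &quot;Hello&quot;&lt;i&gt; to George&#39;s friend&amp;co.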
