Is str.replace(..).replace(..) a standard idiom in Python?
For instance, say I wanted a function to escape a string for use in HTML (as in Django's escape filter):
def escape(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')
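This works, but it gets ugly quickly and appears to have poor algorithmic performance (in this example, the string is traversed 5 times). What would be better is something like this: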
def escape(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in
    # any order.
    return replace_multi(string.replace('&', '&amp;'),
                         {'<': '&lt;', '>': '&gt;',
                          "'": '&#39;', '"': '&quot;'})
Does such a function exist, or is the standard Python idiom to use what I wrote before?
Do you have an application that is running too slowly, and did you profile it and find that a line like this snippet is what makes it slow? Bottlenecks occur in unexpected places.
The current snippet traverses the string 5 times, doing one thing each time. You are suggesting traversing it once, probably doing five things each time (or at least doing something each time). It isn't clear to me that this will automatically do a better job. Currently the algorithm used is O(n*m) (assuming the string is longer than the stuff in the rules), where n is the length of the string and m is the number of substitution rules. You could, I think, reduce the algorithmic complexity to something like O(n*log(m)), and in the specific case we're in, where the original things are all only one character (but not in the case of multiple calls to replace in general), to O(n). But this doesn't matter, since m is 5 while n is unbounded.
If m is held constant, then, the complexity of both solutions really goes to O(n). It is not clear to me that it is a worthy task to try to turn five simple passes into one complex one, the actual running time of which I cannot guess at the moment. If there were something about it that could make it scale better, I would have thought it a much more worthwhile task.
Doing everything in one pass rather than in consecutive passes also demands that questions be answered about what to do about conflicting rules and how they are applied. The resolution to these questions is clear with a chain of replace.
How about we just test various ways of doing this and see which comes out fastest (assuming we only care about the fastest way to do it)?
Running this gives you:
Looks like just replacing them one after another is the fastest.
Edit: Running the tests again with 1000000 iterations gives the following for the first three strings (the fourth would take too long on my machine for me to wait =P):
The numbers are pretty much the same. In the first case they are actually even more consistent, as the direct string replacement is fastest now.
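A comparison of this kind can be set up with the standard timeit module. What follows is only a minimal sketch under assumptions: the names escape_chain, escape_loop, benchmark and RULES, the sample text, and the iteration count are all made up for illustration, and this is not the script behind the numbers discussed in this answer. Timings will differ from machine to machine.

```python
import timeit

# Hypothetical rule list; '&' comes first so that the ampersands introduced
# by the later replacements are not escaped a second time.
RULES = [('&', '&amp;'), ('<', '&lt;'), ('>', '&gt;'),
         ("'", '&#39;'), ('"', '&quot;')]


def escape_chain(text):
    """Chained str.replace calls, as in the question."""
    return (text.replace('&', '&amp;').replace('<', '&lt;')
                .replace('>', '&gt;').replace("'", '&#39;')
                .replace('"', '&quot;'))


def escape_loop(text):
    """The same replacements, driven by a loop over RULES."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text


def benchmark(funcs, sample, number=10000):
    """Return {function name: seconds taken for `number` calls}."""
    return {f.__name__: timeit.timeit(lambda: f(sample), number=number)
            for f in funcs}


if __name__ == '__main__':
    sample = '<a href="x">Tom & Jerry\'s</a>' * 10
    for name, seconds in benchmark([escape_chain, escape_loop], sample).items():
        print(name, round(seconds, 4))
```

Both functions must return the same string for the comparison to be meaningful, so it is worth asserting that before trusting any timing.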
I prefer something clean like:
You can use reduce:
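A minimal sketch of what a reduce-based version could look like, assuming Python 3 (where reduce lives in functools) and an ordered list of pairs. The names REPLACEMENTS and escape are illustrative and not taken from the answer:

```python
from functools import reduce

# Hypothetical, ordered rule list: '&' must come first so that later
# replacements are not escaped again.
REPLACEMENTS = [('&', '&amp;'), ('<', '&lt;'), ('>', '&gt;'),
                ("'", '&#39;'), ('"', '&quot;')]


def escape(text):
    # Fold the rule list over the text, applying one str.replace per pair.
    return reduce(lambda acc, pair: acc.replace(pair[0], pair[1]),
                  REPLACEMENTS, text)
```

This is still one pass per rule, exactly like the chained calls; it only moves the rules into data.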
That's what Django does:
In accordance with bebraw's suggestion, here is what I ended up using (in a separate module, of course):
Example usage:
Much better :). Thanks for the help.
Edit
Never mind, Mike Graham was right. I benchmarked it and the replacement actually ends up being much slower.
Code:
Output:
So much for that.
Apparently it's pretty common to implement that via regex. You can find an example of this at ASPN and here.
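A minimal sketch of the regex approach, assuming the replace_multi(text, rules) shape from the question, where rules is a dict of literal strings. This is an illustration only, not the code from either linked example:

```python
import re


def replace_multi(text, rules):
    """Replace every key of `rules` found in `text` with its value, in one pass."""
    if not rules:
        return text
    # Longer keys first, so a key that is a prefix of another cannot shadow it.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # Each match is looked up once; replacement output is never scanned again.
    return pattern.sub(lambda m: rules[m.group(0)], text)
```

Because the output is not re-scanned, '&' can sit in the same dict as the other rules without the ordering concern that applies to chained replace calls.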
ok so i sat down and did the math. pls do not get mad at me for answering specifically to discuss ΤΖΩΤΖΙΟΥ’s solution, but this would be somewhat hard to shoehorn inside a comment, so let me do it this way. i will, in fact, also air some considerations that are relevant to the OP’s question.
first up, i have been discussing with ΤΖΩΤΖΙΟΥ the elegance, correctness, and viability of his approach. turns out that the proposal, while it does use an (inherently unordered) dictionary as a register to store the substitution pairs, does in fact consistently return correct results, where i had claimed it wouldn’t. this is because the call to itertools.starmap() in line 11, below, gets as its second argument an iterator over pairs of single characters/bytes (more on that later) that looks like [ ( 'h', 'h', ), ( 'e', 'e', ), ( 'l', 'l', ), ... ]. these pairs of characters/bytes are what the first argument, replacer.get, is repeatedly called with. there is no chance of running into a situation where first '>' is transformed into '&gt;' and then inadvertently again into '&amp;gt;', because each character/byte is considered only once for substitution. so this part is in principle fine and algorithmically correct.
the next question is viability, and that would include a look at performance. if a vital task gets correctly completed in 0.01s using awkward code but in 1s using awesome code, then awkward might be considered preferable in practice (but only if the 1 second loss is in fact intolerable). here is the code i used for testing; it includes a number of different implementations. it is written in python 3.1 so we can use unicode greek letters for identifiers, which in itself is awesome (zip in py3k returns the same as itertools.izip in py2):
these are the timing results:
turns out the original poster’s concern that the ‘traditional’ method gets ‘ugly quickly and appears to have poor algorithmic performance’ appears partially unwarranted when put into this context. it actually performs best; when stashed away into a function call, we do get to see an 8% performance penalty (‘calling methods is expensive’, but in general you should still do it). in comparison, ΤΖΩΤΖΙΟΥ’s implementation takes around 5 times as long as the traditional method, which is no surprise given that its higher complexity has to compete with python’s long-honed, optimized string methods.
there is yet another algorithm here, the SIMPLE one. as far as i can see, this does pretty much exactly what ΤΖΩΤΖΙΟΥ’s method does: it iterates over the characters/bytes in the text and performs a lookup for each, then joins all the characters/bytes together and returns the resulting escaped text. you can see that where one way to do that involves a fairly lengthy and mysterious formulation, the SIMPLE implementation is actually understandable at a glance.
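A minimal sketch of a per-character lookup of this kind, assuming Python 3 str input. The names REPLACER, escape_simple and escape_wrong_order are made up for illustration and this is not the benchmarked code. The second function is there only for contrast: it shows the double escaping that a badly ordered chain of replace calls produces and that a single lookup per character avoids.

```python
# Hypothetical replacement table: single characters mapped to their entities.
REPLACER = {'&': '&amp;', '<': '&lt;', '>': '&gt;',
            "'": '&#39;', '"': '&quot;'}


def escape_simple(text, replacer=REPLACER):
    # Look each character up once; characters without a rule map to themselves.
    return ''.join(replacer.get(ch, ch) for ch in text)


def escape_wrong_order(text):
    # Chained replace with '&' handled last: the '&' produced by the first
    # step is escaped again by the second one.
    return text.replace('>', '&gt;').replace('&', '&amp;')
```

For the input '>' the lookup version returns '&gt;', while the wrongly ordered chain returns '&amp;gt;'.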
what really trips me up here, though, is how badly the SIMPLE approach performs: it is around 10 times as slow as the traditional one, and also twice as slow as ΤΖΩΤΖΙΟΥ’s method. i am completely at a loss here; maybe someone can come up with an idea why this should be so. it uses only the most basic building blocks of python and works with two implicit iterations, so it avoids building throw-away lists and everything, but it is still slow, and i don’t know why.
let me conclude this code review with a remark on the merit of ΤΖΩΤΖΙΟΥ’s solution. i have made it sufficiently clear that i find the code hard to read and too overblown for the task at hand. more critical than that, however, i find the way he treats characters, and makes sure that for a given small range of characters they will behave in a byte-like fashion, a little irritating. sure, it works for the task at hand, but as soon as i iterate e.g. over the bytestring 'ΤΖΩΤΖΙΟΥ', what i do is iterate over adjacent bytes representing single characters. in most situations this is exactly what you should avoid; this is precisely the reason why in py3k ‘strings’ are now the ‘unicode objects’ of old, and the ‘strings’ of old have become ‘bytes’ and ‘bytearray’. if i were to nominate the one feature of py3k that could warrant a possibly expensive migration of code from the 2 series to the 3 series, it would be this single property of py3k. 98% of all my encoding issues have just dissolved ever since, period, and there is no clever hack that could have me seriously doubt my move. said algorithm is not ‘conceptually 8bit-clean and unicode safe’, which to me is a serious shortcoming, given this is 2010.
If you work with non-Unicode strings and Python < 3.0, try an alternate translate method:
This is closer to a "single scan" of the input string, as per your wish.
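The answer above targets Python 2 byte strings. As a rough Python 3 counterpart (an assumption about how the idea might be ported, not the answerer's code), str.translate with a table built by str.maketrans also handles each character once, and the table may map a single character to a longer string. The names _ESCAPE_TABLE and escape are illustrative:

```python
# Built once at import time; str.maketrans accepts a dict whose keys are
# single characters and whose values may be longer strings.
_ESCAPE_TABLE = str.maketrans({'&': '&amp;', '<': '&lt;', '>': '&gt;',
                               "'": '&#39;', '"': '&quot;'})


def escape(text):
    # One scan of the input; replacement text is not scanned again.
    return text.translate(_ESCAPE_TABLE)
```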
EDIT
My intention was to create a unicode.translate equivalent that was not restricted to single-character replacements, so I came up with the answer above; I got a comment by user "flow" that was almost completely out of context, with a single correct point: the code above, as is, is intended to work with byte strings and not unicode strings. There is an obvious update (i.e. unichr() … xrange(sys.maxunicode+1)) which I strongly dislike, so I came up with another function that works on both unicode and byte strings, given that Python guarantees:
The new function follows:
Notice the use of starmap with a sequence of tuples: for any character not in the replacer dict, return said character.
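A minimal sketch of the starmap idea described above, assuming Python 3 str input (iterating over a bytes object there yields integers, so this sketch is limited to str). The function name multi_replace is illustrative and this is not the answerer's code:

```python
from itertools import starmap


def multi_replace(text, replacer):
    # zip(text, text) yields pairs such as ('h', 'h'); starmap then calls
    # replacer.get(ch, ch), so a character without a rule comes back unchanged.
    return ''.join(starmap(replacer.get, zip(text, text)))
```

Since every character is looked up exactly once, the order of the entries in the dict has no effect on the result.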