Python raw strings and unicode: how to use Web input as regexp patterns?

Posted on 2024-08-18 07:49:59

EDIT: This question doesn't really make sense once you have picked up what the "r" flag means. More details here.
For people looking for a quick answer, I added one below.

If I enter a regexp manually in a Python script, I can use 4 combinations of flags for my pattern strings:

p1 = "pattern"
p2 = u"pattern"
p3 = r"pattern"
p4 = ru"pattern"

I have a bunch of unicode strings coming from a Web form input and want to use them as regexp patterns.

I want to know what process I should apply to the strings so that I can expect results similar to those of the manual forms above. Something like:

import re
assert re.match(p1, some_text) == re.match(someProcess1(web_input), some_text)
assert re.match(p2, some_text) == re.match(someProcess2(web_input), some_text)
assert re.match(p3, some_text) == re.match(someProcess3(web_input), some_text)
assert re.match(p4, some_text) == re.match(someProcess4(web_input), some_text)

What would someProcess1 to someProcessN be, and why?

I suppose that someProcess2 doesn't need to do anything while someProcess1 should do some unicode conversion to the local encoding. For the raw string literals, I am clueless.

我们只是彼此的过ke 2024-08-25 07:49:59

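Apart from possibly having to encode Unicode properly (in Python 2.*), no processing is needed, because there is no specific type for "raw strings": it is just a syntax for literals, i.e. for string constants, and your code snippet contains no string constants, so there is nothing to "process".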
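To illustrate the point, here is a minimal sketch written for Python 3 (where every str is already Unicode). The name web_input is a hypothetical stand-in for whatever a web framework hands over; it is not taken from any real API. Text typed into a form already contains its backslashes as ordinary characters, so it is equal to the corresponding raw literal and can be passed to re unchanged.

```python
import re

# Hypothetical stand-in for form input: the user typed the four
# characters  \  d  +  $  into a text field. It is built by concatenation
# here so that no escape sequence is involved in this source file.
web_input = "\\" + "d+$"

# The raw literal is only source-code syntax for the same four characters.
raw_literal = r"\d+$"

same_string = (web_input == raw_literal)

def matches(pattern, text):
    """Return True if `pattern` matches at the start of `text`."""
    return re.match(pattern, text) is not None

digits_ok = matches(web_input, "12345")   # the input works as a pattern as-is
letters_ok = matches(web_input, "abc")    # no digits, so no match
```

The r prefix only matters while Python parses source code; a string that arrives at run time has no prefix to apply or remove.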

柏林苍穹下 2024-08-25 07:49:59

Note the following in your first example:

>>> p1 = "pattern"
>>> p2 = u"pattern"
>>> p3 = r"pattern"
>>> p4 = ur"pattern" # it's ur"", not ru"" btw
>>> p1 == p2 == p3 == p4
True

While these constructs look different, they all do the same thing: they create a string object containing the value "pattern" (in Python 2.x, p1 and p3 are str objects and p2 and p4 are unicode objects). The u, r and ur prefixes just tell the parser how to interpret the quoted string that follows, namely as unicode text (u) and/or as raw text (r), in which backslash escape sequences such as \n are not processed. In the end it doesn't matter how a string was created: raw or not, it is stored the same way internally.
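A short sketch of what the prefixes change, written for Python 3 (the ur prefix from the session above exists only in Python 2, so it is left out here). The variable names are illustrative only.

```python
import re

plain = "a\nb"       # the parser turns \n into one newline character
raw = r"a\nb"        # the parser keeps the backslash and the n
escaped = "a\\nb"    # a doubled backslash spells the same string as `raw`

lengths = (len(plain), len(raw), len(escaped))
raw_equals_escaped = (raw == escaped)
same_type = type(plain) is type(raw)   # no separate "raw string" type exists

# Both spellings work as a pattern for text containing a real newline:
# `plain` holds the newline itself, while `raw` holds the two characters
# \n, which the regex engine (not the Python parser) reads as a newline.
text = "a" + chr(10) + "b"
both_match = all(re.match(p, text) is not None for p in (plain, raw))
```

Once the literal has been parsed, nothing in the resulting object records which prefix was used.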

When you get unicode text as input, you have to distinguish (in Python 2.x) whether it is a unicode object or a str object. If you want to work with unicode content, you should work only with unicode objects internally and convert all str objects to unicode (either with str.decode() or with the u'text' syntax for hard-coded text). If you instead encode it to your local encoding, you will run into problems with unicode symbols.
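The "decode once at the boundary" advice could be sketched as follows. This is written for Python 3, where the byte type is bytes and the text type is str (in Python 2 these roles are played by str and unicode). The helper to_text and the UTF-8 default are assumptions made for illustration; the real encoding depends on how the form was submitted.

```python
import re

def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it arrived as bytes.

    The UTF-8 default is an assumption; use the encoding the client sent.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Hypothetical raw request data: the pattern  caf\u00e9\s+\d+  as UTF-8 bytes.
raw_bytes = ("caf\u00e9" + "\\" + "s+" + "\\" + "d+").encode("utf-8")

pattern = to_text(raw_bytes)            # decoded exactly once
already_text = to_text("caf\u00e9")     # text passes through untouched

match = re.match(pattern, "caf\u00e9 42")
matched_text = match.group(0) if match else None
```

Keeping everything as text after this single decoding step avoids the problems that appear when non-ASCII characters are forced into a local encoding.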

A different approach would be to use Python 3, whose str type supports unicode directly and stores everything as unicode, so you simply don't need to care about the encoding.
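A small sketch of the Python 3 behaviour, assuming the web framework has already decoded the form input into str (most do, but that is an assumption about your setup). It also shows that re refuses to mix a bytes pattern with str text, so a forgotten decoding step fails loudly instead of matching incorrectly.

```python
import re

text = "na\u00efve caf\u00e9"    # an ordinary Python 3 str, i.e. Unicode
pattern = "caf\u00e9$"           # str pattern, as decoded form input would be

found = re.search(pattern, text) is not None

# A bytes pattern cannot be applied to str text: re raises TypeError.
try:
    re.search(pattern.encode("utf-8"), text)
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
```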
