Python raw strings and unicode: how to use Web input as regexp patterns?
EDIT: This question doesn't really make sense once you have picked up what the "r" flag means. More details here.
For people looking for a quick answer, I added one below.
If I enter a regexp manually in a Python script, I can use 4 combinations of flags for my pattern strings:
- p1 = "pattern"
- p2 = u"pattern"
- p3 = r"pattern"
- p4 = ur"pattern"
I have a bunch of Unicode strings coming from a Web form input and want to use them as regexp patterns.
I want to know what process I should apply to the strings so that I can expect results similar to those of the manually written forms above. Something like:
import re
assert re.match(p1, some_text) == re.match(someProcess1(web_input), some_text)
assert re.match(p2, some_text) == re.match(someProcess2(web_input), some_text)
assert re.match(p3, some_text) == re.match(someProcess3(web_input), some_text)
assert re.match(p4, some_text) == re.match(someProcess4(web_input), some_text)
What would someProcess1 to someProcessN be, and why?
I suppose that someProcess2 doesn't need to do anything while someProcess1 should do some unicode conversion to the local encoding. For the raw string literals, I am clueless.
Comments (3)
Apart from possibly having to encode Unicode properly (in Python 2.*), no processing is needed because there is no specific type for "raw strings" -- it's just a syntax for literals, i.e. for string constants, and you don't have any string constants in your code snippet, so there's nothing to "process".
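The point that a raw string is only a literal syntax can be checked directly. The following is a minimal sketch, assuming Python 3 (where every str is Unicode); the name web_input is hypothetical and stands in for text received from a form, built here with chr(92) so that no literal syntax is involved at all.

```python
import re

# Two ways of writing the same three characters: backslash, "d", "+".
plain = "\\d+"   # regular literal: the backslash is escaped by hand
raw = r"\d+"     # raw literal: the backslash is kept as typed

# Both literals produce an identical object of the same type.
assert plain == raw
assert type(plain) is type(raw)

# Hypothetical web input: the user typed \d+ into a form field.
# chr(92) is the backslash character, so no string-literal escaping applies here.
web_input = chr(92) + "d+"

# Runtime text already has the same value as the raw literal,
# so it can be passed to re without any "raw" processing.
same_as_raw = (web_input == raw)
matched = re.match(web_input, "123abc").group(0)
```

Under these assumptions, same_as_raw is true and the pattern behaves exactly as the hand-written r"\d+" would.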
Note the following in your first example:
While these constructs look different, they all do the same thing: they create a string object (p1 and p3 a str, and p2 and p4 a unicode object in Python 2.x) containing the value "pattern". The u, r and ur prefixes just tell the parser how to interpret the following quoted string, namely as Unicode text (u) and/or raw text (r), in which backslash escape sequences are not interpreted. In the end it doesn't matter how a string was created: raw or not, it is stored the same way internally.
When you get Unicode text as input, you have to distinguish (in Python 2.x) whether it is a unicode object or a str object. If you want to work with Unicode content, you should work only with unicode objects internally and convert all str objects to unicode objects (either with str.decode() or with the u'text' syntax for hard-coded text). If you instead encode the text to your local encoding, you will run into problems with Unicode symbols.
A different approach would be to use Python 3, whose str object supports Unicode directly and stores everything as Unicode, so that you simply don't need to care about the encoding.
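The advice to keep everything as Unicode internally can be illustrated with a short sketch. It assumes Python 3, where bytes plays the role of the Python 2 str type and str plays the role of unicode; the helper name to_text and the UTF-8 default are assumptions made for illustration, not something the answer specifies.

```python
import re

def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Bytes as they might arrive over the wire for the word "café".
raw_bytes = "caf\u00e9".encode("utf-8")

# Decode once at the boundary, then work with text only.
pattern = to_text(raw_bytes)
match = re.match(pattern, "caf\u00e9 au lait")

# Encoding the text to a narrower encoding is where the trouble starts.
try:
    pattern.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Decoding at the boundary keeps the pattern and the searched text in the same type, which is what lets the non-ASCII character match.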
The "r" flag just prevents Python from interpreting "\" in a string literal. Since the Web doesn't care about what kind of data it carries, your web input will be a bunch of bytes that you are free to interpret the way you want.
So to address this problem:
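The code that originally followed is not present here. As a stand-in, here is one possible sketch of handling form input as a pattern, assuming Python 3 and UTF-8 encoded input; the function name compile_web_pattern is hypothetical, and returning None for an invalid pattern is just one design choice among several.

```python
import re

def compile_web_pattern(raw_input, encoding="utf-8"):
    """Decode form input if needed and compile it as a regex.

    Returns None when the input is not a valid regular expression.
    """
    text = raw_input.decode(encoding) if isinstance(raw_input, bytes) else raw_input
    try:
        return re.compile(text)
    except re.error:
        return None

# The bytes below spell \w+@\w+, as a user might type them into a form.
rx = compile_web_pattern(b"\\w+@\\w+")
found = rx.match("user@example").group(0)

# User input is not guaranteed to be a valid pattern.
broken = compile_web_pattern("(")

# If the input should be matched literally rather than as a regex,
# re.escape neutralises the special characters first.
literal = re.compile(re.escape("a.b"))
```

No step in this sketch corresponds to the "r" prefix: the backslashes in the input are already ordinary characters by the time the text reaches re.compile.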