Unexpected end of pattern: Python regex
When I use the following python regex to perform the functionality described below, I get the error Unexpected end of Pattern.
Regex:
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)
Purpose of this regex:
INPUT:
CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
Should match:
CODE876
CODE223
CODE657
CODE697
and replace occurrences with
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
Should Not match:
code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665
FINAL OUTPUT
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
EDIT and UPDATE 1
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)
The error no longer occurs, but the regex does not match any of the patterns as needed. Is there a problem with the matching groups or with the matching itself? When I compile this regex as is, I get no match for my input.
EDIT AND UPDATE 2
f=open("/Users/mymac/Desktop/regex.txt")
s=f.read()
s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)
print s1
INPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
OUTPUT
<a href="http://productcode/CODE123">CODE123</a> <a href="http://productcode/CODE765">CODE765</a> testing1<a href="http://productcode/CODE123">CODE123</a> example1<a href="http://productcode/CODE345">CODE345</a> http://www.coding.com/<a href="http://productcode/CODE333">CODE333</a> <a href="http://productcode/CODE345">CODE345</a>
<a href="http://productcode/CODE234">CODE234</a>
<a href="http://productcode/CODE333">CODE333</a>
The regex works for raw input, but not for string input read from a text file.
See inputs 4 and 5 for more results: http://ideone.com/3w1E3
Comments (4)
Your main problem is the (?-i) thingy, which is wishful thinking as far as Python 2.7 and 3.2 are concerned. For more details, see below.
Looks like suggestions fall on deaf ears ... Here's the pattern in re.VERBOSE format:
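As a rough illustration of what such a layout could look like (a reconstruction under assumptions, not necessarily the pattern this answer posted), the sketch below spells the question's regex out with re.VERBOSE. It assumes Python 3.6 or later, where the scoped (?i:...) group is available; on 2.7 and 3.2 that group does not exist, and simply leaving the case-insensitive flag off is the fallback. The names CODE_LINK and link are made up for the example.

```python
import re

# Hypothetical reconstruction of the question's pattern in verbose layout.
# Assumes Python 3.6+ for the scoped (?i:...) group.
CODE_LINK = re.compile(r"""
    ^
    (                       # group 1: text in front of the product code
      (?i:                  # case-insensitive only inside this group
        (?:
          (?!http://|testing[0-9]|example[0-9])  # no excluded token starts here
          .
        )*?
      )
    )
    (CODE[0-9]{3})          # group 2: the code itself, matched case-sensitively
    (?!</a>)                # skip codes that already close an anchor
""", re.VERBOSE)


def link(line):
    """Wrap the first product code on a line in an anchor, keeping the prefix."""
    return CODE_LINK.sub(
        r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', line)


print(link("matchjustCODE657"))
print(link("code876"))
```

The indentation mirrors the nesting of the groups, which makes a missing or extra parenthesis much easier to spot than in the one-line form.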
Okay, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how they work in most flavors. In Python it seems they always modify the whole regex, same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either.
On a side note, I don't see any reason to use three separate lookaheads, as you did here:
(?!http://)(?!testing[0-9])(?!example[0-9])
Just OR them together:
(?!http://|testing[0-9]|example[0-9])
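A quick way to convince yourself that the two forms agree is to run both against the sample strings from the question. This is only a sketch; the names SEPARATE, COMBINED and samples are made up for the check.

```python
import re

# Three consecutive negative lookaheads, as in the question.
SEPARATE = re.compile(r'(?!http://)(?!testing[0-9])(?!example[0-9])')
# One negative lookahead with the alternatives ORed together.
COMBINED = re.compile(r'(?!http://|testing[0-9]|example[0-9])')

samples = [
    "http://replaced/CODE665",
    "testing1CODE888",
    "example2CODE098",
    "matchjustCODE657",
    "CODE876",
]

# Each pattern is zero-width, so match() returns an empty match when the
# string does NOT start with an excluded token, and None when it does.
results = [
    (s, SEPARATE.match(s) is not None, COMBINED.match(s) is not None)
    for s in samples
]

for sample, separate_ok, combined_ok in results:
    print(sample, separate_ok, combined_ok)
```

Both columns come out identical for every sample, since "none of A, B, C starts here" is the same condition as "(A or B or C) does not start here".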
EDIT: We seem to have moved from the question of why the regex throws exceptions to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.
See it in action on ideone.com.
Is that what you're after?
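For reference, here is a minimal sketch of one regex and replacement string that produce the results requested in the question when each input sits on its own line. It is not necessarily the pair this answer used; PATTERN, REPLACEMENT, text and result are names chosen for the example. It keeps the prefix in group 1, puts the code in group 2, and relies on re.MULTILINE so that ^ anchors at every line.

```python
import re

# The lookahead is repeated before every prefix character, so an excluded
# token anywhere in front of the code blocks the match.
PATTERN = re.compile(
    r'^((?:(?!http://|testing[0-9]|example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',
    re.MULTILINE,
)
REPLACEMENT = r'\g<1><a href="http://productcode/\g<2>">\g<2></a>'

text = "\n".join([
    "CODE876",
    "CODE223",
    "matchjustCODE657",
    "CODE69743",
    "code876",
    "testing1CODE888",
    "example2CODE098",
    "http://replaced/CODE665",
])

result = PATTERN.sub(REPLACEMENT, text)
print(result)
```

Because no case-insensitive flag is set, code876 is left alone, and CODE69743 is linked as CODE697 followed by the leftover 43.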
EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That makes the problem much more difficult, but we also know the full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match our target strings.
The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.
Here's another demo. (I know that's horrible code, but I'm too tired right now to learn a more Pythonic way.)
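The two-alternative idea can be sketched roughly as follows. This is an illustration of the technique described above, not the code from the demo: the pattern, the \b word-boundary trick for rejecting testing1CODE123-style tokens, and the names PATTERN, link_codes and sample are all assumptions made for the example. It also assumes, as stated above, that full URLs appear only inside existing anchors.

```python
import re

PATTERN = re.compile(r"""
    (<a\b[^>]*>.*?</a>)            # group 1: an existing anchor, kept as is
  |
    \b                             # start at the beginning of a word
    ((?:(?!testing[0-9]|example[0-9])\w)*?)   # group 2: allowed word prefix
    (CODE[0-9]{3})                 # group 3: the product code
""", re.VERBOSE)


def link_codes(text):
    """Link product codes in text, leaving existing anchors untouched."""
    def replace(match):
        if match.group(1) is not None:
            # Matched a complete anchor element: return it unchanged.
            return match.group(1)
        prefix, code = match.group(2), match.group(3)
        return '%s<a href="http://productcode/%s">%s</a>' % (prefix, code, code)
    return PATTERN.sub(replace, text)


sample = ('CODE123 testing1CODE123 matchjustCODE657 '
          '<a href="http://www.coding.com/CODE333">CODE333</a> code876')
print(link_codes(sample))
```

Since the regex scans left to right, an anchor is consumed as a whole by the first alternative before the second one gets a chance to see the code inside it.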
The only problem I see is that you replace using the wrong capturing group.
Here I made the first one a non-capturing group as well.
See it here on Regexr.
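To make the group numbering concrete, the sketch below runs three variants on one of the question's inputs. It is an illustration, not the pattern from this answer; PREFIX, CODE, LINK and the result names are invented for the example, and the (?i) flag from the question is left out.

```python
import re

PREFIX = r'(?:(?!http://|testing[0-9]|example[0-9]).)*?'
CODE = r'(CODE[0-9]{3})'
LINK = r'<a href="http://productcode/\g<%d>">\g<%d></a>'

line = "matchjustCODE657"

# Prefix captured as group 1, but the replacement also uses group 1:
# the link is built from the prefix instead of the code.
wrong = re.sub('^(' + PREFIX + ')' + CODE, LINK % (1, 1), line)

# Prefix made non-capturing: the code is now group 1, so the link is right,
# but the matched prefix is consumed and not written back.
non_capturing = re.sub('^(?:' + PREFIX + ')' + CODE, LINK % (1, 1), line)

# Prefix kept as group 1 and re-emitted, code taken from group 2.
keep_prefix = re.sub('^(' + PREFIX + ')' + CODE,
                     r'\g<1>' + LINK % (2, 2), line)

print(wrong)
print(non_capturing)
print(keep_prefix)
```

The non-capturing variant fixes the link target, while the third variant is the one to reach for when the text in front of the code has to survive the substitution.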
For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match up correctly (i.e. by using indentation to indicate the current level of nesting).