How can I make this regex more general? Sometimes it works and sometimes it doesn't
I have the following regex that I am using in a Java application. Sometimes it works correctly and sometimes it doesn't.
<!-- <editable name=(\".*\")?> -->(.*)<!-- </editable> -->
Sometimes I will have whitespace before/after it, sometimes there will be text. The same goes for the region within the tags.
The main problem is that name=(\".*\")?> sometimes matches more than it is supposed to. I am not sure whether the fix is obvious from simply looking at this code.
Comments (4)
As others have pointed out, the greedy .* (dot-star) that matches the "name" attribute needs to be made non-greedy (.*?) or, even better, replaced with a negated character class ([^"]*) so it can't match beyond the closing quotation mark no matter what happens in the rest of the regex. Once you've fixed that, you'll probably find you have the same problem with the other dot-star; you need to make it non-greedy too.
I don't get the significance of your remarks about whitespace. If it's linefeeds and/or carriage returns you're talking about, the DOTALL modifier lets the dot match those, and of course \s matches them as well.
I wrote this in the form of a Java string literal to avoid confusion about where you need backslashes and how many of them you need. In a "raw" regex, there would be only one backslash in each of the whitespace shorthands (\s*), and the quotation marks wouldn't need to be escaped ("[^"]*").
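The pattern this answer refers to did not survive in this copy of the page. The following is only a sketch of what the advice adds up to, not the answerer's original code: a negated character class for the name, a non-greedy body, \s* around the markers, and DOTALL so the body can span lines. The class name EditableRegionPattern, the sample input, and the decision to make the whole name attribute optional are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EditableRegionPattern {
    // Written as a Java string literal: each \s needs two backslashes and
    // each quotation mark inside the regex is escaped as \".
    public static final Pattern EDITABLE = Pattern.compile(
        "<!--\\s*<editable(?:\\s+name=\"([^\"]*)\")?\\s*>\\s*-->" // opening marker; group 1 = name without quotes
        + "(.*?)"                                                  // group 2 = region body, non-greedy
        + "<!--\\s*</editable>\\s*-->",                            // closing marker
        Pattern.DOTALL);                                           // let the dot match line terminators

    public static void main(String[] args) {
        // Hypothetical input: two regions, the second spanning several lines.
        String input =
            "text <!-- <editable name=\"a\"> -->first<!-- </editable> -->\n"
            + "<!--<editable name=\"b\" >-->\nsecond\n<!-- </editable> --> more";
        Matcher m = EDITABLE.matcher(input);
        while (m.find()) {
            System.out.println(m.group(1) + " => [" + m.group(2) + "]");
        }
    }
}
```

Because the name is matched with [^\"]*, it cannot run past its own closing quote, and because the body is (.*?), each match stops at the first closing marker rather than the last one in the input.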
I would replace that .* with [\w-]*, for example, if name is an identifier of some sort. Or [^\"]* so it doesn't capture the end double quote.
Edit:
As mentioned in other posts, you might consider going for a simple DOM traversal, or an XPath or XQuery based evaluation process, instead of a plain regular expression. But note that you will still need a regex in the filtering process, because you can find the target comments only by testing their body against a regular expression (as I doubt the body is constant, judging from the sample).
Edit 2:
It might be that leading, trailing, or internal whitespace in the comment body makes your regexp fail. Consider putting \s* at the beginning and at the end, plus \s+ before the attribute-like thing.
Or when you are filtering on XML based search:
Edit 3: Fixed the escapes twice. Thanks Alan M.
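The DOM-plus-regex idea from the first edit could look roughly like the sketch below. It is not the answerer's code (their snippets are missing from this copy) and it assumes the input is well-formed XML. The class EditableCommentFinder, the method findEditableNames, and the sample document are hypothetical names chosen for illustration; the parser calls are the standard javax.xml.parsers and org.w3c.dom APIs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class EditableCommentFinder {
    // Applied to the comment body only, so the <!-- and --> delimiters are not part of it.
    // \s* at both ends and \s+ before the attribute, as suggested in Edit 2.
    private static final Pattern OPEN_MARKER =
        Pattern.compile("\\s*<editable\\s+name=\"([^\"]*)\"\\s*>\\s*");

    /** Returns the name of every opening "editable" comment, in document order. */
    public static List<String> findEditableNames(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            collect(doc, names);
            return names;
        } catch (Exception e) {
            // Wrap parser errors so callers do not have to handle checked exceptions.
            throw new IllegalStateException("could not parse input as XML", e);
        }
    }

    // Depth-first walk; only comment nodes are tested against the regex.
    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            Matcher m = OPEN_MARKER.matcher(node.getNodeValue());
            if (m.matches()) {
                names.add(m.group(1));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, names);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document with one editable region and one unrelated comment.
        String xml = "<root><!-- <editable name=\"intro\"> --><p>Hello</p>"
                   + "<!-- </editable> --><!-- plain --></root>";
        System.out.println(findEditableNames(xml));
    }
}
```

The parser takes care of locating comments, so the regex only has to recognise the marker inside a single comment body and never has to skip over markup.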
The * multiplier is "greedy" by default, meaning it matches as much as possible while still matching the pattern successfully. You can disable this by using *?, so try:
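The snippet that followed "so try:" is missing from this copy. As a stand-in, here is a small sketch that contrasts the greedy and non-greedy forms of the name sub-pattern on one line containing two tags. The class GreedyVsLazy, the helper firstGroup, and the sample input are hypothetical and used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    /** Returns group 1 of the first match of regex in input, or null if there is no match. */
    public static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Two tags on one line, so a greedy .* has room to overshoot.
        String input = "<editable name=\"a\"> x <editable name=\"b\">";

        // Greedy: .* runs to the end of the line and backtracks to the LAST quote.
        System.out.println(firstGroup("name=(\".*\")", input));   // "a"> x <editable name="b"

        // Non-greedy: .*? stops at the FIRST closing quote.
        System.out.println(firstGroup("name=(\".*?\")", input));  // "a"
    }
}
```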
XML is not a regular language, nor is HTML or any other language with "nesting" constructs. Don't try to parse it with regular expressions.
Choose an XML parser.