Fixing up regexes to work around an ICU/RegexKitLite bug
I'm using RegexKitLite, which in turn uses ICU as its engine. Despite the documentation, a regex like /x*/ when searching against "xxxxxxxxxxx" will match the empty string. It is behaving like /x*?/ should. I would like to route around this bug when it's present, and I'm considering rewriting any unescaped * as + when a regex match returns a 0-length result. My naïve guess is that the regex with +s in place of *s will always return a subset of the correct results. What are the unexpected consequences of this? Am I going the right way?
FWIW, ICU also offers a *+ operator, but it doesn't work either.
EDIT: I should have been clearer: this is for the search field of an interactive app. I have no control over the regex that the user enters. The broken * support appears to be a bug in ICU. I sure wish I didn't need to include that POS in my code, but it's the only game in town.
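For reference, the behaviour the question expects can be checked against another backtracking engine. The sketch below uses Python's re module purely as a stand-in; it is not ICU or RegexKitLite, so it says nothing about where the problem actually lives, and the variable names are only illustrative.

```python
import re

subject = "x" * 11  # the question's "xxxxxxxxxxx"

greedy = re.search(r"x*", subject)   # greedy star: takes as many x's as it can
lazy = re.search(r"x*?", subject)    # reluctant star: takes as few as it can

print(repr(greedy.group()))  # the whole run of x's
print(repr(lazy.group()))    # the empty string

# A greedy star can still legitimately match the empty string when the
# subject does not start with an x, because the leftmost match wins.
leading = re.search(r"x*", "abxxx")
print(leading.span())        # zero-length match at the start of the subject
```

If the engine in the app returns a zero-length match for the first case, that differs from what this stand-in does; a zero-length match in the last case is ordinary regex behaviour.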
Comments (4)
If you simply change every * quantifier to a +, the regex will fail to work in those instances where the * should have matched zero occurrences. In other words, the problem will have morphed from always matching zero to never matching zero. If you ask me, it's useless either way.

However, you might be able to handle the zero-occurrences case separately, with a negative lookahead. For example, x* could be rewritten as (?:(?!x)|x+). It's hideous, I know, but it's the most self-contained fix I can envision at the moment. You would have to do this for possessive stars as well (*+), but not reluctant stars (*?). Here it is in table form:

    ORIGINAL   FIXED
    x*         (?:(?!x)|x+)
    x*+        (?:(?!x)|x++)
    x*?        x*?

More complex atoms would need to have their own parentheses preserved. You could probably drop them inside the lookahead, but they don't hurt anything except readability, and that's a lost cause anyway. :D

If the {min,} and {min,max} forms are affected too, they would get the same treatment (with the same modifications for possessive variants).

It occurs to me that conditionals, (?(condition)yes-pattern|no-pattern), would be a perfect fit here; unfortunately, ICU doesn't seem to support them.
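The lookahead rewrite of x* can be tried out in isolation. The following is a minimal sketch using Python's re module as a stand-in engine (not ICU), and rewrite_star is a hypothetical helper that only handles a simple, already-parenthesised atom. It also shows one case where the rewritten form does not behave like the star once it sits inside a larger pattern.

```python
import re

def rewrite_star(atom: str) -> str:
    """Hypothetical helper: build the lookahead form of `atom*` for a simple atom."""
    return f"(?:(?!{atom})|{atom}+)"

fixed = rewrite_star("x")  # (?:(?!x)|x+)

# On its own, the rewritten pattern finds the same span as x* for these inputs.
samples = ["xxxx", "abc", "", "axxb"]
spans = []
for s in samples:
    original_span = re.search(r"x*", s).span()
    fixed_span = re.search(fixed, s).span()
    spans.append((original_span, fixed_span))
    print(repr(s), original_span, fixed_span)

# Plain x+ loses the zero-occurrence case entirely.
plus_on_abc = re.search(r"x+", "abc")        # no match

# Caveat: the rewritten group cannot backtrack down to zero occurrences
# when an x follows, so x*x and the rewritten form differ on "x".
star_then_x = re.search(r"x*x", "x")         # matches "x"
fixed_then_x = re.search(fixed + "x", "x")   # no match in Python's re
```

So the rewrite looks safe for a bare x*, but a user-entered pattern that relies on the star giving back characters may behave differently.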
I can't say where things may have gone wrong with the code in question, but I can say with confidence that this specific bug is not in the ICU library. (I'm the author of the ICU regular expression package.)
I agree with the sentiment expressed above: the thing to do is not to try to hack around the problem by tweaking the regexp pattern, but to understand what the underlying problem is. There's probably some simple mistake being made that isn't clear from the original question as posed.
Both \* and [*] are literal asterisks, so a naive replacement mightn't work.

In fact, don't do dynamic rewriting; it's too complicated. Try to tweak your regexes statically first. x* is equivalent to x{0,} and (?:x+)?.
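To illustrate why a naive search-and-replace on * is fragile, here is a rough sketch of a scanner that separates quantifier stars from literal ones. It is written in Python as an illustration only; star_positions is a hypothetical helper, and it deliberately ignores \Q...\E quoting, nested character classes and free-spacing mode, all of which a real rewriter would have to handle. The last lines check the stated equivalence of x*, x{0,} and (?:x+)? in Python's re, which is a stand-in for ICU here.

```python
import re

def star_positions(pattern: str) -> list:
    """Return indexes of '*' characters that act as quantifiers.

    Skips backslash-escaped characters and anything inside a [...] class.
    """
    positions = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2              # skip the escaped character as well
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "*":
            positions.append(i)
        i += 1
    return positions

found = star_positions(r"a*b\*c[*]d*")
print(found)  # only the stars after 'a' and 'd' are quantifiers

# The three spellings of "zero or more x" agree on these inputs.
forms = [r"x*", r"x{0,}", r"(?:x+)?"]
agree = all(
    len({re.search(f, s).span() for f in forms}) == 1
    for s in ["xxxx", "abc", "", "axxb"]
)
```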
Yeah, use that strategy:
(pseudo code)
if ($str =~ /x*/ && $str =~ /(x+)/) {
print "'$1'\n";
}
But the real problem is the BUG as you say. Why on earth is the basic construct of quantifiers screwed up? This is not a module you should include in your code.
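The two-step check in the pseudo code can be sketched as a small function. This is a Python illustration rather than RegexKitLite code; search_with_fallback and its parameter names are hypothetical, and it assumes the caller already has a + variant of the pattern available.

```python
import re

def search_with_fallback(star_pattern: str, plus_pattern: str, text: str):
    """Try the '*' pattern; on a zero-length hit, retry with the '+' variant."""
    m = re.search(star_pattern, text)
    if m is not None and m.end() == m.start():
        retry = re.search(plus_pattern, text)
        if retry is not None:
            return retry
    return m

hit = search_with_fallback(r"x*", r"x+", "abxxx")
print(repr(hit.group()), hit.span())

miss = search_with_fallback(r"x*", r"x+", "abc")
print(repr(miss.group()), miss.span())
```

Note that in an engine whose star works correctly, this changes the result for "abxxx": the leftmost match of x* is the empty string at position 0, while the fallback reports the later run of x's. That may be acceptable for a search field, but it is a change in semantics rather than a pure bug fix.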