Java regular expressions: UNGREEDY flag

Posted 2024-08-06 22:46:11

I'd like to port Texy!, a generic text processing tool, from PHP to Java.

The tool uses ungreedy matching everywhere, via preg_match_all("/.../U"), so I am looking for a Java regex library that has some kind of UNGREEDY flag.

I know I could use the .*? syntax, but there are a great many regular expressions that I would have to rewrite, and then check again with every updated version.

I've checked:

  • ORO - seems to be abandoned
  • Jakarta Regexp - no support
  • java.util.regex - no support

Is there any such library?

Thanks, Ondra
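
For context, the difference the question is about can be shown with java.util.regex alone. This is only a minimal sketch: the class name GreedyVsLazy, the helper firstMatch, and the sample input are made up for illustration and do not come from Texy!.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>one</b> and <b>two</b>";

        // Greedy (the Java default): .* runs on to the last "</b>".
        System.out.println(firstMatch("<b>.*</b>", input));

        // Lazy: .*? stops at the first "</b>". This is how PHP treats
        // a plain .* when the pattern carries the /U modifier.
        System.out.println(firstMatch("<b>.*?</b>", input));
    }
}
```

The first call prints the whole input and the second prints only "<b>one</b>", so every quantifier in a ported pattern has to be flipped by hand unless the library offers a switch.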


Comments (4)

愛上了 2024-08-13 22:46:11

Update: After checking the docs I found a LAZY flag, which is another term for non-greedy. However, it appears to exist only in the OpenJDK source and not as a public flag of java.util.regex.Pattern, so a call like this is unlikely to compile against the standard API:

    p = Pattern.compile("your regex here", LAZY);
    p.matcher("string to match")

Original (now outdated) answer
I honestly don't think there is one.

The whole point of +? and *? is that you can choose which parts of a pattern match greedily and which match lazily.

Greedy is the default behaviour because that is the most common use of + and * in regular expressions. In fact, I can't think of a single regex parser that does it the other way around, that is, one where matching is lazy by default and a modifier is needed to make something greedy.

I know this isn't the answer you're looking for, but I think the only way to make it work is to add the ? to your *'s and +'s. On the upside, you can use regular expressions to help determine which ones need to be changed, or even to make the changes for you, either for all of them or for those matching a pattern you can describe.
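
The idea of letting a program make the changes can be sketched as a small pattern rewriter. This is a rough illustration under stated assumptions, not a complete PCRE parser: the class UngreedyRewriter and the method invertGreediness are hypothetical names. It follows the documented behaviour of PCRE's U modifier, which inverts greediness (a plain quantifier becomes lazy and a quantifier followed by ? becomes greedy). It skips escapes, \Q...\E sections and character classes, and it leaves possessive quantifiers alone. It does not handle extended-mode comments, a literal ] at the start of a character class, or inline option groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UngreedyRewriter {
    // Recognises a brace quantifier such as {3}, {2,} or {2,5}.
    private static final Pattern BRACES = Pattern.compile("\\{\\d+(,\\d*)?\\}");

    /** Hypothetical helper: flips greedy and lazy quantifiers in a PCRE-style pattern. */
    static String invertGreediness(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inClass = false;        // currently inside [...]
        boolean afterOpenParen = false; // previous token was a bare "("
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c == '\\') {
                // Copy an escape pair, or a whole \Q...\E literal section, untouched.
                int stop = Math.min(i + 2, regex.length());
                if (regex.startsWith("\\Q", i)) {
                    int end = regex.indexOf("\\E", i + 2);
                    stop = end < 0 ? regex.length() : end + 2;
                }
                out.append(regex, i, stop);
                i = stop;
                afterOpenParen = false;
                continue;
            }
            if (inClass || c == '[') {
                // Quantifier characters inside a character class are literals.
                inClass = (c != ']');
                out.append(c);
                i++;
                afterOpenParen = false;
                continue;
            }
            int qEnd = -1; // end index of the quantifier starting at i, if any
            if (c == '*' || c == '+' || (c == '?' && !afterOpenParen)) {
                qEnd = i + 1;
            } else if (c == '{') {
                Matcher m = BRACES.matcher(regex).region(i, regex.length());
                if (m.lookingAt()) qEnd = m.end();
            }
            if (qEnd < 0) {
                // Ordinary character; "(?" group syntax is not a quantifier.
                out.append(c);
                afterOpenParen = (c == '(');
                i++;
                continue;
            }
            out.append(regex, i, qEnd);
            i = qEnd;
            afterOpenParen = false;
            if (i < regex.length() && regex.charAt(i) == '?') {
                i++;                 // lazy under /U means greedy: drop the '?'
            } else if (i < regex.length() && regex.charAt(i) == '+') {
                out.append('+');     // possessive: keep as written
                i++;
            } else {
                out.append('?');     // plain quantifier: make it lazy
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(invertGreediness("<b>.*</b>"));
        System.out.println(invertGreediness("a+?b*"));
    }
}
```

Run over the pattern list at build time, a rewriter like this would avoid hand-editing each expression, although the output still needs to be checked against the PHP behaviour.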

聽兲甴掵 2024-08-13 22:46:11

I suggest you create your own modified copy of the Java library. Simply copy the java.util.regex source into your own package.

The Sun JDK 1.6 Pattern.java class defines these internal constants:

    static final int GREEDY     = 0;
    static final int LAZY       = 1;
    static final int POSSESSIVE = 2;

You'll notice that these constants are only used in a few places, and that the code is trivial to modify. Take the following example:

    case '*':
        ch = next();
        if (ch == '?') {
            next();
            return new Curly(prev, 0, MAX_REPS, LAZY);
        } else if (ch == '+') {
            next();
            return new Curly(prev, 0, MAX_REPS, POSSESSIVE);
        }
        return new Curly(prev, 0, MAX_REPS, GREEDY);

Simply change the last line to pass LAZY instead of GREEDY, and do the same for the other quantifiers. Note that PCRE's /U modifier also turns *? into a greedy quantifier, so to mirror it fully the '?' branch would have to return GREEDY. Since you want a regex library that behaves like the PHP one, this might be the best way to go.

讽刺将军 2024-08-13 22:46:11

About the idea of checking and rechecking all the regular expressions: are you sure that the PHP and Java libraries agree enough on syntax that you wouldn't have to do this anyway? What I'd do up front is go through them all, write some tests (input and expected output), and make sure they behave the same in both implementations. Then devise a way to run those tests automatically, and you will be covered for future upgrades and incompatibilities. You'll still need to tweak things, but at least you'll know where.
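
Such a test run could look roughly like the table-driven check below. It is only a sketch: RegexRegressionCheck, Case, allMatches and failures are hypothetical names, and the two sample cases are illustrative. In practice the expected lists would be recorded from preg_match_all on the PHP side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexRegressionCheck {
    /** One hypothetical test case: a Java pattern, an input, and the matches PHP produced. */
    static final class Case {
        final String regex;
        final String input;
        final List<String> expected;

        Case(String regex, String input, String... expected) {
            this.regex = regex;
            this.input = input;
            this.expected = Arrays.asList(expected);
        }
    }

    // Collects every full match, similar in spirit to preg_match_all.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Returns one message per case whose Java result differs from the recorded one.
    static List<String> failures(List<Case> cases) {
        List<String> problems = new ArrayList<>();
        for (Case c : cases) {
            List<String> actual = allMatches(c.regex, c.input);
            if (!actual.equals(c.expected)) {
                problems.add(c.regex + " on \"" + c.input + "\": expected "
                        + c.expected + " but got " + actual);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<Case> cases = Arrays.asList(
                new Case("<b>.*?</b>", "<b>one</b> and <b>two</b>", "<b>one</b>", "<b>two</b>"),
                new Case("\\d+?", "2024", "2", "0", "2", "4"));
        List<String> problems = failures(cases);
        for (String p : problems) {
            System.out.println(p);
        }
        System.out.println(problems.isEmpty() ? "all cases agree" : problems.size() + " mismatch(es)");
    }
}
```

Keeping the cases in one list makes it easy to rerun the whole set whenever the PHP original is updated.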

别念他 2024-08-13 22:46:11

You may be able to use 'com.caucho.quercus.lib.regexp.JavaRegexpModule'. Quercus is a Java implementation of PHP, and its regex library implements the PHP regex syntax and method names.
