正则表达式中贪婪与否定字符类的问题

发布于 2024-07-07 05:46:41 字数 519 浏览 7 评论 0原文

我有一个非常大的文件,如下所示(见下文)。 我有两种基本的正则表达式选择可供使用(我知道可能还有其他方法,但我真的在尝试比较贪婪和否定字符类)方法。

ftp: [^\D]{1,}
ftp: (\d)+
ftp: \d+

注意:如果我去掉 \d 周围的括号会怎样?

现在 + 是贪婪的,这会强制回溯,但否定字符类需要逐个字符比较。 哪个更有效率? 假设文件非常非常大,因此处理器使用情况的微小差异将因文件的长度而变得夸大。

既然您已经回答了这个问题,如果我的否定字符类非常大,比如说 18 个不同的字符怎么办? 这会改变你的答案吗?

谢谢。

ftp: 1117 字节
FTP: 5696 字节
FTP: 3207 字节
FTP: 5696 字节
ftp:7200字节

I have a very large file that looks like this (see below). I have two basic choices of regex to use on it (I know there may be others but I'm really trying to compare Greedy and Negated Char Class) methods.

ftp: [^\D]{1,}
ftp: (\d)+
ftp: \d+

Note: what if I took off the parense around the \d?

Now + is greedy which forces backtracking but the Negated Char Class require a char-by-char comparison. Which is more efficient? Assume the file is very-very large so minute differences in processor usage will become exaggerated due to the length of the file.

Now that you've answered that, What if my Negated Char Class was very large, say 18 different characters? Would that change your answer?

Thanks.

ftp: 1117 bytes
ftp: 5696 bytes
ftp: 3207 bytes
ftp: 5696 bytes
ftp: 7200 bytes

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

乱了心跳 2024-07-14 05:46:41

[^\D]{1,} 和 \d+ 完全相同。 正则表达式解析器会将 [^\D] 和 \d 编译为内容相同的字符类,而 + 只是 {1,} 的缩写。

如果你想要懒惰重复,你可以添加一个 ? 在最后。

\d+?

字符类通常被编译成 ASCII 字符的位图。 对于 Unicode (>=256),它取决于实现。 一种方法是创建范围列表,并对其使用二分搜索。

对于 ASCII,查找时间在大小上是恒定的。 对于 Unicode,它是对数或线性的。

[^\D]{1,} and \d+ is exactly the same. The regex parser will compile [^\D] and \d into character classes with the equal content, and + is just short for {1,}.

If you want lazy repetition you can add a ? at the end.

\d+?

The character classes are usually compiled into bitmaps for ASCII-characters. For Unicode (>=256) it is implementation dependent. One way could be to create a list of ranges, and use binary search on it.

For ASCII the lookup time is constant over the size. For Unicode it is logarithmic or linear.

秋心╮凉 2024-07-14 05:46:41

你们的表情都一样的贪婪。 正如其他人在这里所说的,除了捕获组之外,它们将以相同的方式执行。

同样在这种情况下,贪婪对于执行速度来说并不重要,因为 \d* 之后没有任何内容。 在这种情况下,表达式将简单地处理它能找到的所有数字,并在遇到空格时停止。 这些表达式不应发生回溯。

为了使其更明确,如果您有这样的表达式,则应该进行回溯:

\d*123

在这种情况下,解析器将吞没所有数字,然后回溯以匹配后面的三个数字。

Both your expressions have the same greediness. As others have said here, except for the capturing group they will execute in the same way.

Also in this case greediness won't matter much at the execution speed since you don't have anything following \d*. In this case the expression will simply process all the digits it can find and stop when the space is encountered. No backtracking should occur with these expressions.

To make it more explicit, backtracking should occur if you have an expression like this:

\d*123

In this case the parser will engulf all the digits, then backtrack to match the three following digits.

浅唱ヾ落雨殇 2024-07-14 05:46:41

是的,我同意 MizardX...这两个表达式在语义上是等效的。 尽管分组可能需要额外的资源。 你问的不是这个。

Yeah, I agree with MizardX... these two expressions are semantically equivalent. Although the grouping could require additional resources. That's not what you were asking about.

玩心态 2024-07-14 05:46:41

我的初步测试表明 [^\D{1,} 比 \d+ 慢一点,在 184M 文件上,前者需要 9.6 秒,而后者需要 8.2 秒,

在不捕获(() 的情况下)两者都快约 1 秒,但两者之间的差异大致相同。

我还做了更广泛的测试,其中捕获的值被打印到 /dev/null,第三个版本在空间上分割,结果:

([^\D]{1,}): ~18s
(\d+): ~17s
(split / /)[1]: ~17s

编辑:分割版本改进,时间减少到与 (\d+)

最快 相同或更低到目前为止的版本(有人可以改进吗?):

while (<>)
{
    if ($foo = (split / /)[1])
    {
        print $foo . "\n";
    }
}

My initial tests show that [^\D{1,} is a bit slower than \d+, on a 184M file the former takes 9.6 seconds while the latter takes 8.2

Without capturing (the ()'s) both are about 1 second faster, but the difference between the two is about the same.

I also did a more extensive test where the captured value is printed to /dev/null, with a third version splitting on the space, results:

([^\D]{1,}): ~18s
(\d+): ~17s
(split / /)[1]: ~17s

Edit: split version improved and time decreased to be the same or lower than (\d+)

Fastest version so far (can anyone improve?):

while (<>)
{
    if ($foo = (split / /)[1])
    {
        print $foo . "\n";
    }
}
屋檐 2024-07-14 05:46:41

这是一个棘手的问题,因为 (\d)+ 由于捕获括号的开销而需要稍长的时间。 如果将其更改为 \d+,它们在我的 Perl / 系统中将花费相同的时间。

This is kind of a trick question as written because (\d)+ takes slightly longer due to the overhead of the capturing parentheses. If you change it to \d+ they take the same amount of time in my Perl / system.

孤寂小茶 2024-07-14 05:46:41

这不是问题的直接答案,但既然您已经知道行的格式,为什么不完全采用不同的方法呢? 例如,您可以在字段之间的空白处使用正则表达式,或者完全避免在空白处使用正则表达式和 split(),这通常比任何正则表达式更快,具体取决于您使用的语言。

Not a direct answer to the question, but why not a different approach altogether, since you know the format of the lines already? For example, you could use a regex on the whitespace between the fields, or avoid regex altogether and split() on the whitespace, which is generally going to be faster than any regular expression, depending on the language you're using.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文