Why is this regex making so many calls to substcont?
This is more out of curiosity than anything else, as I'm failing to find any useful info on Google about this function (CORE::substcont)
In profiling and optimising some old, slow, XML parsing code I've found that the following regex is calling substcont 31 times for each time the line is executed, and taking a huge amount of time:
Calls: 10000 Time: 2.65s Sub calls: 320000 Time in subs: 1.15s
$handle =~s/(>)\s*(<)/$1\n$2/g;
# spent 1.09s making 310000 calls to main::CORE:substcont, avg 4µs/call
# spent 58.8ms making 10000 calls to main::CORE:subst, avg 6µs/call
Compared to the immediately preceding line:
Calls: 10000 Time: 371ms Sub calls: 30000 Time in subs: 221ms
$handle =~s/(.*)\s*(<\?)/$1\n$2/g;
# spent 136ms making 10000 calls to main::CORE:subst, avg 14µs/call
# spent 84.6ms making 20000 calls to main::CORE:substcont, avg 4µs/call
The number of substcont calls is quite surprising, especially seeing as I would've thought that the second regex would be more expensive. This is, obviously, why profiling is a Good Thing ;-)
I've subsequently changed both these lines to remove the unnecessary backrefs, with dramatic results for the badly-behaving line:
Calls: 10000 Time: 393ms Sub calls: 10000 Time in subs: 341ms
$handle =~s/>\s*</>\n</g;
# spent 341ms making 10000 calls to main::CORE:subst, avg 34µs/call
So, my question is: why should the original have been making SO many calls to substcont, and what does substcont even do in the regex engine that takes so long?
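Before relying on the rewritten line, it may be worth confirming that dropping the captures does not change the result. The following is a minimal sketch; the sample string is invented for illustration (31 adjacent tags, hence 30 `>...<` boundaries) and is not the data from the profile above.

```perl
use strict;
use warnings;

# Hypothetical input: 31 tags separated by two spaces, giving 30 boundaries.
my $sample = join "  ", map { "<t$_/>" } 1 .. 31;

my ($with_captures, $plain) = ($sample, $sample);

# In scalar context, s///g returns the number of substitutions it made.
my $n_captures = ($with_captures =~ s/(>)\s*(<)/$1\n$2/g);
my $n_plain    = ($plain         =~ s/>\s*</>\n</g);

print "substitutions: $n_captures vs $n_plain\n";
print "same result: ", ($with_captures eq $plain ? "yes" : "no"), "\n";
```

If the real input has about 30 matches per call, the 31 substcont calls per execution would be consistent with one call per match plus one extra. That reading is an inference from the profile numbers, not something the profiler reports directly.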
Comments (1)
substcont is Perl's internal name for the "substitution iterator". Something to do with s///. Based on what little information I have, it seems substcont is triggered when doing a backref. That is, when $1 is present. You can play with it a bit using B::Concise.
Here are the opcodes of a simple regex without a backref.
And one with.
That's all I can offer. You may want to try Rx, mjd's old regex debugger.
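The opcode listings mentioned above are not reproduced here, but they can be regenerated locally. The following is a minimal sketch, assuming the core B::Concise module is available; `has_substcont` and the two anonymous subs are hypothetical names used only for this illustration. It renders each sub's optree into a string and reports whether a substcont op appears.

```perl
use strict;
use warnings;
use B::Concise ();

# Render a sub's optree into a string and report whether a substcont op shows up.
sub has_substcont {
    my ($code) = @_;
    my $out = '';
    B::Concise::walk_output(\$out);      # send the rendering to $out instead of STDOUT
    B::Concise::compile($code)->();      # build a walker for this coderef and run it
    return $out =~ /\bsubstcont\b/ ? 1 : 0;
}

# Replacement refers to $1 and $2, so it has to be evaluated for each match.
my $with_captures = sub { my $s = shift; $s =~ s/(>)\s*(<)/$1\n$2/g; $s };

# Replacement is a constant string.
my $constant = sub { my $s = shift; $s =~ s/>\s*</>\n</g; $s };

printf "replacement uses \$1/\$2: substcont %s\n",
    has_substcont($with_captures) ? "present" : "absent";
printf "constant replacement:   substcont %s\n",
    has_substcont($constant) ? "present" : "absent";
```

To read the full listing rather than a yes/no summary, print `$out` inside the helper, or run something like `perl -MO=Concise -e '...'` on a one-line version of each substitution.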