Why is this regex making so many calls to substcont?

Posted on 2024-09-03 06:37:42

This is more out of curiosity than anything else, as I can't find any useful information on Google about this function (CORE::substcont).

While profiling and optimising some old, slow XML parsing code, I found that the following regex calls substcont 31 times each time the line is executed, and takes a huge amount of time:

Calls: 10000 Time: 2.65s Sub calls: 320000 Time in subs: 1.15s

  $handle =~s/(>)\s*(<)/$1\n$2/g;
  # spent  1.09s making 310000 calls to main::CORE:substcont, avg 4µs/call
  # spent  58.8ms making  10000 calls to main::CORE:subst, avg 6µs/call

Compared to the immediately preceding line:

Calls: 10000 Time: 371ms Sub calls: 30000 Time in subs: 221ms

  $handle =~s/(.*)\s*(<\?)/$1\n$2/g;
    # spent   136ms making 10000 calls to main::CORE:subst, avg 14µs/call
    # spent  84.6ms making 20000 calls to main::CORE:substcont, avg 4µs/call

The number of substcont calls is quite surprising, especially as I would have expected the second regex to be the more expensive one. This is, obviously, why profiling is a Good Thing ;-)

I've subsequently changed both of these lines to remove the unnecessary backrefs, with dramatic results for the badly behaving line:

Calls: 10000 Time: 393ms Sub calls: 10000 Time in subs: 341ms

$handle =~s/>\s*</>\n</g;
  # spent   341ms making 10000 calls to main::CORE:subst, avg 34µs/call
So, my question is: why was the original making SO many calls to substcont, and what does substcont even do in the regex engine that takes so long?
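
Before dropping the captures, it is worth checking that both forms of the substitution give the same result. The sketch below does that on a made-up sample string; `$sample` and the other variable names are assumptions for illustration, since the real contents of `$handle` are not shown in the post.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real $handle contents are not shown in the post.
my $sample = "<a>   <b>text</b>\t<c/>  </a>";

# Original form: captures the literal '>' and '<' and re-inserts them via $1 and $2.
(my $with_captures = $sample) =~ s/(>)\s*(<)/$1\n$2/g;

# Rewritten form: the replacement is a plain string with no capture variables.
(my $without_captures = $sample) =~ s/>\s*</>\n</g;

print $with_captures eq $without_captures ? "same output\n" : "outputs differ\n";
```

Because the captured text is always the literal `>` and `<`, writing those characters directly in the replacement should be equivalent for any input, not just this sample.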

Comments (1)

旧伤还要旧人安 2024-09-10 06:37:42

substcont is Perl's internal name for the "substitution iterator", part of how s/// is implemented. Based on what little information I have, it seems substcont is triggered when the replacement uses a backref, that is, when $1 is present. You can explore this a bit using B::Concise.

Here are the opcodes for a simple substitution without a backref in the replacement.

$ perl -MO=Concise,-exec -we'$foo = "foo";  $foo =~ s/(foo)/bar/ig'
1  <0> enter 
2  <;> nextstate(main 1 -e:1) v:{
3  <$> const[PV "foo"] s
4  <#> gvsv[*foo] s
5  <2> sassign vKS/2
6  <;> nextstate(main 1 -e:1) v:{
7  <#> gvsv[*foo] s
8  <$> const[PV "bar"] s
9  </> subst(/"(foo)"/) vKS
a  <@> leave[1 ref] vKP/REFC
-e syntax OK

And one with a backref.

$ perl -MO=Concise,-exec -we'$foo = "foo";  $foo =~ s/(foo)/$1/ig'
1  <0> enter 
2  <;> nextstate(main 1 -e:1) v:{
3  <$> const[PV "foo"] s
4  <#> gvsv[*foo] s
5  <2> sassign vKS/2
6  <;> nextstate(main 1 -e:1) v:{
7  <#> gvsv[*foo] s
8  </> subst(/"(foo)"/ replstart->9) vKS
9      <#> gvsv[*1] s
a      <|> substcont(other->8) sK/1
b  <@> leave[1 ref] vKP/REFC
-e syntax OK

That's all I can offer. You may want to try Rx, mjd's old regex debugger.
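
One way to sanity-check the profile numbers is to count the matches. The figures in the question look consistent with roughly one substcont call per match plus one (31 calls would mean about 30 matches per string, and 2 calls would mean a single match), but that is my reading of the numbers, not something the profiler output states. The sketch below counts matches and replacement evaluations on a made-up string. The input and variable names are assumptions, and `/e` is used here only so that a counter can be incremented inside the replacement.

```perl
use strict;
use warnings;

# Hypothetical input with three '>...<' boundaries; not the original data.
my $handle = "<a> <b>x</b> <c/> </a>";

# Count the matches the pattern finds, without modifying the string.
my $matches = () = $handle =~ />\s*</g;

# Count how often a non-constant replacement is evaluated during s///g.
my $evaluations = 0;
(my $copy = $handle) =~ s/(>)\s*(<)/ $evaluations++; "$1\n$2" /ge;

print "matches: $matches, replacement evaluations: $evaluations\n";
```

If the real data has about 30 tag boundaries per string, the replacement is being evaluated about 30 times per line executed, which would account for the call count in the profile.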
