Is Pyre2 slower than the built-in re module?
When using pyre2 (https://github.com/axiak/pyre2), I encountered a performance problem (matching time).
I have three programs:
1. Pure Python using the built-in re module: https://gist.github.com/1873402
2. Python using Pyre2: https://gist.github.com/1873402 (most of the code is the same as program no. 1, except that the built-in re version decodes the UTF-8 input to unicode, which is NOT necessary when using pyre2)
3. C/C++ using re2: https://gist.github.com/1873417
I measured two times for each: regex pre-compile time and matching time.
no.1 program: 1.65s compile, 1.25s match
no.2 program: 0.04s compile, 1.8s match
no.3 program: 0.02s compile, 0.8s match
They are all fed the same regexes and input. (All the regexes are supported by re2.)
Then I followed the Cython documentation on profiling and got the following result:
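For reference, the profiling workflow looks roughly like this — a minimal sketch shown with the stdlib re module so it is self-contained; the actual run imported the pyre2 extension instead, built with the Cython directive `# cython: profile=True`, and used a hypothetical benchmark loop:

```python
# Minimal sketch of the cProfile workflow; the real run imported re2
# (compiled with "# cython: profile=True") instead of the stdlib re.
import cProfile
import pstats
import re

def bench():
    pat = re.compile(r"\w+")             # hypothetical simple pattern
    for line in ["hello world"] * 1000:  # stand-in for the real input
        pat.search(line)

# Profile the benchmark and dump raw stats to a file.
cProfile.runctx("bench()", globals(), locals(), "Profile.prof")

# Load the stats and print the hottest entries by internal time.
stats = pstats.Stats("Profile.prof")
stats.sort_stats("time").print_stats(10)
```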
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
652884   16.477    0.000   25.349    0.000 re2.pyx:394(_search)
  9479    6.059    0.001   41.806    0.004 export_plain.py:60(match)
652884    4.243    0.000   33.602    0.000 {method 'search' of 're2.Pattern' objects}
652884    4.010    0.000   29.359    0.000 re2.pyx:442(search)
652884    3.056    0.000    3.056    0.000 re2.pyx:114(__init__)
652953    2.145    0.000    2.145    0.000 {isinstance}
652884    2.002    0.000    2.002    0.000 re2.pyx:123(__dealloc__)
652953    1.911    0.000    1.911    0.000 re2.pyx:75(unicode_to_bytestring)
652953    1.902    0.000    1.902    0.000 re2.pyx:86(pystring_to_bytestring)
     1    0.330    0.330   42.492   42.492 export_plain.py:98(export_fname)
  9479    0.173    0.000    0.173    0.000 {built-in method sub}
 10000    0.120    0.000    0.120    0.000 {method 'split' of 'str' objects}
  8967    0.063    0.000    0.099    0.000 re2.pyx:801(get)
 10069    0.061    0.000    0.061    0.000 {method 'strip' of 'str' objects}
    69    0.043    0.001    0.146    0.002 re2.pyx:806(prepare_pattern)
  9036    0.038    0.000    0.038    0.000 re2.pyx:788(__next)
    69    0.022    0.000    0.169    0.002 re2.pyx:905(_compile)
     1    0.005    0.005    0.177    0.177 export_plain.py:36(load)
    69    0.002    0.000    0.003    0.000 re2.pyx:784(__init__)
    69    0.001    0.000    0.170    0.002 re2.pyx:763(compile)
    38    0.001    0.000    0.001    0.000 {method 'write' of 'file' objects}
    69    0.001    0.000    0.171    0.002 {re2.compile}
     1    0.001    0.001   42.669   42.669 export_plain.py:160(main)
     3    0.000    0.000    0.000    0.000 {open}
    69    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
    19    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
     1    0.000    0.000    0.000    0.000 genericpath.py:38(isdir)
     1    0.000    0.000   42.669   42.669 export_plain.py:153(run_re2_test)
     1    0.000    0.000    0.000    0.000 {posix.stat}
     4    0.000    0.000    0.000    0.000 {time.time}
     1    0.000    0.000    0.000    0.000 posixpath.py:59(join)
     1    0.000    0.000   42.670   42.670 <string>:1(<module>)
     1    0.000    0.000    0.000    0.000 {method 'encode' of 'unicode' objects}
     3    0.000    0.000    0.000    0.000 {method 'rfind' of 'str' objects}
     2    0.000    0.000    0.000    0.000 posixpath.py:109(basename)
     1    0.000    0.000    0.000    0.000 posixpath.py:117(dirname)
     1    0.000    0.000    0.000    0.000 stat.py:40(S_ISDIR)
     2    0.000    0.000    0.000    0.000 {len}
     1    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
     1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
     1    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
     1    0.000    0.000    0.000    0.000 stat.py:24(S_IFMT)
     1    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
It looks like the _search function (re2.pyx:393) takes up too much time, but I don't see how it can be so different from the pure C version.
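A quick back-of-envelope on the profile shows where the gap comes from: the per-call overhead on the Python side is tiny, but it is paid roughly 650k times (numbers taken from the `{method 'search' of 're2.Pattern' objects}` row above):

```python
# Rough per-call cost of pyre2's search(), from the profile numbers above.
calls = 652884       # ncalls of re2.Pattern.search in the profile
cumtime = 33.602     # its cumtime in seconds
per_call_us = cumtime / calls * 1e6
print("%.1f microseconds per search() call" % per_call_us)  # ~51.5 us
```

At tens of microseconds per call, object allocation and argument conversion on the Python/Cython boundary can easily dominate the raw re2 matching time that the C program pays directly.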
PS:
Pyre2 revision: commit 543f228
re2 revision: changeset 79:0c439a6bd795
I guess the actual Match function (re2.pyx:424) costs most of the time spent in this function.
Then I refactored the Match call into a cdef function _my_match so that I can see it in the profile results, and also refactored the StringPiece allocation into a cdef function _alloc_sp. (Modification details: https://gist.github.com/1873993) Re-profiling gives:
Mon Feb 20 20:52:47 2012    Profile.prof

        3975043 function calls in 28.265 CPU seconds

  Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
652884   10.060    0.000   20.230    0.000 re2.pyx:452(search)
652884    4.131    0.000   28.201    0.000 {method 'search' of 're2.Pattern' objects}
652884    3.647    0.000    3.647    0.000 re2.pyx:394(_my_match)
  9479    3.037    0.000   31.238    0.003 export_plain.py:62(match)
652884    2.901    0.000    2.901    0.000 re2.pyx:443(_alloc_sp)
652953    1.814    0.000    1.814    0.000 re2.pyx:86(pystring_to_bytestring)
652953    1.808    0.000    1.808    0.000 re2.pyx:75(unicode_to_bytestring)
     1    0.332    0.332   31.926   31.926 export_plain.py:96(export_fname)
  9479    0.169    0.000    0.169    0.000 {built-in method sub}
 10000    0.122    0.000    0.122    0.000 {method 'split' of 'str' objects}
  8967    0.065    0.000    0.099    0.000 re2.pyx:849(get)
 10069    0.064    0.000    0.064    0.000 {method 'strip' of 'str' objects}
    69    0.042    0.001    0.142    0.002 re2.pyx:854(prepare_pattern)
  9036    0.035    0.000    0.035    0.000 re2.pyx:836(__next)
    69    0.023    0.000    0.166    0.002 re2.pyx:953(_compile)
     1    0.003    0.003   32.103   32.103 export_plain.py:158(main)
     1    0.003    0.003    0.174    0.174 export_plain.py:36(load)
    69    0.002    0.000    0.168    0.002 re2.pyx:811(compile)
    38    0.001    0.000    0.001    0.000 {method 'write' of 'file' objects}
    69    0.001    0.000    0.169    0.002 {re2.compile}
    69    0.001    0.000    0.001    0.000 re2.pyx:832(__init__)
     1    0.001    0.001   32.104   32.104 export_plain.py:151(run_re2_test)
     1    0.000    0.000   32.105   32.105 <string>:1(<module>)
     2    0.000    0.000    0.000    0.000 {len}
     3    0.000    0.000    0.000    0.000 {open}
     1    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
    69    0.000    0.000    0.000    0.000 {isinstance}
    69    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
    19    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
     4    0.000    0.000    0.000    0.000 {time.time}
     1    0.000    0.000    0.000    0.000 {method 'encode' of 'unicode' objects}
     1    0.000    0.000    0.000    0.000 posixpath.py:59(join)
     1    0.000    0.000    0.000    0.000 {posix.stat}
     1    0.000    0.000    0.000    0.000 genericpath.py:38(isdir)
     2    0.000    0.000    0.000    0.000 posixpath.py:109(basename)
     3    0.000    0.000    0.000    0.000 {method 'rfind' of 'str' objects}
     1    0.000    0.000    0.000    0.000 posixpath.py:117(dirname)
     1    0.000    0.000    0.000    0.000 stat.py:40(S_ISDIR)
     1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
     1    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
     1    0.000    0.000    0.000    0.000 {method '__enter__' of 'file' objects}
     1    0.000    0.000    0.000    0.000 stat.py:24(S_IFMT)
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
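The effect of this refactor can be mimicked in plain Python: extracting a hot inner step into its own function makes it appear as a separate row in the profile, which is why _my_match and _alloc_sp now show up by name. The names below are illustrative only, not the actual re2.pyx code:

```python
# Plain-Python analogue of extracting _my_match/_alloc_sp: once the hot
# inner step is its own function, cProfile reports it on its own row.
import cProfile
import pstats

def _inner_step(x):   # stands in for the extracted cdef helper
    return x * x

def work():
    return sum(_inner_step(i) for i in range(10000))

result = work()
cProfile.runctx("work()", globals(), locals(), "refactor.prof")
# Restrict the printout to the extracted helper, now visible by name.
pstats.Stats("refactor.prof").sort_stats("time").print_stats("_inner_step")
```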
But search still takes up a lot of time (10.060 in tottime). Can anyone figure out what the problem is?
Well, it depends... pyre2 is asymptotically faster, but not necessarily faster for every specific regex. The reason is that re2 builds an NFA and walks all active states simultaneously, while re (if I am correct) tries only one path through the NFA at a time and, when that fails, backtracks and tries a different one. This means re can do all sorts of fancy stuff like lookaheads, because it always remembers the path that is matching the given string, while re2 only remembers the current active states. So re2 will tell you whether the string matches the regex, but it cannot do all the fancy computations with groups that you can do with re. As a consequence, pyre2 has linear asymptotic time complexity (at the cost of not supporting some of the syntax of the built-in re), while re has exponential worst-case complexity. However, that doesn't mean pyre2 has to perform better on basic, simple regexes.
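The backtracking blow-up described above is easy to provoke with the built-in re module (a toy example; an NFA-based engine like re2 handles the same pattern in linear time):

```python
# A pattern with nested quantifiers, (a+)+$, forces a backtracking engine
# to explore exponentially many paths on a near-miss input: the trailing
# "b" means every way of splitting the a's between the quantifiers fails.
import re
import time

pat = re.compile(r"(a+)+$")
for n in (16, 18, 20):
    s = "a" * n + "b"   # almost matches, so every path gets tried
    t0 = time.time()
    assert pat.search(s) is None
    print(n, "->", round(time.time() - t0, 4), "s")  # grows exponentially in n
```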
One more thing to keep in mind: did you download pyre2 from the facebook repository or from the python package index?
If you downloaded it from the python package index, it will fall back to the built-in re library whenever it cannot handle a given regex (so I suppose there might be some small overhead there) - and in any case, if you are matching regexes that pyre2 doesn't support, it falls back to re and will not perform any better.
So without seeing your regexes it is hard to say, but my guess is that re2 is slower for one of the following reasons:
either your regexes and the strings you match against them are very simple (in which case there is no advantage to using re2),
or you downloaded pyre2 from pypi and you are capturing groups and using lookaheads, which re2 doesn't support, so it falls back to re.
Since you managed to compile the same regexes in the C re2 library, I would guess it's the first reason, not the second.