Longest prefix match for URLs
I need information about any standard Python package which can be used for "longest prefix match" on URLs. I have gone through the two packages http://packages.python.org/PyTrie/#pytrie.StringTrie and http://pypi.python.org/pypi/trie/0.1.1, but they don't seem to be useful for the longest-prefix-match task on URLs.
For example, if my set has these URLs: 1 -> http://www.google.com/mail, 2 -> http://www.google.com/document, 3 -> http://www.facebook.com, etc.
Now if I search for 'http://www.google.com/doc' then it should return 2 and search for 'http://www.face' should return 3.
I wanted to confirm whether there is any standard Python package which can help me do this, or whether I should implement a Trie for prefix matching.
I am not looking for a regular-expression kind of solution, since it does not scale as the number of URLs increases.
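To make the expected behaviour concrete, here is the kind of naive linear scan I want to avoid at scale (the ids and the helper name are just for illustration):

```python
import os.path

urls = {
    1: "http://www.google.com/mail",
    2: "http://www.google.com/document",
    3: "http://www.facebook.com",
}

def best_match(query):
    # id of the stored URL sharing the longest common character prefix with the query
    return max(urls, key=lambda i: len(os.path.commonprefix([urls[i], query])))

print(best_match("http://www.google.com/doc"))  # -> 2
print(best_match("http://www.face"))            # -> 3
```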
Thanks a lot.
Performance comparison

suffixtree vs. pytrie vs. trie vs. datrie vs. startswith-functions

Setup

The recorded time is the minimum time among 3 repetitions of 1000 searches. The trie construction time is included and spread among all searches. The search is performed on collections of hostnames of 1 to 1000000 items.

Three types of search string:
- non_existent_key - there is no match for the string
- rare_key - around 20 in a million
- frequent_key - the number of occurrences is comparable to the collection size

Results

Maximum memory consumption for a million urls:

To reproduce the results, run the trie benchmark code.

rare_key / non_existent_key case

If the number of urls is less than 10000 then datrie is the fastest; for N > 10000, suffixtree is faster on average and startswith is significantly slower.

frequent_key

Up to N = 100000, datrie is the fastest (for a million urls the time is dominated by the trie construction time).

The most time is taken by finding the longest match among the found matches, so all functions behave similarly, as expected.

startswith - time performance is independent of the type of key.

trie and pytrie behave similarly to each other.

Performance without trie construction time

datrie - the fastest, with decent memory consumption.

startswith is even more at a disadvantage here because the other approaches are not penalized by the time it takes to build a trie.

datrie, pytrie, trie - almost O(1) (constant time) for rare/non-existent keys.

Fitting (approximating) polynomials of known functions for comparison (same log/log scale as in the figures):
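As a usage sketch (not part of the benchmark code), the lookup for the question's example could be done with datrie roughly like this, assuming its documented Trie(alphabet), items(prefix) and longest_prefix() behaviour:

```python
import string
import datrie

# the question's example set: URL -> id
urls = {
    u"http://www.google.com/mail": 1,
    u"http://www.google.com/document": 2,
    u"http://www.facebook.com": 3,
}

# datrie requires the key alphabet up front; string.printable covers plain URLs
trie = datrie.Trie(string.printable)
for url, url_id in urls.items():
    trie[url] = url_id

def lookup(query):
    # case 1: the query is a prefix of one or more stored URLs (e.g. ".../doc")
    hits = trie.items(query)
    if hits:
        return hits[0][1]          # any hit extends the query; take the first
    # case 2: a stored URL is itself a prefix of the query
    try:
        return trie[trie.longest_prefix(query)]
    except KeyError:
        return None

print(lookup(u"http://www.google.com/doc"))  # -> 2
print(lookup(u"http://www.face"))            # -> 3
```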
This example is good for small url lists but does not scale well.
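A .startswith()-based scan of roughly this shape (a sketch, not necessarily the original code) illustrates the idea:

```python
urls = ["http://www.google.com/mail",
        "http://www.google.com/document",
        "http://www.facebook.com"]

def longest_prefix_match(search, urllist):
    # linear scan: keep every stored URL that the search string is a prefix of
    matches = [url for url in urllist if url.startswith(search)]
    return matches[0] if matches else None

print(longest_prefix_match("http://www.google.com/doc", urls))
# -> http://www.google.com/document
```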
An implementation using the trie module.
Result:
Or using PyTrie, which gives the same result, but the lists are ordered differently.
I'm beginning to think a radix tree / patricia tree would be better from a memory usage point of view. This is what a radix tree would look like:
Whereas the trie looks more like:
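To make the memory argument concrete, here is a small sketch of the per-character trie layout (the node representation and the "$" end marker are my own choices); a radix/Patricia tree would collapse every single-child chain into one edge:

```python
# per-character trie: one dict node for every distinct prefix of every URL
def build_trie(urls):
    root = {}
    for url in urls:
        node = root
        for ch in url:
            node = node.setdefault(ch, {})
        node["$"] = url   # end-of-URL marker ("$" assumed not to occur in the URLs)
    return root

def count_nodes(node):
    # count dict nodes, skipping the end-of-URL marker entries
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key != "$")

urls = ["http://www.google.com/mail",
        "http://www.google.com/document",
        "http://www.facebook.com"]

print(count_nodes(build_trie(urls)))  # one node per character of every distinct prefix
# A radix / Patricia tree stores each single-child chain as one edge labelled
# with the whole substring, which is where the memory saving comes from.
```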
The function below will return the index of the longest match. Other useful information can easily be extracted as well.
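(A sketch with the described behaviour, assuming "longest match" means the longest common character prefix; this is an illustration, not necessarily the original function:)

```python
def longest_match_index(query, urls):
    """Index into urls of the URL sharing the longest common prefix with query."""
    def common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n
    best_index, best_len = None, -1
    for i, url in enumerate(urls):
        n = common_prefix_len(query, url)
        if n > best_len:                 # best_len is the "other useful information"
            best_index, best_len = i, n
    return best_index

urls = ["http://www.google.com/mail",
        "http://www.google.com/document",
        "http://www.facebook.com"]
print(longest_match_index("http://www.google.com/doc", urls))  # -> 1
print(longest_match_index("http://www.face", urls))            # -> 2
```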
If you are willing to trade RAM for time performance then SuffixTree might be useful. It has nice algorithmic properties; for example, it allows the longest common substring problem to be solved in linear time.
If you always search for a prefix rather than an arbitrary substring, then you could add a unique prefix while populating SubstringDict():
Such usage of SuffixTree seems suboptimal, but it is 20-150 times faster (without SubstringDict()'s construction time) than @StephenPaulger's solution [which is based on .startswith()] on the data I've tried, and it could be good enough.
To install SuffixTree, run:
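The "unique prefix" trick can be illustrated without depending on SuffixTree's exact API: prepend a sentinel character that never occurs inside a URL, so a substring match that includes the sentinel is necessarily anchored at the start of a stored URL (the sentinel choice and names below are my own, not code from this answer):

```python
SENTINEL = "\n"  # assumed never to appear inside a URL

urls = ["http://www.google.com/mail",
        "http://www.google.com/document",
        "http://www.facebook.com"]

# store each URL with the sentinel prepended
corpus = [SENTINEL + url for url in urls]

def prefix_hits(query):
    # searching for SENTINEL + query as a *substring* can only match at the
    # very start of a stored URL, so it behaves like a prefix search
    needle = SENTINEL + query
    return [i for i, s in enumerate(corpus) if needle in s]

print(prefix_hits("http://www.google.com/doc"))  # -> [1]
print(prefix_hits("http://www.face"))            # -> [2]
```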