Longest prefix match on URLs

I need information about any standard Python package which can be used for "longest prefix match" on URLs. I have gone through two packages, http://packages.python.org/PyTrie/#pytrie.StringTrie and http://pypi.python.org/pypi/trie/0.1.1, but they don't seem to be useful for the longest-prefix-match task on URLs.

Example: if my set has these URLs, 1->http://www.google.com/mail, 2->http://www.google.com/document, 3->http://www.facebook.com, etc.

Now, if I search for 'http://www.google.com/doc' it should return 2, and a search for 'http://www.face' should return 3.

I want to confirm whether there is any standard Python package that can help me do this, or whether I should implement a trie for prefix matching.

I am not looking for a regular-expression kind of solution, since that does not scale as the number of URLs increases.
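
For reference, a small self-test that pins down the behaviour I want (a naive linear scan, shown only to make the semantics precise; the names are illustrative):

from os.path import commonprefix

urls = [
    'http://www.google.com/mail',      # 1
    'http://www.google.com/document',  # 2
    'http://www.facebook.com',         # 3
]

def longest_prefix_match(query, urls):
    # Return the 1-based index of the URL sharing the longest prefix
    # with the query. O(len(urls)) per lookup - a reference only.
    best = max(range(len(urls)),
               key=lambda i: len(commonprefix([query, urls[i]])))
    return best + 1

assert longest_prefix_match('http://www.google.com/doc', urls) == 2
assert longest_prefix_match('http://www.face', urls) == 3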

Thanks a lot.

4 Answers

染墨丶若流云 2024-11-02 12:36:27

Performance comparison

suffixtree vs. pytrie vs. trie vs. datrie vs. startswith functions

Setup

The recorded time is the minimum over 3 repetitions of 1000 searches. Trie construction time is included and spread across all searches. The searches are performed on collections of hostnames ranging from 1 to 1000000 items. (A minimal sketch of such a timing harness follows the list below.)

Three types of search string:

  • non_existent_key - there is no match for the string
  • rare_key - around 20 matches in a million
  • frequent_key - the number of occurrences is comparable to the collection size
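
A minimal sketch of that timing harness (my own illustration, not the benchmark's actual code; search_func and keys stand for the candidate under test and the query set):

import timeit

def measure(search_func, keys, repeat=3, number=1000):
    # Minimum wall time over `repeat` repetitions of `number` search
    # passes - the min-of-3 x 1000-searches scheme described above.
    timer = timeit.Timer(lambda: [search_func(k) for k in keys])
    return min(timer.repeat(repeat=repeat, number=number))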

Results

Maximum memory consumption for a million URLs:

| function    | memory, GiB | ratio |
|-------------+-------------+-------|
| suffix_tree |       0.853 |   1.0 |
| pytrie      |       3.383 |   4.0 |
| trie        |       3.803 |   4.5 |
| datrie      |       0.194 |   0.2 |
| startswith  |       0.069 |   0.1 |
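
The memory figures above can be collected in several ways; one plausible approach on Linux (an assumption, not necessarily how the numbers above were produced) is to read the process's peak resident set size after building the structure and running the searches:

import resource  # Unix-only module

def peak_rss_gib():
    # ru_maxrss is reported in KiB on Linux (bytes on macOS);
    # convert KiB to GiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)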

To reproduce the results, run the trie benchmark code.

  • rare_key/nonexistent_key case

    If the number of URLs is less than 10000 then datrie is the fastest; for
    N > 10000, suffixtree is faster on average and startswith is significantly slower.

[figure: rare_key]

  • axes:

    • the vertical (time) scale is ~1 second (2**20 microseconds)
    • the horizontal axis shows the total number of URLs in each case: N = 1, 10, 100, 1000, 10000, 100000, and 1000000 (a million).

[figure: nonexistent_key]

  • frequent_key case

    Up to N=100000, datrie is the fastest (for a million URLs the time is
    dominated by the trie construction time).

    Most of the time is taken by finding the longest match among the matches found, so all functions behave similarly, as expected.

[figure: frequent_key]

startswith - time performance is independent of the key type.

trie and pytrie behave similarly to each other.

Performance without trie construction time

  • datrie - the fastest, with decent memory consumption

  • startswith is at even more of a disadvantage here, because the other approaches are not penalized by the time it takes to build a trie.

  • datrie, pytrie, trie - almost O(1) (constant time) for rare/non-existent keys

[figure: rare_key_no_trie_build_time]
[figure: nonexistent_key_no_trie_build_time]

[figure: frequent_key_no_trie_build_time]

Fitting (approximating) polynomials of known functions for comparison (same log/log scale as in the figures). On a log/log plot, N**k appears as a straight line of slope k, so the fitted slopes below identify the growth rates:

| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 0.15  log2(N)   +      1.583 | log2(N)           |
| 0.30  log2(N)   +      3.167 | log2(N)*log2(N)   |
| 0.50  log2(N)   +  1.111e-15 | sqrt(N)           |
| 0.80  log2(N)   +  7.943e-16 | N**0.8            |
| 1.00  log2(N)   +  2.223e-15 | N                 |
| 2.00  log2(N)   +  4.446e-15 | N*N               |
夏雨凉 2024-11-02 12:36:27

This example is good for small URL lists but does not scale well.

def longest_prefix_match(search, urllist):
    # Linear scan: keep every URL that extends the search string,
    # then return the longest one.
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    raise ValueError("Not found")

An implementation using the trie module.

import trie


def longest_prefix_match(prefix_trie, search):
    # There may well be a more elegant way to do this without using
    # "hidden" method _getnode.
    try:
        return list(node.value for node in prefix_trie._getnode(search).walk())
    except KeyError:
        return list()

url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print("'%s' ->" % search, longest_prefix_match(url_trie, search))

Result:

'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []

Or using PyTrie, which gives the same result but with the lists ordered differently:

from pytrie import StringTrie


url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print("'%s' ->" % search, url_trie.values(prefix=search))

I'm beginning to think a radix tree / Patricia tree would be better from a memory-usage point of view. This is what the radix tree would look like:

[figure: radix tree of the example URLs]

Whereas the trie looks more like:
[figure: trie of the example URLs]
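
To make the comparison concrete, here is a minimal sketch of the radix-tree idea (illustrative, not a production implementation): edges carry substrings rather than single characters, so a shared prefix such as 'http://www.' is stored once instead of once per character.

from os.path import commonprefix

class RadixNode:
    def __init__(self):
        self.children = {}   # edge label (substring) -> child node
        self.is_key = False  # True if a stored URL ends at this node

def insert(node, s):
    for label in list(node.children):
        n = len(commonprefix([s, label]))
        if n == 0:
            continue
        if n < len(label):
            # Split the edge: 'label' becomes 'label[:n]' + 'label[n:]'.
            old_child = node.children.pop(label)
            mid = RadixNode()
            mid.children[label[n:]] = old_child
            node.children[label[:n]] = mid
            child = mid
        else:
            child = node.children[label]
        if n == len(s):
            child.is_key = True
        else:
            insert(child, s[n:])
        return
    # No edge shares a prefix with s: add a fresh leaf.
    leaf = RadixNode()
    leaf.is_key = True
    node.children[s] = leaf

def keys_with_prefix(node, query, acc=''):
    # All stored strings that start with `query`.
    if not query:
        out = [acc] if node.is_key else []
        for label, child in node.children.items():
            out += keys_with_prefix(child, '', acc + label)
        return out
    for label, child in node.children.items():
        n = len(commonprefix([query, label]))
        if n == len(query):           # query ends inside this edge
            return keys_with_prefix(child, '', acc + label)
        if n == len(label):           # edge consumed, keep matching
            return keys_with_prefix(child, query[n:], acc + label)
    return []

root = RadixNode()
for url in ['http://www.google.com/mail',
            'http://www.google.com/document',
            'http://www.facebook.com']:
    insert(root, url)

print(keys_with_prefix(root, 'http://www.fa'))   # ['http://www.facebook.com']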

来日方长 2024-11-02 12:36:27

The function below returns the index of the longest match; other useful information can easily be extracted as well. For the question's examples it prints 1 and 2 - the zero-based indices of URLs 2 and 3.

from os.path import commonprefix as oscp

def longest_prefix(s, slist):
    # Pair each URL's common prefix with the query together with its
    # index, then pick the entry whose common prefix is longest.
    pfx_idx = ((oscp([s, url]), i) for i, url in enumerate(slist))
    length, pfx, idx = max((len(p), p, i) for p, i in pfx_idx)
    return idx

slist = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix('http://www.google.com/doc', slist))
print(longest_prefix('http://www.face', slist))
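
As a small illustration of that "other useful information" remark, here is a variant (hypothetical, same linear-scan approach, reusing slist from above) that also returns the matched URL and the shared prefix:

from os.path import commonprefix as oscp

def longest_prefix_info(s, slist):
    # Same scan as above, but return (index, url, shared prefix).
    length, pfx, idx = max(
        (len(oscp([s, url])), oscp([s, url]), i) for i, url in enumerate(slist)
    )
    return idx, slist[idx], pfx

print(longest_prefix_info('http://www.google.com/doc', slist))
# -> (1, 'http://www.google.com/document', 'http://www.google.com/doc')
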
怪我闹别瞎闹 2024-11-02 12:36:27

If you are willing to trade RAM for time performance, then SuffixTree might be useful. It has nice algorithmic properties; for example, it can solve the longest-common-substring problem in linear time.

If you always search for a prefix rather than an arbitrary substring, then you could add a unique sentinel prefix while populating SubstringDict():

from SuffixTree import SubstringDict

substr_dict = SubstringDict()
for url in URLS: # urls must be ascii (valid urls are)
    assert '\n' not in url
    substr_dict['\n'+url] = url #NOTE: assume that '\n' can't be in a url

def longest_match(url_prefix, _substr_dict=substr_dict):
    matches = _substr_dict['\n'+url_prefix]
    return max(matches, key=len) if matches else ''

Such usage of SuffixTree seems suboptimal, but it is 20-150 times faster (without SubstringDict()'s construction time) than @StephenPaulger's solution (which is based on .startswith()) on the data I've tried, and that could be good enough.

To install SuffixTree, run:

pip install SuffixTree -f https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees
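
For completeness, a hypothetical end-to-end run of the snippet above (this assumes a SubstringDict lookup returns the list of stored values whose keys contain the query, as the code above implies):

from SuffixTree import SubstringDict

URLS = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

substr_dict = SubstringDict()
for url in URLS:
    substr_dict['\n' + url] = url  # '\n' sentinel marks the string start

def longest_match(url_prefix, _substr_dict=substr_dict):
    matches = _substr_dict['\n' + url_prefix]
    return max(matches, key=len) if matches else ''

print(longest_match('http://www.google.com/doc'))
# expected: http://www.google.com/document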