Hashmap (O(1)) supporting joker/match-all keys

Published 2025-01-19 17:37:27 · 1356 characters · 2 views · 0 comments


The title is not so clear, because I cannot put my problem in one sentence (if you have a better title for this question, please suggest one). I'll try to clarify my requirement with an example:

Suppose I have a table like this:

| Origin | Destination | Airline   | Free Baggage |
===================================================
| NYC    | London      | American  | 20KG         |
---------------------------------------------------
| NYC    | *           | Southwest | 30KG         |
---------------------------------------------------
| *      | *           | Southwest | 25KG         |
---------------------------------------------------
| *      | LA          | *         | 20KG         |
---------------------------------------------------
| *      | *           | *         | 15KG         |
---------------------------------------------------
and so on ...

This table describes the free baggage amount that the airlines provide on different routes. You can see that some rows have a * value, meaning that they match all possible values (those values are not necessarily known in advance).

So we have a large list of baggage rules (like the table above) and a large list of flights (whose origin, destination, and airline are known), and we intend to find the baggage amount for each flight in the most efficient way (iterating over the list is obviously not efficient, as it costs O(N) per lookup). More than one rule may match a flight; in that case, assume that either the first match or the most specific one is preferred (whichever is simpler for you to work with).

If there were no * signs in the table, the problem would be easy: we could use a Hashmap or Dictionary with a Tuple of values as the key. But in the presence of those * (let's say match-all) keys, providing a general solution is not so straightforward.
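In other words, something like this minimal Python sketch (with illustrative data) would already work if every cell were concrete:

```python
# With no '*' cells, a plain dict keyed by the whole tuple gives O(1) lookups.
rules = {
    ("NYC", "London", "American"): "20KG",
    ("LA", "Paris", "Delta"): "25KG",
}

print(rules.get(("NYC", "London", "American")))   # prints: 20KG
print(rules.get(("NYC", "Dallas", "Southwest")))  # prints: None - no way to
                                                  # express a '*' fallback here
```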

Please note that the above is just an example; I need a solution that can be used for any number of key columns, not just three.

Do you have any idea or implementation for this problem, with a lookup method whose time complexity equals or approaches the O(1) of a regular hashmap (memory will not be an issue)? What would be the best possible solution?


Comments (3)

久光 2025-01-26 17:37:27


Regarding the comments, the more I think about it, the more it looks like a relational database with indexes rather than a hashmap...

A trivial, quite easy solution could be something like an in-memory SQLite database. But it would probably be something in O(log2(N)), not O(1). The main advantage is that it's easy to set up, and IF performance is good enough, it could be the final solution.
Here, the key is to use proper indexes, the LIKE operator, and of course well-defined JOIN clauses.
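A sketch of that in-memory variant (Python's sqlite3; the `IN (?, '*')` predicates and the precomputed `specificity` column used to prefer the most specific rule are my choices, not spelled out in the answer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE rules (
    origin TEXT, destination TEXT, airline TEXT, baggage TEXT,
    specificity INTEGER)""")  # specificity = number of non-'*' columns

ROWS = [
    ("NYC", "London", "American", "20KG"),
    ("NYC", "*", "Southwest", "30KG"),
    ("*", "*", "Southwest", "25KG"),
    ("*", "LA", "*", "20KG"),
    ("*", "*", "*", "15KG"),
]
for o, d, a, b in ROWS:
    con.execute("INSERT INTO rules VALUES (?, ?, ?, ?, ?)",
                (o, d, a, b, sum(v != "*" for v in (o, d, a))))
con.execute("CREATE INDEX idx_rules ON rules (origin, destination, airline)")

def lookup(origin, destination, airline):
    # Each column matches either the concrete value or the '*' joker;
    # the most specific rule wins.
    row = con.execute(
        "SELECT baggage FROM rules"
        " WHERE origin IN (?, '*') AND destination IN (?, '*')"
        " AND airline IN (?, '*')"
        " ORDER BY specificity DESC LIMIT 1",
        (origin, destination, airline)).fetchone()
    return row[0] if row else None

print(lookup("NYC", "Dallas", "Southwest"))  # prints: 30KG
```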


From scratch, I can't think of any solution that, with N rows and M columns, isn't at least O(M)... But usually, you'll have way fewer columns than rows. Quickly (I may have skipped a detail; I'm writing this on the fly), I can propose this algorithm / container:

  1. Data must be stored in a vector-like container VECDATA, accessed by a simple index in O(1). Think about this as a primary key in databases, and we'll call it PK. Knowing PK gives you instantly, in O(1), the required data. You'll have N rows grand total.

  2. For each row NOT containing any *, you'll insert into a real hashmap called MAINHASH the pair (<tuple>, PK). This is your primary index, for exact results. It will be O(1), BUT the tuple you're searching for may not be in it... Obviously, you must maintain consistency between MAINHASH and VECDATA, with whatever is needed (mutexes, locks; it doesn't matter as long as both stay consistent).
    This hash contains at most N entries. Without any joker, it would act almost like a standard hashmap, but for the indirection to VECDATA. It's still O(1) in this case.

  3. For each searchable column, you'll build a specific index, dedicated to this column.
    The index has N entries. It will be a standard hashmap, but it MUST allow multiple values for a given key. That's quite a common container, so it shouldn't be an issue.
    For each row, the index entry will be: ( <VECDATA value>, PK ). The container is stored in a vector of indexes, INDEX[i] (with 0<=i<M).
    Same as MAINHASH, consistency must be enforced.

Obviously, all these indexes / subcontainers should be constructed when an entry is inserted into VECDATA, and saved on disk across sessions if needed - you don't want to reconstruct all this each time you start the application...


Searching a row

So, the user searches for a given tuple.

  1. Search it in MAINHASH. If found, return it, search done.
    Upgrade (see below): search also in CACHE before going to step #2.

  2. For each tuple element tuple[0<=i<M], search in INDEX[i] for both tuple[i] (returns a vector of PK, EXACT[i]) AND for * (returns another vector of PK, FUZZY[i]).

  3. With these two vectors, build another (temporary) hash TMPHASH, associating ( PK, integer COUNT ). It's quite simple: COUNT is initialized to 1 if the entry comes from EXACT, and 0 if it comes from FUZZY.

  4. For the next column, build EXACT and FUZZY (see #2). But instead of making a new TMPHASH, you'll MERGE the results into the existing TMPHASH.
    The method is: if TMPHASH doesn't have this PK entry, trash the entry: it can't match at all. Otherwise, read the COUNT value, add 1 or 0 to it according to where it comes from, and reinject it into TMPHASH.

  5. Once all columns are done, you'll have to analyze TMPHASH.

Analyzing TMPHASH

First, if TMPHASH is empty, then you don't have any suitable answer. Return that to the user. If it contains only one entry, same thing: return it to the user directly.
For more than one element in TMPHASH:

  • Parse the whole TMPHASH container, searching for the maximum COUNT. Maintain in memory the PK associated with the current maximum COUNT.
  • Developer's choice: in case of multiple COUNTs at the same maximum value, you can either return them all, return the first one, or return the last one.
  • COUNT is obviously always strictly lower than M (otherwise, you would have found the tuple in MAINHASH). This value, compared to M, can give a confidence mark for your result (= 100*COUNT/M % confidence).
  • You can also now store the original tuple searched, and the corresponding PK, in another hashmap called CACHE.
    Since it would be way too complicated to properly update CACHE when adding/modifying something in VECDATA, simply purge CACHE when that happens. It's only a cache, after all...
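Putting steps 1-5 together, a compact sketch of the whole scheme (Python, three columns for concreteness; the container names VECDATA / MAINHASH / INDEX follow the answer, and ties on COUNT are broken arbitrarily by `max()`):

```python
from collections import defaultdict

# Rules from the question: (origin, destination, airline) -> free baggage.
RULES = [
    (("NYC", "London", "American"), "20KG"),
    (("NYC", "*", "Southwest"), "30KG"),
    (("*", "*", "Southwest"), "25KG"),
    (("*", "LA", "*"), "20KG"),
    (("*", "*", "*"), "15KG"),
]

M = 3
VECDATA = []                                   # PK -> (key tuple, baggage)
MAINHASH = {}                                  # exact tuple -> PK
INDEX = [defaultdict(list) for _ in range(M)]  # per column: value -> [PK...]

for key, baggage in RULES:
    pk = len(VECDATA)
    VECDATA.append((key, baggage))
    if "*" not in key:
        MAINHASH[key] = pk                     # step 2: exact-row index
    for i, value in enumerate(key):
        INDEX[i][value].append(pk)             # step 3: per-column index

def lookup(query):
    pk = MAINHASH.get(query)                   # step 1: exact match
    if pk is not None:
        return VECDATA[pk][1]
    tmphash = None
    for i, value in enumerate(query):          # steps 2-4: merge per column
        exact = INDEX[i].get(value, [])
        fuzzy = INDEX[i].get("*", [])
        col = {p: 1 for p in exact}            # exact hit counts as 1
        for p in fuzzy:
            col.setdefault(p, 0)               # '*' hit counts as 0
        if tmphash is None:
            tmphash = col
        else:                                  # keep only still-possible PKs
            tmphash = {p: c + col[p] for p, c in tmphash.items() if p in col}
    if not tmphash:                            # step 5: pick the best COUNT
        return None
    best = max(tmphash, key=tmphash.get)
    return VECDATA[best][1]

print(lookup(("NYC", "Dallas", "Southwest")))  # prints: 30KG
```

With the question's table, the query walks the three column indexes and lands on the most specific rule, NYC/*/Southwest.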

This is quite complex to implement if the language doesn't help you, in particular by allowing you to redefine operators and providing all the base containers, but it should work.

Exact matches / cached matches are in O(1). Fuzzy search is in O(n·M), n being the number of matching rows (with 0<=n<N, of course).

Without further research, I can't see anything better than that. It will consume an obscene amount of memory, but you said that won't be an issue.

嘿看小鸭子会跑 2025-01-26 17:37:27


I would recommend doing this with Tries decorated with a little extra data. For routes, you want to know the lowest route ID, so we can match the first available route. For flights, you want to track how many flights are left to match.

What this allows you to do, for instance, is to realize ONLY ONCE, partway through the match, that flights from city1 to city2 might match routes that start with city1, city2 or city1, * or *, city2 or *, *, without having to repeat that logic for each route or flight.

Here is a proof of concept in Python:

import heapq
import weakref

class Flight:
    def __init__(self, fields, flight_no):
        self.fields = fields
        self.flight_no = flight_no

class Route:
    def __init__(self, route_id, fields, baggage):
        self.route_id = route_id
        self.fields = fields
        self.baggage = baggage

class SearchTrie:
    def __init__(self, value=0, item=None, parent=None):
        # value = # unmatched flights for flights
        # value = lowest route id for routes.
        self.value = value
        self.item = item
        self.trie = {}
        self.parent = None
        if parent:
            self.parent = weakref.ref(parent)

    def add_flight (self, flight, i=0):
        self.value += 1
        fields = flight.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(0, None, self)
            self.trie[fields[i]].add_flight(flight, i+1)
        else:
            self.item = flight

    def remove_flight(self):
        self.value -= 1
        if self.parent and self.parent():
            self.parent().remove_flight()

    def add_route (self, route, i=0):
        route_id = route.route_id
        fields = route.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(route_id)
            self.trie[fields[i]].add_route(route, i+1)
        else:
            self.item = route

    def match_flight_baggage(route_search, flight_search):
        # Construct a heap of one search to do.
        tmp_id = 0
        todo = [((0, tmp_id), route_search, flight_search)]
        # This will hold by flight number, baggage.
        matched = {}

        while 0 < len(todo):
            priority, route_search, flight_search = heapq.heappop(todo)
            if 0 == flight_search.value: # There are no flights left to match
                # Already matched all flights.
                pass
            elif flight_search.item is not None:
                # We found a match!
                matched[flight_search.item.flight_no] = route_search.item.baggage
                flight_search.remove_flight()
            else:
                for key, r_search in route_search.trie.items():
                    if key == '*': # Found wildcard.
                        for a_search in flight_search.trie.values():
                            if 0 < a_search.value:
                                heapq.heappush(todo, ((r_search.value, tmp_id), r_search, a_search))
                                tmp_id += 1
                    elif key in flight_search.trie and 0 < flight_search.trie[key].value:
                        heapq.heappush(todo, ((r_search.value, tmp_id), r_search, flight_search.trie[key]))
                        tmp_id += 1

        return matched

# Sample data - the id is the position.
route_data = [
    ["NYC", "London", "American", "20KG"],
    ["NYC", "*", "Southwest", "30KG"],
    ["*", "*", "Southwest", "25KG"],
    ["*", "LA", "*", "20KG"],
    ["*", "*", "*", "15KG"],
]
routes = []
for i in range(len(route_data)):
    data = route_data[i]
    routes.append(Route(i, [data[0], data[1], data[2]], data[3]))

flight_data = [
    ["NYC", "London", "American"],
    ["NYC", "Dallas", "Southwest"],
    ["Dallas", "Houston", "Southwest"],
    ["Denver", "LA", "American"],
    ["Denver", "Houston", "American"],
]
flights = []
for i in range(len(flight_data)):
    data = flight_data[i]
    flights.append(Flight([data[0], data[1], data[2]], i))

# Convert to searches.
flight_search = SearchTrie()
for flight in flights:
    flight_search.add_flight(flight)

route_search = SearchTrie()
for route in routes:
    route_search.add_route(route)


print(route_search.match_flight_baggage(flight_search))
岛歌少女 2025-01-26 17:37:27


As Wisblade notices in his answer, for an array of N rows and M columns the best possible complexity is O(M). You can get O(1) only if you consider M to be a constant.

You can easily solve your problem in O(2^M) which is practical for a small M and is effectively O(1) if you consider M to be a constant.

Create a single hashmap which contains (as keys) strings of concatenated column values, possibly separated by some special character, e.g. a slash:

map.put("NYC/London/American", "20KG");
map.put("NYC/*/Southwest", "30KG");
map.put("*/*/Southwest", "25KG");
map.put("*/LA/*", "20KG");
map.put("*/*/*", "15KG");

Then, when you query, you try different combinations of actual data and wildcard characters. E.g. let's assume you want to query NYC/LA/Southwest; then you try the following combinations:

map.get("NYC/LA/Southwest"); // null
map.get("NYC/LA/*"); // null
map.get("NYC/*/Southwest"); // found: 30KG

If the answer in the third step had been null, you would continue as follows:

map.get("NYC/*/*"); // null
map.get("*/LA/Southwest"); // null
map.get("*/LA/*"); // found: 20KG

And there still remain two options:

map.get("*/*/Southwest"); // found: 25KG
map.get("*/*/*"); // found: 15KG

Basically, for three data columns you have 2^3 = 8 possible keys to check in the hashmap -- not bad! And you may find an answer much earlier.
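The enumeration generalizes to any number of columns M by trying all 2^M wildcard patterns. Sticking with Python for consistency with the other answer's PoC (a tuple key stands in for the slash-joined string, which also avoids separator-escaping issues), and trying the fewest wildcards first as one way to realize "most specific first":

```python
from itertools import product

rules = {
    ("NYC", "London", "American"): "20KG",
    ("NYC", "*", "Southwest"): "30KG",
    ("*", "*", "Southwest"): "25KG",
    ("*", "LA", "*"): "20KG",
    ("*", "*", "*"): "15KG",
}

def lookup(query):
    m = len(query)
    # Try all 2^m wildcard patterns, fewest '*' positions first,
    # so the most specific rule wins.
    for n_wild in range(m + 1):
        for mask in product((False, True), repeat=m):
            if sum(mask) != n_wild:
                continue
            key = tuple("*" if w else v for v, w in zip(query, mask))
            if key in rules:
                return rules[key]
    return None

print(lookup(("NYC", "LA", "Southwest")))  # prints: 30KG
```

Each probe is a single O(1) hashmap lookup, so a miss costs at most 2^M probes, independent of the number of rules.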
