PostgreSQL / TypeORM: search array in array column - return only the highest array intersection

Posted on 2025-01-31 14:43:03 | 3444 characters | 1 view | 0 comments

Let's say we have 2 edges in a graph. Each of them has many events observed on it, and each event has one or several tags associated with it:

Let's say the first edge had 7 events with these tags: ABC, ABC, AC, BC, A, A, B.

Second edge had 3 events: BC, BC, C.

We want the user to be able to search

  • how many events occurred on every edge
  • by a set of given tags, which are neither mutually exclusive nor in a strict hierarchical relationship.

We represent this schema with 2 pre-aggregated tables:

Edges table:

+----+
| id |   
+----+
| 1  |
| 2  |  
+----+

EdgeStats table (which relates to the Edges table via edge_id):

+------+---------+-----------+---------------+
| id   | edge_id | tags      | metric_amount |
+------+---------+-----------+---------------+
| 1    | 1       | [A, B, C] | 7             |
| 2    | 1       | [A, B]    | 7             |
| 3    | 1       | [B, C]    | 5             |
| 4    | 1       | [A, C]    | 6             |
| 5    | 1       | [A]       | 5             |
| 6    | 1       | [B]       | 4             |
| 7    | 1       | [C]       | 4             |
| 8    | 1       | null      | 7             | //null represents aggregated stats for given edge, not important here.
| 9    | 2       | [B, C]    | 3             |
| 10   | 2       | [B]       | 2             |
| 11   | 2       | [C]       | 3             |
| 12   | 2       | null      | 3             |
+------+---------+-----------+---------------+

Note that when the table has a row with tags [A, B], for example, it represents the number of events that had at least one of these tags associated with them: A or B, or both.
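To make the OR semantics concrete, here is a small sketch that recomputes metric_amount from the raw events listed at the top. The RawEvent type and the toEvents and metricAmount helpers are illustrative names only, not part of the real schema.

```typescript
// Hypothetical in-memory model of the raw events from the example.
type RawEvent = { edgeId: number; tags: string[] };

// Turn shorthand such as "ABC" into an event tagged ["A", "B", "C"].
const toEvents = (edgeId: number, shorthand: string[]): RawEvent[] =>
  shorthand.map((s) => ({ edgeId, tags: s.split("") }));

const events: RawEvent[] = toEvents(1, ["ABC", "ABC", "AC", "BC", "A", "A", "B"])
  .concat(toEvents(2, ["BC", "BC", "C"]));

// Count the events on one edge that carry at least one of the given tags.
function metricAmount(edgeId: number, tags: string[]): number {
  return events.filter(
    (e) => e.edgeId === edgeId && e.tags.some((t) => tags.indexOf(t) !== -1)
  ).length;
}
```

With this model, metricAmount(1, ["A", "C"]) and metricAmount(1, ["A", "C", "D"]) give the same count, because no event on edge 1 carries tag D.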

Because the user can filter by any combination of these tags, DataTeam populated EdgeStats table with all permutations of tags observed per given edge (strictly speaking, every non-empty subset of them). Edges are completely independent of each other; however, I am looking for a way to query all edges with one query.

I need to filter this table by the tags that the user selected, let's say [A, C, D]. The problem is that we don't have tag D in the data. The expected result is:

+------+---------+-----------+---------------+
| id   | edge_id | tags      | metric_amount |
+------+---------+-----------+---------------+
| 4    | 1       | [A, C]    | 6             |
| 11   | 2       | [C]       | 3             |
+------+---------+-----------+---------------+

That is, for each edge, the highest matching subset between what the user searched for and what we have in the tags column. Rows with id 5 and 7 are not returned because the information about them is already contained in row 4.

Why return [A, C] for an [A, C, D] search? Because there is no data on edge 1 with tag D, the metric amount for [A, C] equals the one for [A, C, D].
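The selection rule described above can be sketched in memory as follows. This is only an illustration of the expected result, not a database query; the EdgeStat type and the highestMatchPerEdge function are hypothetical names, and the data is copied from the EdgeStats table above.

```typescript
type EdgeStat = { id: number; edgeId: number; tags: string[] | null; metricAmount: number };

const edgeStats: EdgeStat[] = [
  { id: 1, edgeId: 1, tags: ["A", "B", "C"], metricAmount: 7 },
  { id: 2, edgeId: 1, tags: ["A", "B"], metricAmount: 7 },
  { id: 3, edgeId: 1, tags: ["B", "C"], metricAmount: 5 },
  { id: 4, edgeId: 1, tags: ["A", "C"], metricAmount: 6 },
  { id: 5, edgeId: 1, tags: ["A"], metricAmount: 5 },
  { id: 6, edgeId: 1, tags: ["B"], metricAmount: 4 },
  { id: 7, edgeId: 1, tags: ["C"], metricAmount: 4 },
  { id: 8, edgeId: 1, tags: null, metricAmount: 7 },
  { id: 9, edgeId: 2, tags: ["B", "C"], metricAmount: 3 },
  { id: 10, edgeId: 2, tags: ["B"], metricAmount: 2 },
  { id: 11, edgeId: 2, tags: ["C"], metricAmount: 3 },
  { id: 12, edgeId: 2, tags: null, metricAmount: 3 },
];

// For each edge, keep the row whose tags are a subset of the search
// and whose tag array is the longest among such rows.
function highestMatchPerEdge(rows: EdgeStat[], search: string[]): EdgeStat[] {
  const best: { [edgeId: number]: EdgeStat } = {};
  rows.forEach((row) => {
    const tags = row.tags;
    if (tags === null) return; // per-edge totals are not candidates
    const isSubset = tags.every((t) => search.indexOf(t) !== -1);
    if (!isSubset) return;
    const current = best[row.edgeId];
    if (current === undefined || tags.length > (current.tags ?? []).length) {
      best[row.edgeId] = row;
    }
  });
  return Object.keys(best)
    .map((key) => best[Number(key)])
    .sort((a, b) => a.edgeId - b.edgeId);
}
```

Calling highestMatchPerEdge(edgeStats, ["A", "C", "D"]) selects rows 4 and 11, matching the expected result table.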

How do I write a query to return this?


If you can just answer the question above, you can ignore what's below:

If I only needed to filter by [A], [B], or [A, B], the problem would be trivial: I could just search for an exact array match:

  query.where("edge_stats.tags = :filter",
        {
          filter: ["A", "B"],
        }
      )

However, the EdgeStats table does not contain every tag combination the user can search by (there would be too many), so I need to find a more clever solution.

Here is a list of a few possible solutions, all imperfect:

  1. Try an exact match for every subset of the user's search term: if the user searches by tags [A, C, D], first query for [A, C, D]; if there is no exact match, try [C, D], [A, D], [A, C], and voila, we get a match!
  2. Use the <@ ("is contained by") operator:
  .where(
        "edge_stats.tags <@ :tags",
        {
          tags: ["A", "C", "D"],
        }
      )

This returns every row whose tags are a subset of [A, C, D], so rows 4, 5, 7 and 11. It would then be possible to filter out everything but the highest subset match in application code. But with this approach we couldn't use SUM and similar functions in the query, and returning more rows than needed is not good practice.

  3. An approach built on 2) and inspired by this answer:
      .where(
        "edge_stats.tags <@ :tags",
        {
          tags: ["A", "C", "D"],
        }
      )
      .addOrderBy("edge.id")
      .addOrderBy("CARDINALITY(edge_stats.tags)", "DESC")
      .distinctOn(["edge.id"]);

For every edge, this finds all rows whose tags are a subset of [A, C, D] and keeps the highest match, i.e. the one with the longest array (thanks to ordering by cardinality and selecting only one row per edge).

So the returned rows are indeed 4 and 11.
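For reference, the query builder chain in approach 3 corresponds roughly to a DISTINCT ON query in plain SQL. The sketch below only builds the SQL text and its parameter list; it does not talk to a database. The table name "EdgeStats", the text[] type of the tags column, and the buildHighestMatchQuery helper are assumptions made for illustration.

```typescript
// Assumed schema: "EdgeStats"(id, edge_id, tags text[], metric_amount).
function buildHighestMatchQuery(tags: string[]): { sql: string; params: string[][] } {
  const sql = [
    // One row per edge: the first row in the ORDER BY below wins.
    "SELECT DISTINCT ON (s.edge_id) s.id, s.edge_id, s.tags, s.metric_amount",
    'FROM "EdgeStats" s',
    // Keep only rows whose tags are contained in the searched tags.
    "WHERE s.tags <@ $1::text[]",
    // DISTINCT ON requires its expression to lead the ORDER BY;
    // the longest matching tag array comes first within each edge.
    "ORDER BY s.edge_id, cardinality(s.tags) DESC",
  ].join("\n");
  return { sql, params: [tags] };
}

const { sql, params } = buildHighestMatchQuery(["A", "C", "D"]);
```

Because the statement is self-contained, it could in principle be used as a subquery or CTE inside a larger query, which may keep the DISTINCT ON ordering separate from any outer grouping. Whether that is simpler in practice depends on the surrounding query.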

This approach is great, but when I use it as one filtering step of a much larger query, I need to add a bunch of groupBy statements, and it adds a bit more complexity than I would like.

I wonder if there could be a simpler approach that just gets the highest match between the array in the table's column and the array in the query argument?


Comments (1)

蛮可爱 2025-02-07 14:43:03

Your approach #3 should be fine, especially if you have an index on CARDINALITY(edge_stats.tags). However,

DataTeam populated EdgeStats table with all permutations of tags observed per given edge

If you're using a pre-aggregation approach instead of running your queries on the raw data, I would also recommend recording the "tags observed per given edge" in the Edges table.

That way, you can

SELECT s.edge_id, s.tags, s.metric_amount
FROM "EdgeStats" s
JOIN "Edges" e ON s.edge_id = e.id
WHERE s.tags = array_intersect(e.observed_tags, $1)

using the array_intersect function from here.
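The idea behind this query can be sketched in memory as follows: intersect each edge's observed tags with the search, then look for the one row whose tags equal that intersection exactly. The observedTags field mirrors the observed_tags column suggested above; the types and helper names are hypothetical, and only a few EdgeStats rows are copied here. The sketch sorts the intersection before comparing, on the assumption that stored tag arrays are sorted too: PostgreSQL array equality compares element by element, so the SQL version would also need both sides to list tags in the same order.

```typescript
type Edge = { id: number; observedTags: string[] };
type StatRow = { id: number; edgeId: number; tags: string[] | null };

const edges: Edge[] = [
  { id: 1, observedTags: ["A", "B", "C"] },
  { id: 2, observedTags: ["B", "C"] },
];

const statRows: StatRow[] = [
  { id: 1, edgeId: 1, tags: ["A", "B", "C"] },
  { id: 4, edgeId: 1, tags: ["A", "C"] },
  { id: 5, edgeId: 1, tags: ["A"] },
  { id: 7, edgeId: 1, tags: ["C"] },
  { id: 8, edgeId: 1, tags: null },
  { id: 9, edgeId: 2, tags: ["B", "C"] },
  { id: 11, edgeId: 2, tags: ["C"] },
  { id: 12, edgeId: 2, tags: null },
];

// Tags present in both arrays, returned in sorted order.
const intersectSorted = (a: string[], b: string[]): string[] =>
  a.filter((t) => b.indexOf(t) !== -1).sort();

// Order-sensitive comparison, like array equality in PostgreSQL.
const sameArray = (a: string[], b: string[]): boolean =>
  a.length === b.length && a.every((t, i) => t === b[i]);

function lookupByIntersection(search: string[]): StatRow[] {
  const result: StatRow[] = [];
  edges.forEach((edge) => {
    const wanted = intersectSorted(edge.observedTags, search);
    statRows.forEach((row) => {
      if (row.edgeId === edge.id && row.tags !== null && sameArray(row.tags, wanted)) {
        result.push(row);
      }
    });
  });
  return result;
}
```

For the search [A, C, D], the intersections are [A, C] for edge 1 and [C] for edge 2, so lookupByIntersection selects rows 4 and 11 with an exact match per edge and no ordering or DISTINCT ON step.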
