Reporting queries: best way to join multiple fact tables?

Posted on 2024-07-17 11:55:52

I'm working on a reporting system that allows the user to arbitrarily query a set of fact tables, constraining on multiple dimension tables for each fact table. I've written a query-builder class that automatically assembles all the correct joins and subqueries based on the constraint parameters, and everything works as designed.

But, I have a feeling that I'm not generating the most efficient queries. On a set of tables with a few million records, these queries take about 10 seconds to run, and I'd like to get them down in the less-than-one-second range. I have a feeling that, if I could get rid of the subqueries, the result would be much more efficient.

Rather than show you my actual schema (which is much more complicated), I'll show you an analogous example that illustrates the point without having to explain my whole application and data model.

Imagine that I have a database of concert information, with artists and venues. Users can arbitrarily tag the artists and the venues. So the schema looks like this:

concert
  id
  artist_id
  venue_id
  date

artist
  id
  name

venue
  id
  name

tag
  id
  name

artist_tag
  artist_id
  tag_id

venue_tag
  venue_id
  tag_id

Pretty simple.

Now let's say I want to query the database for all concerts happening within one month of today, by artists tagged with both 'techno' and 'trombone', at venues tagged with both 'cheap-beer' and 'great-mosh-pits'.

The best query I've been able to come up with looks like this:

SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  artist.id AS artist_id,
  artist.name AS artist_name,
  venue.id AS venue_id,
  venue.name AS venue_name
FROM
  concert
INNER JOIN artist
  ON artist.id = concert.artist_id
INNER JOIN venue
  ON venue.id = concert.venue_id
WHERE (
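  artist.id IN (
    -- one artist_tag row per required tag, tied together by artist_id
    -- (a single artist_tag row can never carry both tags at once)
    SELECT at_a.artist_id
    FROM artist_tag AS at_a
    INNER JOIN tag AS a ON (
      a.id = at_a.tag_id
      AND
      a.name = 'techno'
    ) INNER JOIN artist_tag AS at_b ON (
      at_b.artist_id = at_a.artist_id
    ) INNER JOIN tag AS b ON (
      b.id = at_b.tag_id
      AND
      b.name = 'trombone'
    )
  )
  AND
  venue.id IN (
    -- same pattern for the venue side
    SELECT vt_a.venue_id
    FROM venue_tag AS vt_a
    INNER JOIN tag AS a ON (
      a.id = vt_a.tag_id
      AND
      a.name = 'cheap-beer'
    ) INNER JOIN venue_tag AS vt_b ON (
      vt_b.venue_id = vt_a.venue_id
    ) INNER JOIN tag AS b ON (
      b.id = vt_b.tag_id
      AND
      b.name = 'great-mosh-pits'
    )
  )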
  AND
  concert.date BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)
)

The query works, but I really don't like having those multiple subqueries. If I could accomplish the same logic purely using JOIN logic, I have a feeling the performance would drastically improve.

In a perfect world, I'd be using a real OLAP server. But my customers will be deploying to MySQL or MSSQL or Postgres, and I can't guarantee that a compatible OLAP engine will be available. So I'm stuck using an ordinary RDBMS with a star schema.

Don't get too hung up on the details of this example (my real application has nothing to do with music, but it has multiple fact tables with an analogous relationship to the ones I've shown here). In this model, the 'artist_tag' and 'venue_tag' tables function as fact tables, and everything else is a dimension.

It's important to note, in this example, that the queries are much simpler to write if I only allow the user to constrain against a single artist_tag or venue_tag value. It only gets really tricky when I allow the queries to include AND logic, requiring multiple distinct tags.

So, my question is: what are the best techniques that you know of for writing efficient queries against multiple fact tables?

牛↙奶布丁 2024-07-24 11:55:52

My approach is a bit more generic, putting the filter parameters in tables and then using GROUP BY, HAVING and COUNT to filter the results. I've used this basic approach several times for some very sophisticated 'searching' and it works very well (for me grin).

I also don't join on the Artist and Venue dimension tables initially. I'd get the results as ids (just needing artist_tag and venue_tag), then join the results on the artist and venue tables to get those dimension values. (Basically, search for the entity ids in a subquery, then in an outer query get the dimension values you need. Keeping them separate should improve things...)

-- T-SQL (SQL Server) syntax: table variables hold the tag ids to match
DECLARE @artist_filter TABLE (
  tag_id INT
)

DECLARE @venue_filter TABLE (
  tag_id INT
)

INSERT INTO @artist_filter
SELECT id FROM tag
WHERE name IN ('techno','trombone')

INSERT INTO @venue_filter
SELECT id FROM tag
WHERE name IN ('cheap-beer','great-mosh-pits')


SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  concert.artist_id AS artist_id,
  concert.venue_id AS venue_id
FROM
  concert
INNER JOIN
  artist_tag
    ON artist_tag.artist_id = concert.artist_id
INNER JOIN
  @artist_filter AS [artist_filter]
    ON [artist_filter].tag_id = artist_tag.tag_id
INNER JOIN
  venue_tag
    ON venue_tag.venue_id = concert.venue_id
INNER JOIN
  @venue_filter AS [venue_filter]
    ON [venue_filter].tag_id = venue_tag.tag_id
WHERE
  -- the question's MySQL-style date arithmetic, written for T-SQL
  concert.date BETWEEN GETDATE() AND DATEADD(MONTH, 1, GETDATE())
GROUP BY
  concert.id,
  concert.date,
  concert.artist_id,
  concert.venue_id
HAVING
  -- keep a concert only when every tag in each filter was matched
  COUNT(DISTINCT [artist_filter].tag_id) = (SELECT COUNT(*) FROM @artist_filter)
  AND
  COUNT(DISTINCT [venue_filter].tag_id)  = (SELECT COUNT(*) FROM @venue_filter)

(I'm on a netbook and suffering for it, so I'll leave out the outer query getting the artist and venue names from the artist and venue tables grin)
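For anyone who wants to try the whole thing end to end, here is a minimal sketch that includes the outer query left out above. It is not the answerer's exact code: it assumes SQLite through Python's sqlite3 module purely so it can run anywhere, replaces the table variables with temp tables, passes the date range in as parameters, and uses sample rows (artist and venue names, dates) that are made up for illustration.

```python
import sqlite3

def build_db():
    """Create the question's schema in memory with a few made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE artist  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE venue   (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tag     (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT,
                              venue_id INT, date TEXT);
        CREATE TABLE artist_tag (artist_id INT, tag_id INT);
        CREATE TABLE venue_tag  (venue_id INT, tag_id INT);

        INSERT INTO tag VALUES (1,'techno'),(2,'trombone'),
                               (3,'cheap-beer'),(4,'great-mosh-pits');
        INSERT INTO artist VALUES (1,'Brass Machine'),(2,'Solo Synth');
        INSERT INTO venue  VALUES (1,'The Pit'),(2,'Wine Bar');
        INSERT INTO artist_tag VALUES (1,1),(1,2),(2,1);
        INSERT INTO venue_tag  VALUES (1,3),(1,4),(2,3);
        INSERT INTO concert VALUES (1,1,1,'2024-08-01'),(2,2,1,'2024-08-02'),
                                   (3,1,2,'2024-08-03'),(4,1,1,'2025-01-01');

        -- temp tables stand in for the T-SQL table variables
        CREATE TEMP TABLE artist_filter (tag_id INT);
        CREATE TEMP TABLE venue_filter  (tag_id INT);
        INSERT INTO artist_filter
            SELECT id FROM tag WHERE name IN ('techno','trombone');
        INSERT INTO venue_filter
            SELECT id FROM tag WHERE name IN ('cheap-beer','great-mosh-pits');
    """)
    return db

QUERY = """
SELECT m.concert_id, m.concert_date,
       artist.name AS artist_name, venue.name AS venue_name
FROM (
    -- inner query: find matching concerts by id only
    SELECT concert.id AS concert_id, concert.date AS concert_date,
           concert.artist_id AS artist_id, concert.venue_id AS venue_id
    FROM concert
    INNER JOIN artist_tag    ON artist_tag.artist_id = concert.artist_id
    INNER JOIN artist_filter ON artist_filter.tag_id = artist_tag.tag_id
    INNER JOIN venue_tag     ON venue_tag.venue_id = concert.venue_id
    INNER JOIN venue_filter  ON venue_filter.tag_id = venue_tag.tag_id
    WHERE concert.date BETWEEN :start AND :end
    GROUP BY concert.id, concert.date, concert.artist_id, concert.venue_id
    HAVING COUNT(DISTINCT artist_filter.tag_id) = (SELECT COUNT(*) FROM artist_filter)
       AND COUNT(DISTINCT venue_filter.tag_id)  = (SELECT COUNT(*) FROM venue_filter)
) AS m
-- outer query: only now touch the dimension tables for display values
INNER JOIN artist ON artist.id = m.artist_id
INNER JOIN venue  ON venue.id  = m.venue_id
ORDER BY m.concert_id
"""

def find_concerts(db, start, end):
    return db.execute(QUERY, {"start": start, "end": end}).fetchall()

if __name__ == "__main__":
    for row in find_concerts(build_db(), "2024-07-17", "2024-08-17"):
        print(row)
```

With this sample data only concert 1 survives: concert 2's artist lacks 'trombone', concert 3's venue lacks 'great-mosh-pits', and concert 4 falls outside the date range.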

EDIT
Note:

Another option would be to filter the artist_tag and venue_tag tables in sub-queries/derived-tables. Whether this is worth it depends on how influential the join on the Concert table is. My assumption here is that there are MANY artists and venues, but once filtered on the concert table (itself filtered by the dates) the number of artists/venues decreases dramatically.

Also, there is often a need/desire to deal with the case where NO artist_tags and/or venue_tags are specified. From experience it is better to deal with this programmatically. That is, use IF statements and queries specially suited to those cases. A single SQL query CAN be written to handle it, but it is much slower than the programmatic alternative. Equally, writing similar queries several times may look messy and degrade maintainability, but the increase in complexity needed to get this into a single query is often harder to maintain.
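One way to read "deal with this programmatically" is to have the query builder leave out the joins and the HAVING term for any side that has no tags. The sketch below is only an illustration of that idea, under several assumptions: the function name `build_concert_query` is hypothetical, tags arrive as integer ids, placeholders use SQLite's `?` style, and the demo data is made up.

```python
import sqlite3

def build_concert_query(start, end, artist_tag_ids=(), venue_tag_ids=()):
    """Return (sql, params). A side with no tag ids gets no join at all."""
    joins, join_params, having, having_params = [], [], [], []
    for side, ids in (("artist", artist_tag_ids), ("venue", venue_tag_ids)):
        ids = sorted(set(ids))
        if not ids:
            continue  # no constraint on this side: skip join and HAVING term
        marks = ", ".join("?" for _ in ids)
        joins.append(
            f"INNER JOIN {side}_tag ON {side}_tag.{side}_id = concert.{side}_id"
            f" AND {side}_tag.tag_id IN ({marks})"
        )
        join_params += ids
        # every requested tag must be present (AND logic)
        having.append(f"COUNT(DISTINCT {side}_tag.tag_id) = ?")
        having_params.append(len(ids))

    sql = "SELECT concert.id FROM concert"
    if joins:
        sql += " " + " ".join(joins)
    sql += " WHERE concert.date BETWEEN ? AND ?"
    if having:
        sql += " GROUP BY concert.id HAVING " + " AND ".join(having)
    sql += " ORDER BY concert.id"
    # parameters follow the order in which the ? marks appear in the SQL text
    return sql, join_params + [start, end] + having_params

def demo_db():
    """Made-up sample data on the question's schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT,
                              venue_id INT, date TEXT);
        CREATE TABLE artist_tag (artist_id INT, tag_id INT);
        CREATE TABLE venue_tag  (venue_id INT, tag_id INT);
        INSERT INTO artist_tag VALUES (1,1),(1,2),(2,1);
        INSERT INTO venue_tag  VALUES (1,3),(1,4),(2,3);
        INSERT INTO concert VALUES (1,1,1,'2024-08-01'),(2,2,1,'2024-08-02'),
                                   (3,1,2,'2024-08-03'),(4,1,1,'2025-01-01');
    """)
    return db

if __name__ == "__main__":
    db = demo_db()
    for args in ((), ([1, 2],), ([1, 2], [3, 4])):
        sql, params = build_concert_query("2024-07-17", "2024-08-17", *args)
        print(args, db.execute(sql, params).fetchall())
```

The unconstrained case collapses to a plain date-range scan of concert, which is the kind of specially suited query the paragraph above argues for.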

EDIT

Another similar layout could be...
- Filter concert by artist as sub_query/derived_table
- Filter results by venue as sub_query/derived_table
- Join results on dimension tables to get names, etc

(Cascaded filtering)

SELECT
   <blah>
FROM
  (
    SELECT
      <blah>
    FROM
      (
        SELECT
          <blah>
        FROM
          concert
        INNER JOIN
          artist_tag
        INNER JOIN
          artist_filter
        WHERE
        GROUP BY
        HAVING
      )
    INNER JOIN
      venue_tag
    INNER JOIN
      venue_filter
    GROUP BY
    HAVING
  )
INNER JOIN
  artist
INNER JOIN
  venue

By cascading the filtering, each subsequent filter has a reduced set to work on. This MAY reduce the work done by the GROUP BY - HAVING section of the query. For two levels of filtering I would guess the gain is unlikely to be dramatic.
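The skeleton above can be filled in roughly as follows. This is one possible reading of the `<blah>` placeholders, not the answerer's code: it assumes SQLite via Python's sqlite3 module, plain tables named artist_filter and venue_filter holding the wanted tag ids, derived-table aliases `by_artist` and `by_venue` that are made up here, and invented sample rows.

```python
import sqlite3

def build_db():
    """Question's schema plus two filter tables, with made-up rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE artist  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE venue   (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT,
                              venue_id INT, date TEXT);
        CREATE TABLE artist_tag (artist_id INT, tag_id INT);
        CREATE TABLE venue_tag  (venue_id INT, tag_id INT);
        CREATE TABLE artist_filter (tag_id INT);
        CREATE TABLE venue_filter  (tag_id INT);

        INSERT INTO artist VALUES (1,'Brass Machine'),(2,'Solo Synth');
        INSERT INTO venue  VALUES (1,'The Pit'),(2,'Wine Bar');
        INSERT INTO artist_tag VALUES (1,1),(1,2),(2,1);
        INSERT INTO venue_tag  VALUES (1,3),(1,4),(2,3);
        INSERT INTO artist_filter VALUES (1),(2);  -- techno, trombone
        INSERT INTO venue_filter  VALUES (3),(4);  -- cheap-beer, great-mosh-pits
        INSERT INTO concert VALUES (1,1,1,'2024-08-01'),(2,2,1,'2024-08-02'),
                                   (3,1,2,'2024-08-03'),(4,1,1,'2025-01-01');
    """)
    return db

CASCADED = """
SELECT by_venue.concert_id, by_venue.concert_date,
       artist.name AS artist_name, venue.name AS venue_name
FROM (
    -- level 2: filter the survivors of level 1 by venue tags
    SELECT by_artist.concert_id   AS concert_id,
           by_artist.concert_date AS concert_date,
           by_artist.artist_id    AS artist_id,
           by_artist.venue_id     AS venue_id
    FROM (
        -- level 1: filter concerts in the date range by artist tags
        SELECT concert.id AS concert_id, concert.date AS concert_date,
               concert.artist_id AS artist_id, concert.venue_id AS venue_id
        FROM concert
        INNER JOIN artist_tag    ON artist_tag.artist_id = concert.artist_id
        INNER JOIN artist_filter ON artist_filter.tag_id = artist_tag.tag_id
        WHERE concert.date BETWEEN :start AND :end
        GROUP BY concert.id, concert.date, concert.artist_id, concert.venue_id
        HAVING COUNT(DISTINCT artist_filter.tag_id) = (SELECT COUNT(*) FROM artist_filter)
    ) AS by_artist
    INNER JOIN venue_tag    ON venue_tag.venue_id = by_artist.venue_id
    INNER JOIN venue_filter ON venue_filter.tag_id = venue_tag.tag_id
    GROUP BY by_artist.concert_id, by_artist.concert_date,
             by_artist.artist_id, by_artist.venue_id
    HAVING COUNT(DISTINCT venue_filter.tag_id) = (SELECT COUNT(*) FROM venue_filter)
) AS by_venue
-- level 3: join the dimension tables for names
INNER JOIN artist ON artist.id = by_venue.artist_id
INNER JOIN venue  ON venue.id  = by_venue.venue_id
ORDER BY by_venue.concert_id
"""

def find_concerts_cascaded(db, start, end):
    return db.execute(CASCADED, {"start": start, "end": end}).fetchall()

if __name__ == "__main__":
    print(find_concerts_cascaded(build_db(), "2024-07-17", "2024-08-17"))
```

Whether this beats the single-level version depends on the data and on the optimizer, so it is worth comparing both plans on the real tables.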

The original may still be more performant, as it benefits from additional filtering in a different manner. In your example:
- There may be many artists in your date range, but few which meet at least one criterion
- There may be many venues in your date range, but few which meet at least one criterion
- Before the GROUP BY, however, all concerts are eliminated where...
---> the artist(s) meets NONE of the criteria
---> AND/OR the venue meets NONE of the criteria

Where you are searching by many criteria this filtering degrades. Also where venues and/or artists share a lot of tags, the filtering also degrades.

So when would I use the original, or when would I use the Cascaded version?
- Original : Few search criteria and venues/artists are dissimilar from each other
- Cascaded : Lots of search criteria or venues/artists tend to be similar

茶底世界 2024-07-24 11:55:52

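Denormalize the model. Include the tag names in the venue and artist tables. That way you avoid the many-to-many relationships and end up with a simple star schema.

With this denormalization applied, the WHERE clause only has to check the additional tag_name field in the two tables (artist and venue).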
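The answer does not say how the tag names would be stored, so the following is only one guess at what it could look like: a hypothetical `tag_names` column holding a comma-delimited list wrapped in commas, matched with LIKE. SQLite via Python's sqlite3 module and the sample rows are assumptions made for the sake of a runnable sketch.

```python
import sqlite3

def build_denormalized_db():
    """artist and venue each carry a delimited tag_names column (an assumption)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE artist  (id INTEGER PRIMARY KEY, name TEXT, tag_names TEXT);
        CREATE TABLE venue   (id INTEGER PRIMARY KEY, name TEXT, tag_names TEXT);
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT,
                              venue_id INT, date TEXT);
        INSERT INTO artist VALUES (1,'Brass Machine',',techno,trombone,'),
                                  (2,'Solo Synth',',techno,');
        INSERT INTO venue  VALUES (1,'The Pit',',cheap-beer,great-mosh-pits,'),
                                  (2,'Wine Bar',',cheap-beer,');
        INSERT INTO concert VALUES (1,1,1,'2024-08-01'),(2,2,1,'2024-08-02'),
                                   (3,1,2,'2024-08-03');
    """)
    return db

def find_concerts(db, artist_tags, venue_tags, start, end):
    """Every listed tag must appear in the matching tag_names column."""
    clauses, params = ["concert.date BETWEEN ? AND ?"], [start, end]
    for column, tags in (("artist.tag_names", artist_tags),
                         ("venue.tag_names", venue_tags)):
        for tag in tags:
            # the surrounding commas stop 'techno' from matching 'technopop';
            # tags containing % or _ would need escaping (not handled here)
            clauses.append(f"{column} LIKE ?")
            params.append(f"%,{tag},%")
    sql = (
        "SELECT concert.id, artist.name, venue.name FROM concert "
        "INNER JOIN artist ON artist.id = concert.artist_id "
        "INNER JOIN venue ON venue.id = concert.venue_id "
        "WHERE " + " AND ".join(clauses) + " ORDER BY concert.id"
    )
    return db.execute(sql, params).fetchall()

if __name__ == "__main__":
    print(find_concerts(build_denormalized_db(),
                        ["techno", "trombone"],
                        ["cheap-beer", "great-mosh-pits"],
                        "2024-07-17", "2024-08-17"))
```

The query itself has no subqueries or GROUP BY, which is the appeal. The trade-off is that a LIKE pattern with a leading wildcard generally cannot use an ordinary index, and the tag lists have to be kept in sync whenever tags change.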

鲜血染红嫁衣 2024-07-24 11:55:52

This situation is not technically multiple fact tables. You have a many to many relationship between venues & tags as well as artists & tags.

I think MatBailie provides some interesting examples above, but I feel this can be much simpler if you handle the parameters in your application in a helpful way.

Apart from the user generated query on the fact table, you need two static queries to provide the parameter options to the user in the first place. One of them is a list of Venue-appropriate tags, the other is for Artist-appropriate tags.

Venue appropriate tags:

SELECT DISTINCT tag_id, tag.name as VenueTagName
FROM venue_tag 
INNER JOIN tag 
ON venue_tag.tag_id = tag.id

Artist appropriate tags:

SELECT DISTINCT tag_id, tag.name as ArtistTagName
FROM artist_tag 
INNER JOIN tag 
ON artist_tag.tag_id = tag.id

Those two queries drive some drop-down or other parameter-selection controls. In a reporting system, you should try and avoid passing string variables around. In your application, you present the string name of the variable to the user, but pass the integer ID back to the database.

e.g. When the user chooses the tags, you take the tag.id values and provide them to your query (where I have the (1,2) and the (100,200) bit below) :

SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  artist.id AS artist_id,
  artist.name AS artist_name,
  venue.id AS venue_id,
  venue.name AS venue_name
FROM
  concert
INNER JOIN artist
    ON artist.id = concert.artist_id
INNER JOIN artist_tag
    ON artist.id = artist_tag.artist_id
INNER JOIN venue
    ON venue.id = concert.venue_id
INNER JOIN venue_tag
    ON venue.id = venue_tag.venue_id
WHERE venue_tag.tag_id IN (1, 2)      -- assumes IDs 1 and 2 map to "cheap-beer" and "great-mosh-pits"
AND   artist_tag.tag_id IN (100, 200) -- assumes IDs 100 and 200 map to "techno" and "trombone"; sounds like a wild night of drunken moshing to brass band techno!
AND   concert.date BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)
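As a rough illustration of passing integer IDs back to the database, the sketch below binds the selected tag IDs as query parameters. It assumes SQLite via Python's sqlite3 module, a hypothetical helper named `concerts_for_tag_ids`, and made-up sample rows; the date range is passed in rather than computed with NOW(). Two things are worth knowing about this query shape: `IN` matches artists and venues that carry at least one of the listed tags rather than all of them, and without DISTINCT a concert comes back once per matching tag pair.

```python
import sqlite3

def build_db():
    """Question's schema with made-up rows; tag ids follow the answer's example."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE artist  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE venue   (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tag     (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT,
                              venue_id INT, date TEXT);
        CREATE TABLE artist_tag (artist_id INT, tag_id INT);
        CREATE TABLE venue_tag  (venue_id INT, tag_id INT);

        INSERT INTO tag VALUES (1,'cheap-beer'),(2,'great-mosh-pits'),
                               (100,'techno'),(200,'trombone');
        INSERT INTO artist VALUES (1,'Brass Machine'),(2,'Solo Synth'),
                                  (3,'Untagged Act');
        INSERT INTO venue  VALUES (1,'The Pit'),(2,'Wine Bar');
        INSERT INTO artist_tag VALUES (1,100),(1,200),(2,100);
        INSERT INTO venue_tag  VALUES (1,1),(1,2),(2,1);
        INSERT INTO concert VALUES (1,1,1,'2024-08-01'),(2,2,2,'2024-08-02'),
                                   (3,3,1,'2024-08-03');
    """)
    return db

def concerts_for_tag_ids(db, venue_tag_ids, artist_tag_ids, start, end):
    """Bind the IDs chosen in the UI as parameters instead of tag-name strings."""
    if not venue_tag_ids or not artist_tag_ids:
        raise ValueError("both ID lists must be non-empty for this query shape")
    venue_marks = ", ".join("?" for _ in venue_tag_ids)
    artist_marks = ", ".join("?" for _ in artist_tag_ids)
    sql = f"""
        SELECT DISTINCT concert.id, artist.name, venue.name
        FROM concert
        INNER JOIN artist     ON artist.id = concert.artist_id
        INNER JOIN artist_tag ON artist_tag.artist_id = artist.id
        INNER JOIN venue      ON venue.id = concert.venue_id
        INNER JOIN venue_tag  ON venue_tag.venue_id = venue.id
        WHERE venue_tag.tag_id IN ({venue_marks})
          AND artist_tag.tag_id IN ({artist_marks})
          AND concert.date BETWEEN ? AND ?
        ORDER BY concert.id
    """
    params = [*venue_tag_ids, *artist_tag_ids, start, end]
    return db.execute(sql, params).fetchall()

if __name__ == "__main__":
    print(concerts_for_tag_ids(build_db(), [1, 2], [100, 200],
                               "2024-07-17", "2024-08-17"))
```

In the sample data, concert 2 is returned even though its artist and venue each carry only one of the two requested tags. If all tags must match, a count check such as `HAVING COUNT(DISTINCT ...)` per side is one way to add that.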
