Normalization/optimization structure for large MySQL tables

Posted on 2024-12-14 11:26:14

I have a site with a bunch of users, and a bunch of "nodes" (content). Each node can be downloaded, and besides the particular node id in question, each download has a "license" associated with it (so a user can download node 5 for 'commercial use' or for 'personal use', etc.), as well as a price for each license.

My goal is to keep track of downloads in such a way that allows me to:

  • Get the number of downloads for a given node id and license id over a given time period (how many times has node 5 been downloaded in the last month for 'commercial use'?).
  • Get the total number of downloads for a given node id and license id.
  • Get the number of downloads for a given node_id regardless of license (all downloads for 'commercial use' and 'personal use' combined).
  • Get the node ids (and corresponding license ids) that have been downloaded by a given user and that meet a given price criterion (i.e. price = 0, or price > 0).

Trivial data to store if optimization doesn't matter, but my issue is one of normalization/optimization for tables that may easily grow to millions of rows. Specifically, assume that:

  • Number of downloads is in the tens of millions.
  • Number of nodes is in the hundreds of thousands.
  • Number of users is in the tens of thousands.

I'm fairly new to any "real" MySQL work, so I appreciate your help, and pointing out where I'm being stupid. Here's what I've got so far:

all_downloads table

   +-------------+---------+------------+---------+-----------+-------+
   | download_id | node_id | license_id | user_id | timestamp | price |
   +-------------+---------+------------+---------+-----------+-------+

download_id is a unique key for this table. This table is a problem, because it could potentially have tens of millions of rows.

downloads_counted table

Instead of adding up the total number of downloads for a given node and license by querying the all_downloads table, the downloads are counted during cron run, and those numbers are stored separately in a downloads_counted table:

   +---------------------------------------------------------------------------+
   | node_id | license_id | downloads_total | downloads_month | downloads_week |  
   +---------------------------------------------------------------------------+

The license id situation is new (formerly there was only one license, so licenses were not tracked in the database), so that's something I'm just trying to figure out how to work with now. In the past, node_id was a unique key for this table. I'm assuming that what I should do now is make the combination of node_id and license_id into a unique primary key. Or is it just as well to leave node_id as the only key for this table, and grab all rows for a given node_id, then parse the results in PHP (separating or combining downloads for each particular license)? Is it within best practice to have a table with no unique key?

In any case, I think this table is mostly okay, as it shouldn't grow to more than 1 or 2 million rows.

The question of returning downloads for a given user

This is the main area where I need help. I have considered just making the user_id a key in the all_downloads table, and simply querying for all rows that contain a given user_id. But I am concerned about querying this table in the long run, as it will be very large from the start, and could easily grow to tens of millions of rows.

I have considered creating a user_downloads table that would look something like this:

   +---------------------+
   | user_id | downloads | 
   +---------------------+

Where downloads would be a serialized array of node_ids and associated license ids and prices like so (5 is the node_id and would be the index within the top-level array of node_ids):

$downloads = array('5' => array('license' => array('personal', 'commercial'), 'price' => 25));

I realize storing arrays of data in a single cell is considered bad practice, and I'm not sure that it would improve performance, since the array of downloads could easily grow into the thousands for a given user. However, I'm not sure how to create another table structure that would be more efficient than my all_downloads table at getting the downloads for a given user.

Any and all help is much appreciated!

====================================

Followup questions to Bill Karwin's answer:

  • timestamp is unfortunately going to be a unix timestamp stored in an int(11), rather than a datetime (to conform to Drupal standards). I assume that doesn't really change anything from an optimization standpoint?

  • node_id/license_id/user_id (your idea for a clustered primary key) is not guaranteed to be unique, because users are allowed to download the same node under the same license as many times as they want. This was my primary reason for having a unique download_id for each row... is there a special reason that having a download_id would hurt performance? Or would it be acceptable to make the primary key a cluster of download_id/node_id/license_id/user_id? Or will having the download_id as the first part of the compound key throw off its usefulness?

  • Do you think it still makes sense to have a downloads_counted table, or would that be considered redundant? My thinking is that it would still help performance, since download counts (downloads total, this week, this month, etc.) are going to be showing up very frequently on the site, and the downloads_counted table would have one or two orders of magnitude fewer rows than the all_downloads table.

My idea for the downloads_counted table:

CREATE TABLE downloads_counted (   
 node_id          INT UNSIGNED NOT NULL,   
 license_id       INT UNSIGNED NOT NULL, 
 downloads_total  INT UNSIGNED NOT NULL,  
 downloads_month  INT UNSIGNED NOT NULL,   
 downloads_week   INT UNSIGNED NOT NULL,     
 downloads_day    INT UNSIGNED NOT NULL,  
 PRIMARY KEY (node_id, license_id), 
 KEY (node_id)
) ENGINE=InnoDB;

The secondary key on node_id is for getting all downloads for all licenses for a given node_id... is this key redundant, though, if node_id is already the first part of the compound primary key?

Comments (1)

万劫不复 2024-12-21 11:26:14

Here's how I would design the table:

CREATE TABLE all_downloads (
  node_id    INT UNSIGNED NOT NULL,
  license_id INT UNSIGNED NOT NULL,
  user_id    INT UNSIGNED NOT NULL,
  timestamp  DATETIME NOT NULL,
  price      NUMERIC (9,2),
  PRIMARY KEY (node_id,license_id,user_id),
  KEY (price)
) ENGINE=InnoDB;

Notice I omitted the download_id.

Now you can run the queries you need to:

  • Get the number of downloads for a given node id and license id over a given time period (how many times has node 5 been downloaded in the last month for 'commercial use'?).

    SELECT COUNT(*) FROM all_downloads WHERE (node_id,license_id) = (123,456)
    AND timestamp > NOW() - INTERVAL 30 DAY;
    

    This should make good use of the clustered primary index, reducing the set of rows examined until the timestamp comparison only applies to a small subset.

  • Get the total number of downloads for a given node id and license id.

    SELECT COUNT(*) FROM all_downloads WHERE (node_id,license_id) = (123,456);
    

    Like the above, this makes use of the clustered primary index. Counting is accomplished by an index scan.

  • Get the number of downloads for a given node_id regardless of license (all downloads for 'commercial use' and 'personal use' combined).

    SELECT COUNT(*) FROM all_downloads WHERE (node_id) = (123);
    

    Ditto.

  • Get the node ids (and corresponding license ids) that have been downloaded by a given user and that meet a given price criterion (i.e. price = 0, or price > 0).

    SELECT node_id, license_id FROM all_downloads WHERE price = 0 AND user_id = 789;
    

    This reduces the rows examined by using the secondary index on price. Then you take advantage of the fact that secondary indexes in InnoDB implicitly contain the columns of the primary key, so you don't even need to read the base data. This is called a covering index or an index-only query.
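
Because price only separates free downloads from paid ones, an index on price alone may still match a large share of the table. One possible variation, which is not part of the schema above, is a composite secondary index that leads with user_id. The index name idx_user_price is made up for this sketch:

```sql
-- Hypothetical index name; the columns come from the all_downloads table above.
ALTER TABLE all_downloads ADD KEY idx_user_price (user_id, price);

-- Paid downloads for one user. In InnoDB the secondary index also carries the
-- primary key columns (node_id, license_id), so this query can be answered
-- from the index alone.
SELECT node_id, license_id
FROM all_downloads
WHERE user_id = 789 AND price > 0;

-- EXPLAIN shows which index the optimizer actually picks; "Using index" in the
-- Extra column indicates an index-only read.
EXPLAIN SELECT node_id, license_id
FROM all_downloads
WHERE user_id = 789 AND price > 0;
```

With tens of thousands of users, user_id narrows the candidate rows far more than price does, which is why it goes first in this sketch.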

As for your other questions:


timestamp ... doesn't really change anything from an optimization standpoint?

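I prefer DATETIME over an integer timestamp mainly because it is readable as stored and works directly with MySQL's date functions and INTERVAL arithmetic. Keep in mind that DATETIME does not carry time zone information, so store the values in one consistent zone such as UTC. Either type can be indexed and range-scanned, so the choice changes little from an optimization standpoint. You can always convert a datetime to a UNIX timestamp integer in a query result, using the UNIX_TIMESTAMP() function.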
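
If the column has to stay an integer UNIX timestamp, as in the follow-up question, the range condition can be written against an integer instead. A minimal sketch, assuming the column is still named timestamp and is declared as an INT:

```sql
-- Assumes `timestamp` is an INT holding seconds since the epoch.
-- The cutoff is computed once and compared against the bare column,
-- rather than wrapping the column itself in a function.
SELECT COUNT(*)
FROM all_downloads
WHERE (node_id, license_id) = (123, 456)
  AND `timestamp` > UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY);

-- Convert back to a readable value only in the select list.
SELECT node_id, license_id, FROM_UNIXTIME(`timestamp`) AS downloaded_at
FROM all_downloads
WHERE user_id = 789;
```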

would it be acceptable to make the primary key a cluster of download_id/node_id/license_id/user_id? Or will having the download_id as the first part of the compound key throw off its usefulness?

The benefit of a clustered key is that the rows are stored in order of the index. So if you query based on node_id frequently, there's a performance advantage to putting that first in the compound clustered index. I.e. if you are interested in the set of rows for a given node_id, it's a benefit that they're stored together because you defined the clustered index that way.
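
Since the same user may download the same node under the same license more than once, (node_id, license_id, user_id) alone cannot serve as the primary key. One possible way to keep the clustering while restoring uniqueness, offered here as an assumption rather than as part of the design above, is to append download_id as the last key column:

```sql
-- Sketch only: download_id is kept from the original question and placed last,
-- so rows are still clustered by node_id first.
CREATE TABLE all_downloads (
  download_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  node_id     INT UNSIGNED NOT NULL,
  license_id  INT UNSIGNED NOT NULL,
  user_id     INT UNSIGNED NOT NULL,
  timestamp   DATETIME NOT NULL,
  price       NUMERIC(9,2),
  PRIMARY KEY (node_id, license_id, user_id, download_id),
  -- InnoDB requires an AUTO_INCREMENT column to be the first column of some index.
  KEY (download_id),
  KEY (price)
) ENGINE=InnoDB;
```

Putting download_id first would make the key unique just as well, but the rows would then be stored in insertion order rather than grouped by node_id.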

Do you think it still makes sense to have a downloads_counted table, or would that be considered redundant?

Sure, storing aggregate results in a table is a common way to avoid recounting frequently needed totals on every request. But do so judiciously, because it takes some work to keep these totals in sync with the real data. The benefit is greater if you need to read the pre-calculated totals frequently, and multiple times for each time they are updated. Make sure you treat the aggregated totals as less authoritative than the real download data, and have a plan for re-generating the totals when they get out of sync.
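
As one possible plan for regenerating the totals, the summary table can be rebuilt from the download data in a single statement. This is a sketch that assumes the downloads_counted definition from the question and a DATETIME timestamp column; the 30, 7 and 1 day windows are illustrative choices:

```sql
-- REPLACE overwrites the existing row for each (node_id, license_id) primary key.
REPLACE INTO downloads_counted
  (node_id, license_id, downloads_total,
   downloads_month, downloads_week, downloads_day)
SELECT
  node_id,
  license_id,
  COUNT(*),
  -- A comparison evaluates to 1 or 0 in MySQL, so SUM counts the matching rows.
  SUM(timestamp > NOW() - INTERVAL 30 DAY),
  SUM(timestamp > NOW() - INTERVAL 7 DAY),
  SUM(timestamp > NOW() - INTERVAL 1 DAY)
FROM all_downloads
GROUP BY node_id, license_id;
```

Rows in downloads_counted whose node and license no longer appear in all_downloads are left untouched by this statement.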

Some people also put these aggregates into memcached keys instead of in a table, for even faster lookups. If the volatile data in memcached is lost for some reason, you can re-populate it from the download data.

 PRIMARY KEY (node_id, license_id), 
 KEY (node_id)
) ENGINE=InnoDB;

is this key redundant, though, if node_id is already the first part of the compound primary key?

Yes. MySQL allows you to create redundant indexes, and this is an example of one. Any query that could use the secondary key on node_id could just as easily use the leftmost prefix of the primary key, so the extra index mostly adds storage and write overhead.

You can use pt-duplicate-key-checker to analyze a database for redundant indexes.
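
For a quick manual check on a single table, the statements below assume the downloads_counted definition quoted above. An unnamed KEY (node_id) gets the index name node_id by default:

```sql
-- List the indexes; both PRIMARY and node_id start with the node_id column.
SHOW INDEX FROM downloads_counted;

-- Drop the redundant secondary index.
ALTER TABLE downloads_counted DROP INDEX node_id;

-- Lookups by node_id alone can still use the leftmost prefix of the primary key.
EXPLAIN SELECT license_id, downloads_total
FROM downloads_counted
WHERE node_id = 123;
```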
