MySQL ON DUPLICATE KEY UPDATE,唯一键中的列可为空

发布于 2024-08-02 10:31:55 字数 2843 浏览 11 评论 0原文

我们的 MySQL Web 分析数据库包含一个汇总表,该表会随着新活动的导入而全天更新。我们使用 ON DUPLICATE KEY UPDATE 以便汇总覆盖之前的计算,但遇到困难,因为汇总表的 UNIQUE KEY 中的列之一是可选的 FK,并且包含 NULL 值。

这些 NULL 旨在表示“不存在,并且所有此类情况都是等效的”。当然,MySQL 通常将 NULL 视为“未知,并且所有此类情况都不等效”。

基本结构如下:

“活动”表,其中包含每个会话的一个条目,每个会话都属于一个活动,并且某些条目具有可选的过滤器和事务 ID。

CREATE TABLE `Activity` (
    `session_id` INTEGER AUTO_INCREMENT
    , `campaign_id` INTEGER NOT NULL
    , `filter_id` INTEGER DEFAULT NULL
    , `transaction_id` INTEGER DEFAULT NULL
    , PRIMARY KEY (`session_id`)
);

“摘要”表,其中包含活动表中会话总数的每日汇总以及包含事务 ID 的会话总数。这些摘要是分开的,每一个活动和(可选)过滤器的组合都有一个摘要。这是一个使用 MyISAM 的非事务表。

CREATE TABLE `Summary` (
    `day` DATE NOT NULL
    , `campaign_id` INTEGER NOT NULL
    , `filter_id` INTEGER DEFAULT NULL
    , `sessions` INTEGER UNSIGNED DEFAULT NULL
    , `transactions` INTEGER UNSIGNED DEFAULT NULL
    , UNIQUE KEY (`day`, `campaign_id`, `filter_id`)
) ENGINE=MyISAM;

实际的汇总查询如下所示,计算会话和交易的数量,然后按活动和(可选)过滤器进行分组。

INSERT INTO `Summary` 
    (`day`, `campaign_id`, `filter_id`, `sessions`, `transactions`)
    SELECT `day`, `campaign_id`, `filter_id
        , COUNT(`session_id`) AS `sessions`
        , COUNT(`transaction_id` IS NOT NULL) AS `transactions`
    FROM Activity
    GROUP BY `day`, `campaign_id`, `filter_id`
ON DUPLICATE KEY UPDATE
    `sessions` = VALUES(`sessions`)
    , `transactions` = VALUES(`transactions`)
;

除了 filter_id 为 NULL 的情况摘要之外,一切都运行良好。在这些情况下,ON DUPLICATE KEY UPDATE 子句与现有行不匹配,并且每次都会写入新行。这是因为“NULL != NULL”。然而,在比较唯一键时,我们需要的是“NULL = NULL”。

我正在寻找解决方法的想法或对我们迄今为止提出的想法的反馈。到目前为止我们想到的解决方法如下。

  1. 在运行摘要之前删除所有包含 NULL 键值的摘要条目。 (这就是我们现在正在做的事情) 如果在汇总过程中执行查询,这会产生负面影响,即返回缺少数据的结果。

  2. 将 DEFAULT NULL 列更改为 DEFAULT 0,这样可以使 UNIQUE KEY 一致匹配。 这具有负面影响,即使针对汇总表的查询的开发过于复杂。它迫使我们使用大量“CASE filter_id = 0 THEN NULL ELSE filter_id END”,并且由于所有其他表的 filter_id 都有实际的 NULL,因此会导致连接不便。

  3. 创建一个返回“CASE filter_id = 0 THEN NULL ELSE filter_id END”的视图,并使用该视图而不是直接使用表。 摘要表包含几十万行,我被告知视图性能非常差。

  4. 允许创建重复条目,并在汇总完成后删除旧条目。 与提前删除它们有类似的问题。

  5. 添加一个包含 0 代表 NULL 的代理列,并在 UNIQUE KEY 中使用该代理(实际上,如果所有列都不为 NULL,我们可以使用 PRIMARY KEY)。
    这个解决方案看起来很合理,只不过上面的例子只是一个例子;实际数据库包含六个汇总表,其中一个在 UNIQUE KEY 中包含四个可为空的列。有些人担心开销太大。

您是否有更好的解决方法、表结构、更新过程或 MySQL 最佳实践可以提供帮助?

编辑:为了澄清“null 的含义”,

包含 NULL 列的汇总行中的数据仅在汇总报告中作为单个“包罗万象”行的意义上才被视为属于一起,汇总了该数据的那些项目点不存在或未知。因此,在汇总表本身的上下文中,含义是“未知值的那些条目的总和”。另一方面,在关系表中,这些确实是 NULL 结果。

将它们放入汇总表上的唯一键中的唯一原因是允许在重新计算汇总报告时自动更新(通过重复键更新)。

也许更好的描述方法是通过具体示例,其中一个汇总表按受访者给出的公司地址的邮政编码前缀按地理位置对结果进行分组。并非所有受访者都提供营业地址,因此交易和地址表之间的关系完全正确为 NULL。在此数据的汇总表中,为每个邮政编码前缀生成一行,包含该区域内数据的汇总。生成附加行以显示没有已知邮政编码前缀的数据摘要。

将其余数据表更改为具有显式“THERE_IS_NO_ZIP_CODE”0 值,并在 ZipCodePrefix 表中放置表示该值的特殊记录是不正确的 - 该关系确实为 NULL。

Our MySQL web analytics database contains a summary table which is updated throughout the day as new activity is imported. We use ON DUPLICATE KEY UPDATE in order that the summarization overwrites earlier calculations, but are having difficulty because one of the columns in the summary table's UNIQUE KEY is an optional FK, and contains NULL values.

These NULLs are intended to mean "not present, and all such cases are equivalent". Of course, MySQL usually treats NULLs as meaning "unknown, and all such cases are not equivalent".

Basic structure is as follows:

An "Activity" table containing an entry for each session, each belonging to a campaign, with optional filter and transaction IDs for some entries.

CREATE TABLE `Activity` (
    `session_id` INTEGER AUTO_INCREMENT
    , `campaign_id` INTEGER NOT NULL
    , `filter_id` INTEGER DEFAULT NULL
    , `transaction_id` INTEGER DEFAULT NULL
    , PRIMARY KEY (`session_id`)
);

A "Summary" table containing daily rollups of total number of sessions in activity table, an d the total number of those sessions which contain a transaction ID. These summaries are split up, with one for every combination of campaign and (optional) filter. This is a non-transactional table using MyISAM.

CREATE TABLE `Summary` (
    `day` DATE NOT NULL
    , `campaign_id` INTEGER NOT NULL
    , `filter_id` INTEGER DEFAULT NULL
    , `sessions` INTEGER UNSIGNED DEFAULT NULL
    , `transactions` INTEGER UNSIGNED DEFAULT NULL
    , UNIQUE KEY (`day`, `campaign_id`, `filter_id`)
) ENGINE=MyISAM;

The actual summarization query is something like the following, counting up the number of sessions and transactions, then grouping by campaign and (optional) filter.

INSERT INTO `Summary` 
    (`day`, `campaign_id`, `filter_id`, `sessions`, `transactions`)
    SELECT `day`, `campaign_id`, `filter_id
        , COUNT(`session_id`) AS `sessions`
        , COUNT(`transaction_id` IS NOT NULL) AS `transactions`
    FROM Activity
    GROUP BY `day`, `campaign_id`, `filter_id`
ON DUPLICATE KEY UPDATE
    `sessions` = VALUES(`sessions`)
    , `transactions` = VALUES(`transactions`)
;

Everything works great, except for the summary of cases where the filter_id is NULL. In these cases, the ON DUPLICATE KEY UPDATE clause does not match the existing row, and a new row is written every time. This is due to the fact that "NULL != NULL". What we need, however, is "NULL = NULL" when comparing the unique keys.

I am looking for ideas for workarounds or feedback on those we have come up with so far. Workarounds we have thought of so far follow.

  1. Delete all summary entries containing a NULL key value prior to running the summarization. (This is what we are doing now)
    This has the negative side effect of returning results with missing data if a query is executed during the summarization process.

  2. Change the DEFAULT NULL column to DEFAULT 0, which allows the UNIQUE KEY to be matched consistently.
    This has the negative side effect of overly complicating the development of queries against the summary table. It forces us to use a lot of "CASE filter_id = 0 THEN NULL ELSE filter_id END", and makes for awkward joining since all of the other tables have actual NULLs for the filter_id.

  3. Create a view which returns "CASE filter_id = 0 THEN NULL ELSE filter_id END", and using this view instead of the table directly.
    The summary table contains a few hundred thousand rows, and I've been told view performance is quite poor.

  4. Allow the duplicate entries to be created, and delete the old entries after summarization completes.
    Has similar problems to deleting them ahead of time.

  5. Add a surrogate column which contains 0 for NULL, and use that surrogate in the UNIQUE KEY (actually we could use PRIMARY KEY if all columns are NOT NULL).
    This solution seems reasonable, except that the example above is only an example; the actual database contains half a dozen summary tables, one of which contains four nullable columns in the UNIQUE KEY. There is concern by some that the overhead is too much.

Do you have a better workaround, table structure, update process or MySQL best practice which can help?

EDIT: To clarify the "meaning of null"

The data in the summary rows containing NULL columns are considered to belong together only in the sense that of being a single "catch-all" row in summary reports, summarizing those items for which that data point does not exist or is unknown. So within the context of the summary table itself, the meaning is "the sum of those entries for which no value is known". Within the relational tables, on the other hand, these truly are NULL results.

The only reason for putting them into a unique key on the summary table is to allow for automatic update (by ON DUPLICATE KEY UPDATE) when re-calculating the summary reports.

Maybe a better way to describe it is by the specific example that one of the summary tables groups results geographically by the zip code prefix of the business address given by the respondent. Not all respondents provide a business address, so the relationship between the transaction and addresses table is quite correctly NULL. In the summary table for this data, a row is generated for each zip code prefix, containing the summary of data within that area. An additional row is generated to show the summary of data for which no zip code prefix is known.

Altering the rest of the data tables to have an explicit "THERE_IS_NO_ZIP_CODE" 0-value, and placing a special record in the ZipCodePrefix table representing this value, is improper--that relationship truly is NULL.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

伏妖词 2024-08-09 10:31:55

我认为类似(2)的东西确实是最好的选择——或者,至少,如果你从头开始的话,它会是最好的选择。在SQL中,NULL表示未知。如果你想要一些其他的含义,你真的应该使用一个特殊的值,而 0 肯定是一个不错的选择。

您应该在整个数据库中执行此操作,而不仅仅是这个表。那么你就不应该遇到奇怪的特殊情况。事实上,您应该能够删除当前的许多行(例如:当前,如果您想要没有过滤器的摘要行,则有特殊情况“过滤器为空”,而不是正常情况“filter = ?”。)

您还应该继续在引用的表中创建一个“不存在”条目,以保持 FK 约束有效(并避免特殊情况)。

PS:没有主键的表不是关系表,应该避免。

编辑 1

嗯,在这种情况下,您实际上需要重复密钥更新吗?如果您正在执行 INSERT ... SELECT,那么您可能会这样做。但是,如果您的应用程序提供数据,则只需手动执行即可 - 执行更新(将 zip = null 映射到 zip is null),检查更改了多少行( MySQL 返回此),如果 0 则执行插入。

I think something along the lines of (2) is really the best bet — or, at least, it would be if you were starting from scratch. In SQL, NULL means unknown. If you want some other meaning, you really ought to use a special value for that, and 0 is certainly an OK choice.

You should do this across the entire database, not just this one table. Then you shouldn't wind up with weird special cases. In fact, you should be able to get rid of a lot of your current ones (example: currently, if you want the summary row where there is no filter, you have the special case "filter is null" as opposed to the normal case "filter = ?".)

You should also go ahead and create a "not present" entry in the referred-to table as well, to keep the FK constraint valid (and avoid special cases).

PS: Tables w/o a primary key are not relational tables and should really be avoided.

edit 1

Hmmm, in that case, do you actually need the on duplicate key update? If you're doing a INSERT ... SELECT, then you probably do. But if your app is supplying the data, just do it by hand — do the update (mapping zip = null to zip is null), check how many rows were changed (MySQL returns this), if 0 do an insert.

那一片橙海, 2024-08-09 10:31:55

使用现代版本的 MariaDB(以前的 MySQL),如果您使用代理列路由#5,则可以通过重复键更新语句上的插入来简单地完成更新插入。添加 MySQL 生成的存储列或 MariaDB 持久虚拟列以对可为空字段应用唯一性约束,间接将无意义数据排除在数据库之外,以换取一些膨胀。

例如,

CREATE TABLE IF NOT EXISTS bar (
    id INT PRIMARY KEY AUTO_INCREMENT,
    datebin DATE NOT NULL,
    baz1_id INT DEFAULT NULL,
    vbaz1_id INT AS (COALESCE(baz1_id, -1)) STORED,
    baz2_id INT DEFAULT NULL,
    vbaz2_id INT AS (COALESCE(baz2_id, -1)) STORED,
    blam DOUBLE NOT NULL,
    UNIQUE(datebin, vbaz1_id, vbaz2_id)
);

INSERT INTO bar (datebin, baz1_id, baz2_id, blam)
    VALUES ('2016-06-01', null, null, 777)
ON DUPLICATE KEY UPDATE
    blam = VALUES(blam);

对于 MariaDB,将 STORED 替换为 PERSISTENT,索引需要持久性。

MySQL 生成的列
MariaDB 虚拟列

With modern versions of MariaDB (formerly MySQL), upserts can be done simply with insert on duplicate key update statements if you go with surrogate column route #5. Adding MySQL's generated stored columns or MariaDB persistent virtual columns to apply the uniqueness constraint on the nullable fields indirectly keeps nonsense data out of the database in exchange for some bloat.

e.g.

CREATE TABLE IF NOT EXISTS bar (
    id INT PRIMARY KEY AUTO_INCREMENT,
    datebin DATE NOT NULL,
    baz1_id INT DEFAULT NULL,
    vbaz1_id INT AS (COALESCE(baz1_id, -1)) STORED,
    baz2_id INT DEFAULT NULL,
    vbaz2_id INT AS (COALESCE(baz2_id, -1)) STORED,
    blam DOUBLE NOT NULL,
    UNIQUE(datebin, vbaz1_id, vbaz2_id)
);

INSERT INTO bar (datebin, baz1_id, baz2_id, blam)
    VALUES ('2016-06-01', null, null, 777)
ON DUPLICATE KEY UPDATE
    blam = VALUES(blam);

For MariaDB replace STORED with PERSISTENT, indexes require persistence.

MySQL Generated Columns
MariaDB Virtual Columns

故事与诗 2024-08-09 10:31:55

将 DEFAULT NULL 列更改为 DEFAULT 0,这样可以使 UNIQUE KEY 一致匹配。这具有负面影响,即使针对汇总表的查询的开发过于复杂。它迫使我们使用大量的“CASE filter_id = 0 THEN NULL ELSE filter_id END”,并且由于所有其他表的 filter_id 都具有实际的 NULL,因此导致了尴尬的连接。

创建一个返回“CASE filter_id = 0 THEN NULL ELSE filter_id END”的视图,并使用该视图而不是直接使用表。摘要表包含几十万行,并且我被告知视图性能非常差。

MySQL 5.x 中的视图性能会很好,因为视图除了用空值替换零之外什么也不做。除非您在视图中使用聚合/排序,否则针对视图的大多数查询都将由查询优化器重写以仅命中基础表。

当然,由于它是 FK,因此您必须在引用的表中创建一个 id 为零的条目。

Change the DEFAULT NULL column to DEFAULT 0, which allows the UNIQUE KEY to be matched consistently. This has the negative side effect of overly complicating the development of queries against the summary table. It forces us to use a lot of "CASE filter_id = 0 THEN NULL ELSE filter_id END", and makes for awkward joining since all of the other tables have actual NULLs for the filter_id.

Create a view which returns "CASE filter_id = 0 THEN NULL ELSE filter_id END", and using this view instead of the table directly. The summary table contains a few hundred thousand rows, and I've been told view performance is quite poor.

View performance in MySQL 5.x will be fine, as the view does nothing but replace a zero with a null. Unless you use aggregates/sorts in a view, most any query against the view will be re-written by the query optimizer to just hit the underlying table.

And of course, since it's an FK, you'll have to create an entry in the referred-to table with an id of zero.

云归处 2024-08-09 10:31:55

我迟到了十多年,但我觉得我的解决方案应该是这里的答案,因为我遇到了完全相同的问题,并且这对我有用。如果您知道需要更新哪些内容,则可以在现有汇总查询之前手动更新它们,然后忽略现有查询中 filter_id 为 null 的所有情况,这样它就不会再次作为记录插入。

对于你的例子:

UPDATE `Summary` s
    LEFT JOIN `Activity` a
    ON s.`campaign_id` = a.`campaign_id`
SET s.`sessions`     = a.COUNT(`session_id`)                ,
SET s.`transactions` = a.COUNT(`transaction_id` IS NOT NULL)
WHERE s.`day`         = a.`day`
AND   s.`campaign_id` = a.`campaign_id`
AND   s.`filter_id` IS NULL
AND   a.`filter_id` IS NULL;

INSERT INTO `Summary` 
    (`day`, `campaign_id`, `filter_id`, `sessions`, `transactions`)
    SELECT `day`, `campaign_id`, `filter_id`
        , COUNT(`session_id`) AS `sessions`
        , COUNT(`transaction_id` IS NOT NULL) AS `transactions`
    FROM Activity
    WHERE `filter_id` IS NOT NULL
    GROUP BY `day`, `campaign_id`, `filter_id`
ON DUPLICATE KEY UPDATE
    `sessions` = VALUES(`sessions`)
    , `transactions` = VALUES(`transactions`);

I'm more than a decade late, but I feel my solution should be an answer on here as I had this exact same problem, and this worked for me. If you know what's got to be updated, you can update them manually just before your existing summarization query, then ignore all cases where filter_id is null in your existing query so it won't get inserted as a record again.

For your example:

UPDATE `Summary` s
    LEFT JOIN `Activity` a
    ON s.`campaign_id` = a.`campaign_id`
SET s.`sessions`     = a.COUNT(`session_id`)                ,
SET s.`transactions` = a.COUNT(`transaction_id` IS NOT NULL)
WHERE s.`day`         = a.`day`
AND   s.`campaign_id` = a.`campaign_id`
AND   s.`filter_id` IS NULL
AND   a.`filter_id` IS NULL;

INSERT INTO `Summary` 
    (`day`, `campaign_id`, `filter_id`, `sessions`, `transactions`)
    SELECT `day`, `campaign_id`, `filter_id`
        , COUNT(`session_id`) AS `sessions`
        , COUNT(`transaction_id` IS NOT NULL) AS `transactions`
    FROM Activity
    WHERE `filter_id` IS NOT NULL
    GROUP BY `day`, `campaign_id`, `filter_id`
ON DUPLICATE KEY UPDATE
    `sessions` = VALUES(`sessions`)
    , `transactions` = VALUES(`transactions`);
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文