SQL query - selecting the "last updated" record in a group, better database design?
Let's say I have a MySQL database with 3 tables:
table 1: Persons, with 1 column ID (int)
table 2: Newsletters, with 1 column ID (int)
table 3: Subscriptions, with columns Person_ID (int), Newsletter_ID (int), Subscribed (bool), Updated (Datetime)
Subscriptions.Person_ID points to a Person, and Subscriptions.Newsletter_ID points to a Newsletter. Thus, each person may be subscribed to 0 or more newsletters at once. The table Subscriptions will also store the entire history of each person's subscriptions to each newsletter. If a particular Person_ID-Newsletter_ID pair doesn't have a row in the Subscriptions table, then it's equivalent to that pair having a subscription status of 'false'.
Here is a sample dataset:

Persons
ID
1
2
3

Newsletters
ID
1
2
3

Subscriptions
Person_ID  Newsletter_ID  Subscribed  Updated
2          1              true        2010-05-01
3          1              true        2010-05-01
3          2              true        2010-05-10
3          1              false       2010-05-15
Thus, as of 2010-05-16, Person 1 has no subscription, Person 2 has a subscription to Newsletter 1, and Person 3 has a subscription to Newsletter 2. Person 3 had a subscription to Newsletter 1 for a while, but not anymore.
I'm trying to do 2 kinds of query:
1. A query that shows everyone's active subscriptions as of query time (we can assume that Updated will never be in the future). This means returning the record with the latest Updated value for each Person_ID-Newsletter_ID pair, as long as Subscribed is true. If the latest record for a Person_ID-Newsletter_ID pair has a Subscribed status of false, then I don't want that record returned.
2. A query that returns all active subscriptions for a specific newsletter, with the same qualification as in 1. regarding records with 'false' in the Subscribed column.
I don't use SQL/databases often enough to tell if this design is good, or if the SQL queries needed would be slow on a database with, say, 1M records in the Subscriptions table.
I was using the Visual query builder tool in Visual Studio 2010 but I can't even get the query to return the latest updated record for each Person_ID-Newsletter_ID pair.
Is it possible to come up with SQL queries that don't involve subqueries (presumably because they would become too slow with a larger data set)? If not, would it be a better design to have a separate Subscriptions_History table, so that every time a subscription status for a Person_ID-Newsletter_ID pair is added to Subscriptions, any existing record for that pair is moved to Subscriptions_History (that way the Subscriptions table only ever contains the latest status update for any Person_ID-Newsletter_ID pair)?
I'm using .NET on Windows, so would it be easier (or the same, or harder) to do this kind of query using LINQ? Entity Framework?
Edit: Here's what happens if I use this query:
SELECT Person_ID, Newsletter_ID, Subscribed, Updated, MAX(Updated) AS Expr1
FROM subscriptions
GROUP BY Person_ID, Newsletter_ID
I get rows 2 and 4 from the Subscriptions table mishmashed together (in row 2 of the result set below):
Person_ID  Newsletter_ID  Subscribed  Updated     Expr1
2          1              true        2010-05-01  2010-05-01
3          1              true        2010-05-01  2010-05-15
3          2              true        2010-05-10  2010-05-10
Thanks!
I've recently run into a somewhat similar problem.
I'm not an SQL expert, so I can't really give much advice on what's the best design for this. But until the pros chip in, maybe this helps:
Notice I've added an ID column to your subscriptions table (I'll explain why in a sec).
Now, let's break down how this works (or how I think it works, anyway; I'd be glad to be corrected if I'm wrong).
First, you retrieve all records for a given person / newsletter. This is what the subquery does (yes, I know you said you'd rather not have subqueries, but I'm not sure you can do it without one). I'm grouping by person_id and newsletter_id, so the subquery returns one row per person / newsletter pair. Notice I'm selecting MAX(ID). If you use an auto-increment ID and it's safe to assume the row with the highest number in the ID column is the newest one for the group (i.e. if you don't insert IDs manually), this subquery will get you the ID of the last row for each person / newsletter.
So, you can join this with the subscriptions table: the join condition is that the ID of the subscriptions row has to match the MAX id you retrieved from the subquery. Here you are only considering the most recent record for each newsletter/person. Then, you factor out inactive subscriptions by using a WHERE condition.
If you want to restrict the result to a given newsletter (or a given person), add that condition to the WHERE clause.
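The query itself did not survive in this copy of the answer, so here is a rough sketch of what the description above seems to amount to, not the answerer's exact SQL. It assumes an auto-increment ID column as described; the in-memory SQLite database (driven from Python so the example is runnable) is only a stand-in for MySQL, and the alias names `latest` and `Max_ID` are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the SELECT is plain SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subscriptions (
    ID            INTEGER PRIMARY KEY AUTOINCREMENT,  -- the added surrogate key
    Person_ID     INTEGER NOT NULL,
    Newsletter_ID INTEGER NOT NULL,
    Subscribed    INTEGER NOT NULL,                   -- 1 = true, 0 = false
    Updated       TEXT    NOT NULL
);
-- Sample data from the question, inserted in chronological order.
INSERT INTO Subscriptions (Person_ID, Newsletter_ID, Subscribed, Updated) VALUES
    (2, 1, 1, '2010-05-01'),
    (3, 1, 1, '2010-05-01'),
    (3, 2, 1, '2010-05-10'),
    (3, 1, 0, '2010-05-15');
""")

# Subquery: highest ID per person/newsletter pair.
# Outer query: join back on that ID, then drop inactive subscriptions.
ACTIVE_SQL = """
SELECT s.Person_ID, s.Newsletter_ID, s.Updated
FROM Subscriptions AS s
JOIN (
    SELECT MAX(ID) AS Max_ID
    FROM Subscriptions
    GROUP BY Person_ID, Newsletter_ID
) AS latest ON latest.Max_ID = s.ID
WHERE s.Subscribed = 1
ORDER BY s.Person_ID, s.Newsletter_ID
"""

active = conn.execute(ACTIVE_SQL).fetchall()
print(active)  # [(2, 1, '2010-05-01'), (3, 2, '2010-05-10')]
```

To restrict the result to one newsletter, an extra condition such as `AND s.Newsletter_ID = ?` could be appended to the WHERE clause.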
Indexes should help make this query run faster.
Hope this helps.
Added
If for some reason you can't guarantee that MAX(Subscriptions.ID) will correspond to the last inserted row, you can probably do something like this (which follows the same logic, I think, but is a bit more verbose and probably less efficient):
New edit
On second thought, the alternative I've added (the one with MAX(Updated)) is wrong, I think. You can't know for sure that the Newsletter_ID and Person_ID selected in the subquery will be the ones corresponding to the MAX(Updated) row. Since these columns are used for the join condition, this query could give bogus results.
Separate your Subscriptions table into 2 tables:
[...] true for now): Person_Id | Newsletter_Id
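The details of this answer are incomplete in this copy, so the following is only one possible reading of a two-table split, along the lines of the Subscriptions_History idea from the question. The helper `set_status`, the exact column layout, and the use of an in-memory SQLite database (as a runnable stand-in for MySQL) are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Current status only: at most one row per person/newsletter pair.
CREATE TABLE Subscriptions (
    Person_ID     INTEGER NOT NULL,
    Newsletter_ID INTEGER NOT NULL,
    Subscribed    INTEGER NOT NULL,   -- 1 = true, 0 = false
    Updated       TEXT    NOT NULL,
    PRIMARY KEY (Person_ID, Newsletter_ID)
);
-- Superseded rows are moved here.
CREATE TABLE Subscriptions_History (
    Person_ID     INTEGER NOT NULL,
    Newsletter_ID INTEGER NOT NULL,
    Subscribed    INTEGER NOT NULL,
    Updated       TEXT    NOT NULL
);
""")

def set_status(conn, person_id, newsletter_id, subscribed, updated):
    """Hypothetical helper: archive the old row (if any), then store the new one."""
    pair = (person_id, newsletter_id)
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO Subscriptions_History "
            "SELECT Person_ID, Newsletter_ID, Subscribed, Updated "
            "FROM Subscriptions WHERE Person_ID = ? AND Newsletter_ID = ?", pair)
        conn.execute(
            "DELETE FROM Subscriptions WHERE Person_ID = ? AND Newsletter_ID = ?", pair)
        conn.execute(
            "INSERT INTO Subscriptions VALUES (?, ?, ?, ?)",
            (person_id, newsletter_id, int(subscribed), updated))

# Replay the sample data from the question.
set_status(conn, 2, 1, True, "2010-05-01")
set_status(conn, 3, 1, True, "2010-05-01")
set_status(conn, 3, 2, True, "2010-05-10")
set_status(conn, 3, 1, False, "2010-05-15")

# Active subscriptions now need no subquery at all.
current_active = conn.execute(
    "SELECT Person_ID, Newsletter_ID FROM Subscriptions "
    "WHERE Subscribed = 1 ORDER BY Person_ID, Newsletter_ID").fetchall()
history = conn.execute("SELECT * FROM Subscriptions_History").fetchall()
print(current_active)  # [(2, 1), (3, 2)]
print(history)         # [(3, 1, 1, '2010-05-01')]
```

The trade-off in this sketch is that every write touches two tables, in exchange for reads on the current table that are a simple filter.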
"Ordered analytical functions" are a standard method for this type of problem. 1M records, no problem... depending on the power of your machine, of course.
MAX(Updated) OVER (PARTITION BY <list of fields over which you want the "max">)
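As a hedged illustration of how that expression might be used on the question's table: the sketch below partitions by Person_ID and Newsletter_ID and keeps only rows whose Updated equals the per-pair maximum. It assumes a database with window-function support (MySQL added them in 8.0; the runnable stand-in here is in-memory SQLite, which needs version 3.25 or later). The alias `Max_Updated` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subscriptions (
    Person_ID     INTEGER NOT NULL,
    Newsletter_ID INTEGER NOT NULL,
    Subscribed    INTEGER NOT NULL,   -- 1 = true, 0 = false
    Updated       TEXT    NOT NULL    -- ISO dates sort correctly as text
);
INSERT INTO Subscriptions VALUES
    (2, 1, 1, '2010-05-01'),
    (3, 1, 1, '2010-05-01'),
    (3, 2, 1, '2010-05-10'),
    (3, 1, 0, '2010-05-15');
""")

# The inner query tags every row with the latest Updated value of its pair;
# the outer query keeps the latest row per pair if it is still subscribed.
WINDOW_SQL = """
SELECT Person_ID, Newsletter_ID, Updated
FROM (
    SELECT Person_ID, Newsletter_ID, Subscribed, Updated,
           MAX(Updated) OVER (PARTITION BY Person_ID, Newsletter_ID) AS Max_Updated
    FROM Subscriptions
) AS t
WHERE Updated = Max_Updated
  AND Subscribed = 1
ORDER BY Person_ID, Newsletter_ID
"""

rows = conn.execute(WINDOW_SQL).fetchall()
print(rows)  # [(2, 1, '2010-05-01'), (3, 2, '2010-05-10')]
```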
I think your design is pretty good. There is no inherent slowness in subqueries; use them if they are the best way to express your query.
Here's the query that gets you all of the latest (i.e. not overridden) directives:
Then you can use this query as a subquery of another query to get what you want. For your query #1:
For query #2:
You'll definitely want an index on Newsletter_ID in the Subscriptions table, as this query will likely be very selective.
Edit:
Whoops, the Subscriptions column in the subquery can come from an arbitrary row, not the one that generates the MAX(Updated). You have to rejoin with the original table:
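The queries from this answer are missing in this copy, so what follows is a sketch of the rejoin it describes rather than the original SQL. It assumes the question's table layout, uses in-memory SQLite as a runnable stand-in for MySQL, and introduces the names `LATEST_SQL`, `Max_Updated`, and `idx_subscriptions_newsletter` purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subscriptions (
    Person_ID     INTEGER NOT NULL,
    Newsletter_ID INTEGER NOT NULL,
    Subscribed    INTEGER NOT NULL,   -- 1 = true, 0 = false
    Updated       TEXT    NOT NULL
);
-- Index suggested in the answer for the per-newsletter query.
CREATE INDEX idx_subscriptions_newsletter ON Subscriptions (Newsletter_ID);
INSERT INTO Subscriptions VALUES
    (2, 1, 1, '2010-05-01'),
    (3, 1, 1, '2010-05-01'),
    (3, 2, 1, '2010-05-10'),
    (3, 1, 0, '2010-05-15');
""")

# The grouped subquery returns only the grouping columns and the aggregate,
# then rejoins the original table to fetch Subscribed from the matching row.
LATEST_SQL = """
SELECT s.Person_ID, s.Newsletter_ID, s.Subscribed, s.Updated
FROM Subscriptions AS s
JOIN (
    SELECT Person_ID, Newsletter_ID, MAX(Updated) AS Max_Updated
    FROM Subscriptions
    GROUP BY Person_ID, Newsletter_ID
) AS m
  ON  m.Person_ID     = s.Person_ID
  AND m.Newsletter_ID = s.Newsletter_ID
  AND m.Max_Updated   = s.Updated
"""

# Query #1: everyone's active subscriptions.
q1_rows = conn.execute(
    f"SELECT Person_ID, Newsletter_ID FROM ({LATEST_SQL}) AS latest "
    "WHERE Subscribed = 1 ORDER BY Person_ID, Newsletter_ID").fetchall()

# Query #2: active subscribers of one newsletter (here newsletter 1).
q2_rows = conn.execute(
    f"SELECT Person_ID FROM ({LATEST_SQL}) AS latest "
    "WHERE Subscribed = 1 AND Newsletter_ID = ? ORDER BY Person_ID", (1,)).fetchall()

print(q1_rows)  # [(2, 1), (3, 2)]
print(q2_rows)  # [(2,)]
```

One caveat with this approach: if two rows for the same pair share the same Updated value, both come back, so a tie-breaker (such as an auto-increment ID) may be needed.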