用于选择首选糖果的高效 SQL 2000 查询
(我希望我能想出一个更具描述性的标题...如果您能说出我要询问的查询类型,请建议一个或编辑这篇文章)
数据库:SQL Server 2000
示例数据(假设 500,000 行):
Name Candy PreferenceFactor Jim Chocolate 1.0 Brad Lemon Drop .9 Brad Chocolate .1 Chris Chocolate .5 Chris Candy Cane .5 499,995 more rows...
请注意,具有给定“名称”的行数是无限的。
期望的查询结果:
Jim Chocolate 1.0 Brad Lemon Drop .9 Chris Chocolate .5 ~250,000 more rows...
(由于 Chris 对拐杖糖和巧克力的偏好相同,因此一致的结果就足够了)。
问题: 如何从数据中选择名称、糖果,其中每个结果行都包含唯一的名称,以便所选糖果的每个名称具有最高的 PreferenceFactor。 (快速有效的答案优先)。
表上需要哪些索引? 如果 Name 和 Candy 是另一个表的整数索引(除了需要一些联接之外),这会有什么不同吗?
(I wish I could have come up with a more descriptive title... suggest one or edit this post if you can name the type of query I'm asking about)
Database: SQL Server 2000
Sample Data (assume 500,000 rows):
Name Candy PreferenceFactor Jim Chocolate 1.0 Brad Lemon Drop .9 Brad Chocolate .1 Chris Chocolate .5 Chris Candy Cane .5 499,995 more rows...
Note that the number of rows with a given 'Name' is unbounded.
Desired Query Results:
Jim Chocolate 1.0 Brad Lemon Drop .9 Chris Chocolate .5 ~250,000 more rows...
(Since Chris has equal preference for Candy Cane and Chocolate, a consistent result is adequate).
Question:
How do I Select Name, Candy from data where each resulting row contains a unique Name such that the Candy selected has the highest PreferenceFactor for each Name. (speedy efficient answers preferred).
What indexes are required on the table? Does it make a difference if Name and Candy are integer indexes into another table (aside from requiring some joins)?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
您会发现以下查询优于给出的所有其他答案,因为它适用于单次扫描。 这模拟了 MS Access 的 First 和 Last 聚合函数,这基本上就是您正在做的事情。
当然,您的 CandyPreference 表中可能会使用外键而不是名称。 为了回答您的问题,实际上最好将 Candy 和 Name 作为另一个表的外键。
如果 CandyPreferences 表中还有其他列,那么拥有包含相关列的覆盖索引将产生更好的性能。 使列尽可能小将增加每页的行数并再次提高性能。 如果您最常使用 WHERE 条件执行查询来限制行,那么覆盖 WHERE 条件的索引就变得很重要。
Peter 的做法是正确的,但存在一些不必要的复杂性。
实际上我不推荐这种方法,除非性能至关重要。 “规范”的方法是 OrbMan 的标准 Max/GROUP BY 派生表,然后连接到它以获取选定的行。 但是,当有多个列参与 Max 的选择,并且选择器的最终组合可以重复时,即当没有列提供任意唯一性时(如这里的情况),该方法开始变得困难如果 PreferenceFactor 相同,我们就使用该名称。
编辑:最好提供更多使用说明,以帮助提高清晰度并帮助人们避免问题。
You will find that the following query outperforms every other answer given, as it works with a single scan. This simulates MS Access's First and Last aggregate functions, which is basically what you are doing.
Of course, you'll probably have foreign keys instead of names in your CandyPreference table. To answer your question, it is in fact very much best if Candy and Name are foreign keys into another table.
If there are other columns in the CandyPreferences table, then having a covering index that includes the involved columns will yield even better performance. Making the columns as small as possible will increase the rows per page and again increase performance. If you are most often doing the query with a WHERE condition to restrict rows, then an index that covers the WHERE conditions becomes important.
Peter was on the right track for this, but had some unneeded complexity.
I actually don't recommend this method unless performance is critical. The "canonical" way to do it is OrbMan's standard Max/GROUP BY derived table and then a join to it to get the selected row. Though, that method starts to become difficult when there are several columns that participate in the selection of the Max, and the final combination of selectors can be duplicated, that is, when there is no column to provide arbitrary uniqueness as in the case here where we use the name if the PreferenceFactor is the same.
Edit: It's probably best to give some more usage notes to help improve clarity and to help people avoid problems.
我尝试过:
这似乎有效,尽管没有真实数据和实际负载,我无法谈论效率。
不过,我确实在 PersonName 和 Candy 上创建了主键。 使用 SQL Server 2008 并且没有附加索引表明它使用了两个聚集索引扫描,因此情况可能会更糟。
我更多地使用了这个,因为我需要一个借口来使用“datadude”的数据生成计划功能。 首先,我重构了一个表,使其具有单独的糖果名称和人名表。 我这样做主要是因为它允许我使用测试数据生成而无需阅读文档。 架构变为:
查询变为:
对于 150,000 颗糖果、200,000 个人和 500,000 个 CandyPreferences,查询花费了大约 12 秒并生成了 200,000 行。
下面的结果让我很惊讶。 我更改了查询以删除最终的“漂亮”联接:
现在,对于 200,000 行,这需要两到三秒的时间。
现在,需要明确的是,我在这里所做的一切都不是为了提高此查询的性能:我认为 12 秒是成功的。 它现在表示 90% 的时间都花在聚集索引搜索上。
I tried:
This seems to work, though I can't speak to efficiency without real data and a realistic load.
I did create a primary key over PersonName and Candy, though. Using SQL Server 2008 and no additional indexes shows it using two clustered index scans though, so it could be worse.
I played with this a bit more because I needed an excuse to play with the Data Generation Plan capability of "datadude". First, I refactored the one table to have separate tables for candy names and person names. I did this mostly because it allowed me to use the test data generation without having to read the documentation. The schema became:
The query became:
With 150,000 candies, 200,000 persons, and 500,000 CandyPreferences, the query took about 12 seconds and produced 200,000 rows.
The following result surprised me. I changed the query to remove the final "pretty" joins:
This now takes two or three seconds for 200,000 rows.
Now, to be clear, nothing I've done here has been meant to improve the performance of this query: I considered 12 seconds to be a success. It now says it spends 90% of its time in a clustered index seek.
对 Emtucifor 解决方案发表评论(因为我无法定期发表评论)
我喜欢这个解决方案,但有一些关于如何改进它的评论(在这种特定情况下)。
如果将所有内容都放在一张表中,则效果不大,但像 John Saunders 的解决方案中那样,只有很少的表,情况就会有所不同。
当我们处理 [CandyPreferences] 表中的数字时,我们可以使用数学运算而不是串联来获取最大值。
我建议 PreferenceFactor 使用十进制而不是实数,因为我相信我们在这里不需要实际数据类型的大小,甚至进一步我建议十进制(n,n),其中n<10仅将小数部分存储在5个字节中。 假设十进制(3,3)就足够了(1000级偏好因子),我们可以做简单的
PackedData = Max(PreferenceFactor + CandyID)
此外,如果我们知道我们有少于 1,000,000 个 CandyID,我们可以添加转换为:
PackedData = Max( Cast(PreferenceFactor + CandyID asdecimal(9,3)))
允许 sql server 在临时表中使用 5 个字节
使用底函数拆包非常简单快捷。
Niikola
-- 后来添加 ---
我测试了 John 和 Emtucifor 的两种解决方案(修改为使用 John 的结构并使用我的建议)。 我还测试了有连接和没有连接的情况。
Emtucifor 的解决方案显然获胜,但利润并不大。 如果 SQL Server 必须执行一些物理读取,情况可能会有所不同,但在所有情况下它们都是 0。
以下是查询:
以及结果:
Comment on Emtucifor solution (as I cant make regular comments)
I like this solution, but have some comments how it could be improved (in this specific case).
It can't be done much if you have everything in one table, but having few tables as in John Saunders' solution will make things a bit different.
As we are dealing with numbers in [CandyPreferences] table we can use math operation instead of concatenation to get max value.
I suggest PreferenceFactor to be decimal instead of real, as I believe we don't need here size of real data type, and even further I would suggest decimal(n,n) where n<10 to have only decimal part stored in 5 bytes. Assume decimal(3,3) is enough (1000 levels of preference factor), we can do simple
PackedData = Max(PreferenceFactor + CandyID)
Further, if we know we have less than 1,000,000 CandyIDs we can add cast as:
PackedData = Max(Cast(PreferenceFactor + CandyID as decimal(9,3)))
allowing sql server to use 5 bytes in temporary table
Unpacking is easy and fast using floor function.
Niikola
-- ADDED LATER ---
I tested both solutions, John's and Emtucifor's (modified to use John's structure and using my suggestions). I tested also with and without joins.
Emtucifor's solution clearly wins, but margins are not huge. It could be different if SQL server had to perform some Physical reads, but they were 0 in all cases.
Here are the queries:
And the results:
您可以使用以下 select 语句
,但通过此选择,您将在结果集中获得“Chris”2 次。
如果您想获得用户最喜欢的食物而不是使用
我认为将名称和糖果更改为整数类型可能会帮助您提高性能。 您还应该在两列上插入索引。
[编辑]已更改! 到 @
you could use following select statements
but with this select you will get "Chris" 2 times in your result set.
if you want to get the the most preferred food by user than use
i think changing the name and candy to integer types might help you improve performance. you also should insert indexes on both columns.
[Edit] changed ! to @
通常需要对经常包含在 where 子句中的列建立索引。 在这种情况下,我想说名称和糖果列的索引将具有最高优先级。
列的查找表通常取决于列中重复值的数量。 在 250,000 行中,如果只有 50 个重复值,则确实需要在那里有整数引用(外键)。 在这种情况下,应该进行糖果引用,而名称引用实际上取决于数据库中不同人员的数量。
usually indexing is required on columns which are frequently included in where clause. In this case I would say indexing on name and candy columns would be of highest priority.
Having lookup tables for columns usually depends on number of repeating values with in columns. Out of 250,000 rows, if there are only 50 values that are repeating, you really need to have integer reference (foreign key) there. In this case, candy reference should be done and name reference really depends on the number of distinct people within the database.
我将您的列名称更改为 PersonName 以避免任何常见的保留字冲突。
I changed your column Name to PersonName to avoid any common reserved word conflicts.
这将为您提供与给定名称匹配的 PrefFactor 的行。
(例如,如果约翰对于柠檬和巧克力的 HighPref 为 1)。
请原谅我的回答,因为我是在没有 SQL 查询分析器的情况下编写的。
This should give you rows with matching PrefFactor for a given Name.
(e.g. if John as a HighPref of 1 for Lemon & Chocolate).
Pardon my answer as I am writing it without SQL Query Analyzer.
像这样的事情会起作用:
性能应该与方法一样好。 检查您的查询计划。
相关子查询中的 TOP 1 ... ORDER BY 允许我们为外部查询中每行返回的行指定任意规则。 在这种情况下,我们希望每个名字的偏好系数最高,并在决胜局中提供糖果。
子查询只能返回一个值,所以我们必须将糖果和偏好因子合并到一个字段中。 这里分号只是为了提高可读性,但在其他情况下,您可以使用它来解析外部查询中与 CHARINDEX 的组合字段。
如果您想要输出完全精确,您可以使用它(假设preferencefactor是一个浮点数):
然后用以下命令解析它:
Something like this would work:
Performance should be decent with with method. Check your query plan.
TOP 1 ... ORDER BY in a correlated subquery allows us to specify arbitrary rules for which row we want returned per row in the outer query. In this case, we want the highest preference factor per name, with candy for tie-breaks.
Subqueries can only return one value, so we must combine candy and preference factor into one field. The semicolon is just for readability here, but in other cases, you might use it to parse the combined field with CHARINDEX in the outer query.
If you wanted full precision in the output, you could use this instead (assuming preferencefactor is a float):
And then parse it back with:
我还测试了 ROW_NUMBER() 版本 + 添加了附加索引
创建索引 IX_z5CandyPreferences On z5CandyPreferences(PersonId,PrefernceFactor,CandyID)
Emtucifor 和 ROW_NUMBER() 版本(带有索引)之间的响应时间是边际的(如果有的话 - 测试应该重复几次次并取平均值,但我预计不会有任何显着差异)
这是查询:
以及带有和不带有新索引的结果:
I tested also ROW_NUMBER() version + added additional index
Create index IX_z5CandyPreferences On z5CandyPreferences(PersonId,PrefernceFactor,CandyID)
Response times between Emtucifor's and ROW_NUMBER() version (with index in place) are marginal (if any - test should be repeated number of times and take averages, but I expect there would not be any significant difference)
Here is query:
and results with and without new index: