重构使用 row_number() 返回具有唯一列值的行的 tsql 视图
我有一个 sql 视图,用于检索数据。假设这是一个很大的产品列表,这些产品与购买它们的客户相关联。该视图应该只为每个产品返回一行,无论它链接到多少个客户。我正在使用 row_number 函数来实现此目的。 (这个例子是简化的,一般情况是一个查询,其中对于某个列 X 的每个唯一值应该只返回一行。返回哪一行并不重要)
CREATE VIEW productView AS
SELECT * FROM
(SELECT
Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
customer.Id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
) as temp
WHERE temp.prodcut_numbering = 1
现在让我们说这个视图中的总行数约为 100 万,运行 select * from ProductView 需要 10 秒。执行诸如 select * from ProductView where ProductID = 10 之类的查询需要相同的时间。我相信这是因为查询被评估为这个
SELECT * FROM
(SELECT
Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
customer.Id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
) as temp
WHERE prodcut_numbering = 1 and prodcut.Id = 10
我认为这导致内部子查询每次都被完整评估。理想情况下,我想使用以下内容
SELECT
Row_number() OVER(PARTITION BY products.productID ORDER BY products.productID) AS product_numbering,
customer.id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
WHERE prodcut_numbering = 1
,但这似乎是不允许的。有没有办法做类似的事情?
编辑 -
经过多次实验,我相信我遇到的实际问题是如何强制连接恰好返回 1 行。我尝试使用外部应用,如下所示。一些示例代码。
CREATE TABLE Products (id int not null PRIMARY KEY)
CREATE TABLE Customers (
id int not null PRIMARY KEY,
productId int not null,
value varchar(20) NOT NULL)
declare @count int = 1
while @count <= 150000
begin
insert into Customers (id, productID, value)
values (@count,@count/2, 'Value ' + cast(@count/2 as varchar))
insert into Products (id)
values (@count)
SET @count = @count + 1
end
CREATE NONCLUSTERED INDEX productId ON Customers (productID ASC)
对于上述示例集,下面的“获取所有内容”查询
select * from Products
outer apply (select top 1 *
from Customers
where Products.id = Customers.productID) Customers
需要约 1000 毫秒才能运行。添加显式条件:
select * from Products
outer apply (select top 1 *
from Customers
where Products.id = Customers.productID) Customers
where Customers.value = 'Value 45872'
花费相同的时间。对于一个相当简单的查询来说,这 1000 毫秒已经太多了,并且在添加其他类似的连接时会以错误的方式(向上)扩展。
I have a sql view, which I'm using to retrieve data. Lets say its a large list of products, which are linked to the customers who have bought them. The view should return only one row per product, no matter how many customers it is linked to. I'm using the row_number function to achieve this. (This example is simplified, the generic situation would be a query where there should only be one row returned for each unique value of some column X. Which row is returned is not important)
CREATE VIEW productView AS
SELECT * FROM
(SELECT
Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
customer.Id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
) as temp
WHERE temp.prodcut_numbering = 1
Now lets say that the total number of rows in this view is ~1 million, and running select * from productView takes 10 seconds. Performing a query such as select * from productView where productID = 10 takes the same amount of time. I believe this is because the query gets evaluated to this
SELECT * FROM
(SELECT
Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
customer.Id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
) as temp
WHERE prodcut_numbering = 1 and prodcut.Id = 10
I think this is causing the inner subquery to be evaluated in full each time. Ideally I'd like to use something along the following lines
SELECT
Row_number() OVER(PARTITION BY products.productID ORDER BY products.productID) AS product_numbering,
customer.id
//various other columns
FROM products
LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
//various other joins
WHERE prodcut_numbering = 1
But this doesn't seem to be allowed. Is there any way to do something similar?
EDIT -
After much experimentation, the actual problem I believe I am having is how to force a join to return exactly 1 row. I tried to use outer apply, as suggested below. Some sample code.
CREATE TABLE Products (id int not null PRIMARY KEY)
CREATE TABLE Customers (
id int not null PRIMARY KEY,
productId int not null,
value varchar(20) NOT NULL)
declare @count int = 1
while @count <= 150000
begin
insert into Customers (id, productID, value)
values (@count,@count/2, 'Value ' + cast(@count/2 as varchar))
insert into Products (id)
values (@count)
SET @count = @count + 1
end
CREATE NONCLUSTERED INDEX productId ON Customers (productID ASC)
With the above sample set, the 'get everything' query below
select * from Products
outer apply (select top 1 *
from Customers
where Products.id = Customers.productID) Customers
takes ~1000ms to run. Adding an explicit condition:
select * from Products
outer apply (select top 1 *
from Customers
where Products.id = Customers.productID) Customers
where Customers.value = 'Value 45872'
Takes some identical amount of time. This 1000ms for a fairly simple query is already too much, and scales the wrong way (upwards) when adding additional similar joins.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
使用通用表表达式 (CTE) 尝试以下方法。根据您提供的测试数据,它会在不到一秒的时间内返回特定的 ProductId。
Try the following approach, using a Common Table Expression (CTE). With the test data you provided, it returns specific ProductIds in less than a second.
如果您执行以下操作该怎么办:
那么 ProductId 上的过滤器应该会有所帮助。不过,如果不进行过滤,情况可能会更糟。
What if you did something like:
Then the filter on productId should help. It might be worse without filtering, though.
问题是你的数据模型有缺陷。您应该有三个表:
此外,销售表可能应该拆分为一对多(Sales 和 SalesDetails)。除非你修复你的数据模型,否则你只会绕着尾巴转圈,追逐转移注意力的问题。如果系统不是您的设计,请修复它。如果老板不让你修,那就修吧。如果你无法修复它,那就修复它。对于您提出的错误数据模型,没有简单的解决方法。
The problem is that your data model is flawed. You should have three tables:
Furthermore, the sale table should probably be split into 1-to-many (Sales and SalesDetails). Unless you fix your data model you're just going to run circles around your tail chasing red-herring problems. If the system is not your design, fix it. If the boss doesn't let your fix it, then fix it. If you cannot fix it, then fix it. There isn't a easy way out for the bad data model you're proposing.
如果您真的不在乎带回哪个客户,这可能已经足够快了
this will probably be fast enough if you really don't care which customer you bring back