使用 ROW_NUMBER() 随机获取记录时出现问题 (SQL Server 2005)

发布于 2024-08-22 23:15:31 字数 996 浏览 4 评论 0原文

我想从表中随机获取 1000 条记录,所以我使用:

SELECT top 1000 
       mycol1
     , mycol2
     , ROW_NUMBER() OVER (ORDER BY NEWID()) rn
FROM mytable

但是,我不想在结果集中看到 rn,所以我这样做:

SELECT mycol1
     , mycol2
FROM (
    SELECT top 1000 
           mycol1
         , mycol2
         , ROW_NUMBER() OVER (ORDER BY NEWID()) rn
    FROM mytable
) a

当我这样做时,结果不会出现不再随机了。它们就像我刚刚说的前 10000 个没有使用 row_number() 进行随机化一样。

当我将查询更改为时,

SELECT mycol1
     , mycol2
     , rn
FROM (
    SELECT top 1000 
           mycol1
         , mycol2
         , ROW_NUMBER() OVER (ORDER BY NEWID()) rn
    FROM mytable
) a

它们又是随机的。

我猜 sql server 做了某种优化,说“嘿,这家伙无论如何都不需要列 rn,所以忽略它”。但在本例中这会导致意外的行为。有什么办法可以避免这种情况吗?

PS:我使用 ROW_NUMBER() 技巧,因为 mytable 有 10 mio。行并

SELECT top 10000 *
FROM mytable
ORDER BY NEWID()

永远运行,而使用 ROW_NUMBER() 最多只需要 30 秒。

I want to get 1000 records from a table randomly, so I use:

SELECT top 1000 
       mycol1
     , mycol2
     , ROW_NUMBER() OVER (ORDER BY NEWID()) rn
FROM mytable

However, I don't want to see rn in my resultset, so I do:

SELECT mycol1
     , mycol2
FROM (
    SELECT top 1000 
           mycol1
         , mycol2
         , ROW_NUMBER() OVER (ORDER BY NEWID()) rn
    FROM mytable
) a

When I do this, the results do not come randomly anymore. They come as if I just said top 10000 without randomization using row_number().

When I change the query to

SELECT mycol1
     , mycol2
     , rn
FROM (
    SELECT top 1000 
           mycol1
         , mycol2
         , ROW_NUMBER() OVER (ORDER BY NEWID()) rn
    FROM mytable
) a

they are random again.

I guess sql server does some kind of optimization, saying "hey, this guy doesn't need the column rn anyway, so just ignore it". But this results to an unexpected behavior in this case. Is there any way to avoid this?

PS: I use the ROW_NUMBER() trick because mytable has 10 mio. rows and

SELECT top 10000 *
FROM mytable
ORDER BY NEWID()

runs forever, whereas with ROW_NUMBER() it takes only up to 30 secs.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

青衫负雪 2024-08-29 23:15:31

您还可以尝试在一些小的 where 子句中使用 rn 字段,例如

WHERE rn > 0 在外部查询中,这可能会强制编译器通过 RN 字段。

另外,我认为如果您想对全部数百万条记录进行随机采样,您的整体查询将成为一个问题。这只会获取“第一个磁盘外”记录块,虽然不能保证相同,但通常会是相同的 10000 个。

我建议在 MIN(PrimaryKey) 和 MAX( PrimaryKey),然后执行 WHERE PrimaryKey IN (...) 或类似操作

You could also try using the rn field in some petty where clause like

WHERE rn > 0 in your outer query which would maybe force the compiler to bring the RN field through.

Also I think your overall query is going to be an issue if you want to randomly sample your entire millions of records. This will only grab the "first off disk" block of records which while not guaranteed to be the same will more often than not be the same 10000.

I would suggest creating a set of 10,000 random numbers between MIN(PrimaryKey) and the MAX(PrimaryKey) and then doing a WHERE PrimaryKey IN (...) or similar

陌伤ぢ 2024-08-29 23:15:31

在外部查询中添加类似 Where rn Is Not Null 的内容,以便 rn 包含在查询计划中并且不会被优化掉

Add something like Where rn Is Not Null to the outer query so rn it is included in query plan and not optimised out

御守 2024-08-29 23:15:31

我也在为同样的问题而苦苦挣扎。我用 CROSS APPLY 和 TOP 解决了这个问题。请记住,CROSS APPLY 将我的外部表拉入派生表的范围,我知道必须有一种方法可以做到这一点。

以下代码会根据制造商添加 3(*) 个随机相关产品。

INSERT INTO     ProductGroup (
                    ParentId,
                    ChildId
                )
SELECT          DISTINCT
                P.ProductId,
                CandidateInner.ChildId
FROM            ProductRelated PR 
JOIN            Product P
ON              PR.ChildId = P.ProductId
CROSS APPLY     
                (   
                    SELECT      DISTINCT TOP 3
                                NewId() AS RandId,
                                Product.ManufacturerId,
                                ProductRelated.ChildId 
                    FROM        ProductRelated 
                    JOIN        Product 
                    ON          Product.ProductId = ProductRelated.ChildId
                    WHERE       ManufacturerId IS NOT NULL
                    AND         Product.ManufacturerId = P.ManufacturerId
                    ORDER BY    NewId()
                ) CandidateInner
LEFT JOIN       (
                    SELECT      DISTINCT TOP 100 PERCENT
                                ParentId,
                                COUNT(DISTINCT ChildId) AS Ct
                    FROM        ProductGroup 
                    GROUP BY    ParentId
                    HAVING      COUNT(DISTINCT ChildId) >= 3
                ) AlreadyGrouped
ON              P.ProductId = AlreadyGrouped.ParentId
WHERE           P.ProductId <> CandidateInner.ChildId
AND             AlreadyGrouped.ParentId IS NULL
ORDER BY        P.ProductId

*请注意,这将在以下 2 种情况下插入少于 3 个:

1) 其中有 < 3 制造商相关产品
2)(有问题)随机前 3 个返回相同产品的情况。

(1)上述情况是不可避免的。

我处理上面(2)的方法是运行两次然后删除重复项。虽然这还不是 100%,但从统计上看,已经足够满足我的要求了。这是每晚运行的脚本,但我仍然喜欢 <> 的速度。在 CROSS APPLY 之外 - 任何拉动检查范围的操作都会导致对制造商联接产生的派生表进行扫描,即使将其拉入内部将意味着 (2) 不再是问题,但与即时相比,它的速度非常慢适当的索引。

I was struggling with this same problem. I solved it with CROSS APPLY and TOP. Keeping in mind that CROSS APPLY pulls my outer table into scope for the derived table, I knew there had to be a way to do this.

The following code results in 3(*) random related products being added based on the manufacturer.

INSERT INTO     ProductGroup (
                    ParentId,
                    ChildId
                )
SELECT          DISTINCT
                P.ProductId,
                CandidateInner.ChildId
FROM            ProductRelated PR 
JOIN            Product P
ON              PR.ChildId = P.ProductId
CROSS APPLY     
                (   
                    SELECT      DISTINCT TOP 3
                                NewId() AS RandId,
                                Product.ManufacturerId,
                                ProductRelated.ChildId 
                    FROM        ProductRelated 
                    JOIN        Product 
                    ON          Product.ProductId = ProductRelated.ChildId
                    WHERE       ManufacturerId IS NOT NULL
                    AND         Product.ManufacturerId = P.ManufacturerId
                    ORDER BY    NewId()
                ) CandidateInner
LEFT JOIN       (
                    SELECT      DISTINCT TOP 100 PERCENT
                                ParentId,
                                COUNT(DISTINCT ChildId) AS Ct
                    FROM        ProductGroup 
                    GROUP BY    ParentId
                    HAVING      COUNT(DISTINCT ChildId) >= 3
                ) AlreadyGrouped
ON              P.ProductId = AlreadyGrouped.ParentId
WHERE           P.ProductId <> CandidateInner.ChildId
AND             AlreadyGrouped.ParentId IS NULL
ORDER BY        P.ProductId

*Note that this will insert fewer than 3 in the following 2 cases:

1) Where there are < 3 products related by manufacturer
2) (Problematic) Where the random top 3 returns the same product to itself.

(1) above is unavoidable.

The way I handled (2) above was to run this twice then delete duplicates. This is still not 100%, but statistically, it's more than sufficient for my requirement. This is in a nightly-run script, but I still like the speediness of having the <> outside of the CROSS APPLY - anything pulling that check in scope results in scans of the derived tables resulting from the manufacturer join, even though pulling it inside will mean that (2) is no longer an issue, but it's painfully slow vs. instant with proper indexes.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文