SQL Server 跨行压缩

发布于 2024-08-03 21:39:09 字数 1025 浏览 5 评论 0原文

我必须以特定顺序返回约 70,000 行 4 列 INT,并且只能使用非常浅的缓存,因为所涉及的数据非常不稳定并且必须是最新的。数据的一个特性是,当它有序时,它通常是高度重复的。

我已经开始研究减少行数的各种方法,以减少网络带宽和客户端处理时间/资源,但没有设法在 T-SQL 中找到任何可以“压缩”重复行的技术分成一行和一个“计数”列。例如

prop1    prop2    prop3    prop4
--------------------------------
0        0        1        53
0        0        2        55
1        1        1        8
1        1        1        8
1        1        1        8
1        1        1        8
0        0        2        55
0        0        2        55
0        0        1        53

prop1    prop2    prop3    prop4    count
-----------------------------------------
0        0        1        53       1
0        0        2        55       1
1        1        1        8        4
0        0        2        55       2
0        0        1        53       1

我估计,如果这是可能的,在许多情况下,70,000 行的结果集将减少到最多几千行。

我是否在这里咆哮错误的树(是否有隐式压缩作为 SQL Server 协议的一部分)?

有没有办法做到这一点(SQL Server 2005)?

我有理由不应该这样做吗?

谢谢。

I'm having to return ~70,000 rows of 4 columns of INTs in a specific order and can only use very shallow caching as the data involved is highly volatile and has to be up to date. One property of the data is that it is often highly repetitive when it is in order.

I've started to look at various methods of reducing the row count in order to reduce network bandwidth and client side processing time/resources, but have not managed to find any kind of technique in T-SQL where I can 'compress' repetative rows down into a single row and a 'count' column. e.g.

prop1    prop2    prop3    prop4
--------------------------------
0        0        1        53
0        0        2        55
1        1        1        8
1        1        1        8
1        1        1        8
1        1        1        8
0        0        2        55
0        0        2        55
0        0        1        53

Into:

prop1    prop2    prop3    prop4    count
-----------------------------------------
0        0        1        53       1
0        0        2        55       1
1        1        1        8        4
0        0        2        55       2
0        0        1        53       1

I'd estimate that if this was possible, in many cases what would be a 70,000 row result set would be down to a few thousand at most.

Am I barking up the wrong tree here (is there implicit compression as part of the SQL Server protocol)?

Is there a way to do this (SQL Server 2005)?

Is there a reason I shouldn't do this?

Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

无人问我粥可暖 2024-08-10 21:39:09

您可以使用count函数!这将要求您使用 group by 子句,在其中告诉 count 如何分解或 group 本身。 Gropu by 用于任何聚合函数 在 SQL 中。

select
    prop1,
    prop2,
    prop3,
    prop4,
    count(*) as count
from
    tbl
group by
    prop1,
    prop2,
    prop3,
    prop4,
    y,
    x
order by y, x

更新:OP提到这些是按yx排序的,而不是结果集的一部分。在这种情况下,您仍然可以使用 yx 作为 group by 的一部分。

请记住,如果没有排序列,顺序就没有任何意义,因此在这种情况下,我们必须使用 组中的 yx 来尊重这一点通过

You can use the count function! This will require you to use the group by clause, where you tell count how to break up, or group, itself. Gropu by is used for any aggregate function in SQL.

select
    prop1,
    prop2,
    prop3,
    prop4,
    count(*) as count
from
    tbl
group by
    prop1,
    prop2,
    prop3,
    prop4,
    y,
    x
order by y, x

Update: The OP mentioned these are ordered by y and x, not part of the result set. In this case, you can still use y and x as part of the group by.

Keep in mind that order means nothing if it doesn't have ordering columns, so in this case, we have to respect that with y and x in the group by.

要走就滚别墨迹 2024-08-10 21:39:09

这会起作用,尽管看起来很痛苦:

;WITH Ordering
AS
(
  SELECT Prop1,        
  Prop2,        
  Prop3,        
  Prop4,
  ROW_NUMBER() OVER (ORDER BY Y, X) RN
  FROM Props
)
SELECT 
  CurrentRow.Prop1, 
  CurrentRow.Prop2, 
  CurrentRow.Prop3, 
  CurrentRow.Prop4, 
  CurrentRow.RN - 
    ISNULL((SELECT TOP 1 RN FROM Ordering O3 WHERE RN < CurrentRow.RN AND (CurrentRow.Prop1 <> O3.Prop1 OR CurrentRow.Prop2 <> O3.Prop2 OR CurrentRow.Prop3 <> O3.Prop3 OR CurrentRow.Prop4 <> O3.Prop4) ORDER BY RN DESC), 0) Repetitions
FROM Ordering CurrentRow
LEFT JOIN Ordering O2 ON CurrentRow.RN + 1 = O2.RN
WHERE O2.RN IS NULL OR (CurrentRow.Prop1 <> O2.Prop1 OR CurrentRow.Prop2 <> O2.Prop2 OR CurrentRow.Prop3 <> O2.Prop3 OR CurrentRow.Prop4 <> O2.Prop4) 
ORDER BY CurrentRow.RN

要点如下:

  1. 使用 ROW_NUMBER OVER 枚举每一行以获得正确的顺序。
  2. 仅当下一行具有不同字段或下一行不存在时才通过连接查找每个周期的最大值。
  3. 通过获取当前行号(假定为本周期的最大值)并从中减去上一个周期的最大行号(如果存在)来计算重复次数。

This will work, though it is painful to look at:

;WITH Ordering
AS
(
  SELECT Prop1,        
  Prop2,        
  Prop3,        
  Prop4,
  ROW_NUMBER() OVER (ORDER BY Y, X) RN
  FROM Props
)
SELECT 
  CurrentRow.Prop1, 
  CurrentRow.Prop2, 
  CurrentRow.Prop3, 
  CurrentRow.Prop4, 
  CurrentRow.RN - 
    ISNULL((SELECT TOP 1 RN FROM Ordering O3 WHERE RN < CurrentRow.RN AND (CurrentRow.Prop1 <> O3.Prop1 OR CurrentRow.Prop2 <> O3.Prop2 OR CurrentRow.Prop3 <> O3.Prop3 OR CurrentRow.Prop4 <> O3.Prop4) ORDER BY RN DESC), 0) Repetitions
FROM Ordering CurrentRow
LEFT JOIN Ordering O2 ON CurrentRow.RN + 1 = O2.RN
WHERE O2.RN IS NULL OR (CurrentRow.Prop1 <> O2.Prop1 OR CurrentRow.Prop2 <> O2.Prop2 OR CurrentRow.Prop3 <> O2.Prop3 OR CurrentRow.Prop4 <> O2.Prop4) 
ORDER BY CurrentRow.RN

The gist is the following:

  1. Enumerate each row using ROW_NUMBER OVER to get the correct order.
  2. Find the maximums per cycle by joining only when the next row has different fields or when the next row does not exist.
  3. Figure out the count of repetitions is by taking the current row number (presumed to be the max for this cycle) and subtracting from it the maximum row number of the previous cycle, if it exists.
镜花水月 2024-08-10 21:39:09

70,000 行(每列四个整数)对于现代 LAN 上的带宽来说并不是真正的问题,除非您有许多工作站同时执行此查询;在带宽更受限制的 WAN 上,您可以使用 DISTINCT 来消除重复行,这种方法可以节省带宽,但会消耗一些服务器 CPU。然而,除非您有一台真正超载的服务器,并且始终以峰值负载或接近峰值负载运行,否则这种额外的消耗只是一个短暂的现象。 70,000 行几乎没什么。

70,000 rows of four integer columns is not really a worry for bandwidth on a modern LAN, unless you have many workstations executing this query concurrently; and on a WAN with more restricted bandwidth you could use DISTINCT to eliminate duplicate rows, an approach which would be frugal with your bandwidth but consume some server CPU. Again, however, unless you have a really overloaded server that is always performing at or near peak loads, this additional consumption would be a mere blip. 70,000 rows is next to nothing.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文