Fastest way to count the exact number of rows in a very large table?

Posted 2024-11-08 14:43:03 · 510 characters · 2 views · 0 comments


I have come across articles that state that SELECT COUNT(*) FROM TABLE_NAME will be slow when the table has lots of rows and lots of columns.

I have a table that might contain even billions of rows [it has approximately 15 columns]. Is there a better way to get the EXACT count of the number of rows of a table?

Please consider the following before your answer:

  • I am looking for a database vendor
    independent solution. It is OK if it
    covers MySQL, Oracle, MS SQL Server.
    But if there is really no database
    vendor independent solution then I
    will settle for different solutions
    for different database vendors.

  • I cannot use any other external tool
    to do this. I am mainly looking for a
    SQL based solution.

  • I cannot normalize my database design
    any further. It is already in 3NF and moreover a
    lot of code has already been written
    around it.


半山落雨半山空 2024-11-15 14:43:03


Simple answer:

  • Database vendor independent solution = use the standard = COUNT(*)
  • Approximate SQL Server solutions not using COUNT(*) = out of scope

Notes:

COUNT(1) = COUNT(*) = COUNT(PrimaryKey) just in case

Edit:

SQL Server example (1.4 billion rows, 12 columns)

SELECT COUNT(*) FROM MyBigtable WITH (NOLOCK)
-- NOLOCK here is for me only to let me test for this answer: no more, no less

1 run, 5:46 minutes, count = 1,401,659,700

--Note, sp_spaceused uses this DMV
SELECT
   Total_Rows= SUM(st.row_count)
FROM
   sys.dm_db_partition_stats st
WHERE
    object_name(object_id) = 'MyBigtable' AND (index_id < 2)

2 runs, both under 1 second, count = 1,401,659,670

The second one returned fewer rows = wrong. It would be the same or more, depending on writes (deletes are done out of hours here)

旧时浪漫 2024-11-15 14:43:03


The fastest way by far on MySQL is:

SHOW TABLE STATUS;

You will instantly get all your tables with the row count (which is the total) along with plenty of extra information if you want.

伴梦长久 2024-11-15 14:43:03


I got this script from another StackOverflow question/answer:

SELECT SUM(p.rows) FROM sys.partitions AS p
  INNER JOIN sys.tables AS t
  ON p.[object_id] = t.[object_id]
  INNER JOIN sys.schemas AS s
  ON s.[schema_id] = t.[schema_id]
  WHERE t.name = N'YourTableNameHere'
  AND s.name = N'dbo'
  AND p.index_id IN (0,1);

My table has 500 million records and the above returns in less than 1ms.
Meanwhile,

SELECT COUNT(id) FROM MyTable

takes a full 39 minutes, 52 seconds!

They yield the exact same number of rows (in my case, exactly 519326012).

I do not know if that would always be the case.

生来就爱笑 2024-11-15 14:43:03


I found this good article, SQL Server – HOW TO: quickly retrieve accurate row count for table, from martijnh1, which gives a good recap for each scenario.

I need this to be expanded where I need to provide a count based on a specific condition; when I figure that part out, I'll update this answer further.

In the meantime, here are the details from the article:

Method 1:

Query:

SELECT COUNT(*) FROM Transactions 

Comments:

Performs a full table scan. Slow on large tables.

Method 2:

Query:

SELECT CONVERT(bigint, rows) 
FROM sysindexes 
WHERE id = OBJECT_ID('Transactions') 
AND indid < 2 

Comments:

Fast way to retrieve row count. Depends on statistics and is inaccurate.

Run DBCC UPDATEUSAGE(Database) WITH COUNT_ROWS, which can take significant time for large tables.

Method 3:

Query:

SELECT CAST(p.rows AS float) 
FROM sys.tables AS tbl 
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and
idx.index_id < 2 
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int) 
AND p.index_id=idx.index_id 
WHERE ((tbl.name=N'Transactions' 
AND SCHEMA_NAME(tbl.schema_id)='dbo')) 

Comments:

The way SQL Server Management Studio counts rows (look at table properties, storage, row count). Very fast, but still an approximate number of rows.

Method 4:

Query:

SELECT SUM (row_count) 
FROM sys.dm_db_partition_stats 
WHERE object_id=OBJECT_ID('Transactions')    
AND (index_id=0 or index_id=1); 

Comments:

A quick operation (although not as fast as method 2) and, equally important, reliable.

审判长 2024-11-15 14:43:03


You can try this sp_spaceused (Transact-SQL)

Displays the number of rows, disk
space reserved, and disk space used by
a table, indexed view, or Service
Broker queue in the current database,
or displays the disk space reserved
and used by the whole database.

淡笑忘祈一世凡恋 2024-11-15 14:43:03


I have come across articles that state that SELECT COUNT(*) FROM TABLE_NAME will be slow when the table has lots of rows and lots of columns.

That depends on the database. Some speed up counts, for instance by keeping track of whether rows are live or dead in the index, allowing for an index only scan to extract the number of rows. Others do not, and consequently require visiting the whole table and counting live rows one by one. Either will be slow for a huge table.

Note that you can generally extract a good estimate by using query optimization tools, table statistics, etc. In the case of PostgreSQL, for instance, you could parse the output of explain count(*) from yourtable and get a reasonably good estimate of the number of rows. Which brings me to your second question.

I have a table that might contain even billions of rows [it has approximately 15 columns]. Is there a better way to get the EXACT count of the number of rows of a table?

Seriously? :-) You really mean the exact count from a table with billions of rows? Are you really sure? :-)

If you really do, you could keep a trace of the total using triggers, but mind concurrency and deadlocks if you do.

蒲公英的约定 2024-11-15 14:43:03


Is there a better way to get the EXACT count of the number of rows of a table?

To answer your question simply, No.

If you need a DBMS independent way of doing this, the fastest way will always be:

SELECT COUNT(*) FROM TableName

Some DBMS vendors may have quicker ways which will work for their systems only. Some of these options are already posted in other answers.

COUNT(*) should be optimized by the DBMS (at least any PROD worthy DB) anyway, so don't try to bypass their optimizations.

On a side note:
I am sure many of your other queries also take a long time to finish because of your table size. Any performance concerns should probably be addressed by thinking about your schema design with speed in mind. I realize you said that it is not an option to change but it might turn out that 10+ minute queries aren't an option either. 3rd NF is not always the best approach when you need speed, and sometimes data can be partitioned in several tables if the records don't have to be stored together. Something to think about...

浪菊怪哟 2024-11-15 14:43:03


If SQL Server edition is 2005/2008, you can use DMVs to calculate the row count in a table:

-- Shows all user tables and row counts for the current database 
-- Remove is_ms_shipped = 0 check to include system objects 
-- i.index_id < 2 indicates clustered index (1) or hash table (0) 
SELECT o.name, 
 ddps.row_count 
FROM sys.indexes AS i 
 INNER JOIN sys.objects AS o ON i.OBJECT_ID = o.OBJECT_ID 
 INNER JOIN sys.dm_db_partition_stats AS ddps ON i.OBJECT_ID = ddps.OBJECT_ID 
 AND i.index_id = ddps.index_id 
WHERE i.index_id < 2 
 AND o.is_ms_shipped = 0 
ORDER BY o.NAME 

For the SQL Server 2000 database engine, sysindexes will work, but it is strongly advised to avoid relying on it, as it may be removed in a future version of SQL Server.

Sample code taken from: How To Get Table Row Counts Quickly And Painlessly

只有一腔孤勇 2024-11-15 14:43:03


I use

select /*+ parallel(a) */  count(1) from table_name a;
半寸时光 2024-11-15 14:43:03


I'm nowhere near as expert as others who have answered, but I was having an issue with a procedure I was using to select a random row from a table (not overly relevant), and I needed to know the number of rows in my reference table to calculate a random index. Using the traditional COUNT(*) or COUNT(1) worked, but my query occasionally took up to 2 seconds to run. So instead (for my table named 'tbl_HighOrder') I am using:

Declare @max int

Select @max = Row_Count
From sys.dm_db_partition_stats
Where Object_Name(Object_Id) = 'tbl_HighOrder'

It works great and query times in Management Studio are zero.

情绪操控生活 2024-11-15 14:43:03


Well, late by 5 years, and unsure if this helps:

I was trying to count the number of rows in a SQL Server table using MS SQL Server Management Studio and ran into an overflow error, so I used the following:

select count_big(1) FROM [dbname].[dbo].[FactSampleValue];

The result :

24296650578 rows

掌心的温暖 2024-11-15 14:43:03


Not exactly a DBMS-agnostic solution, but at least your client code won't see the difference...

Create another table T with just one row and one integer field N [1], and create an INSERT TRIGGER that just executes:

UPDATE T SET N = N + 1

Also create a DELETE TRIGGER that executes:

UPDATE T SET N = N - 1

A DBMS worth its salt will guarantee the atomicity of the operations above [2], and N will contain the accurate count of rows at all times, which is then super-quick to get by simply:

SELECT N FROM T

While triggers are DBMS-specific, selecting from T isn't and your client code won't need to change for each supported DBMS.

However, this can have some scalability issues if the table is INSERT or DELETE-intensive, especially if you don't COMMIT immediately after INSERT/DELETE.


[1] These names are just placeholders - use something more meaningful in production.

[2] I.e. N cannot be changed by a concurrent transaction between reading and writing to N, as long as both reading and writing are done in a single SQL statement.
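As a runnable sketch of this scheme (using SQLite through Python as a stand-in for whichever DBMS you use; the table, column, and trigger names are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE t (n INTEGER NOT NULL);          -- single-row counter table
INSERT INTO t (n) VALUES (0);

-- Keep N in sync on every insert and delete.
CREATE TRIGGER trg_ins AFTER INSERT ON big_table
BEGIN UPDATE t SET n = n + 1; END;
CREATE TRIGGER trg_del AFTER DELETE ON big_table
BEGIN UPDATE t SET n = n - 1; END;
""")

conn.executemany("INSERT INTO big_table (payload) VALUES (?)",
                 [(f"row {i}",) for i in range(1000)])
conn.execute("DELETE FROM big_table WHERE id <= 100")

# O(1) read of the maintained count, versus a full scan with COUNT(*).
(n,) = conn.execute("SELECT n FROM t").fetchone()
print(n)  # 900
```

The trigger syntax differs per DBMS, but the client-side read (SELECT N FROM T) stays the same everywhere.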

雨后彩虹 2024-11-15 14:43:03


I don't think there is a general always-fastest solution: some RDBMS/versions have a specific optimization for SELECT COUNT(*) that use faster options while others simply table-scan. You'd need to go to the documentation/support sites for the second set, which will probably need some more specific query to be written, usually one that hits an index in some way.

EDIT:

Here's a thought that might work, depending on your schema and distribution of data: do you have an indexed column that references an increasing value, a numeric increasing ID, say, or even a timestamp or date? Then, assuming deletes don't happen, it should be possible to store the count up to some recent value (yesterday's date, highest ID value at some recent sample point) and add the count beyond that, which should resolve very quickly in the index. Very dependent on values and indices, of course, but applicable to pretty much any version of any DBMS.
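A minimal sketch of that checkpoint idea, assuming an append-only table with an increasing integer key (SQLite through Python; the table name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("old",)] * 10_000)

# Checkpoint taken at some earlier time: one expensive full count,
# plus the highest id seen at that moment.
(checkpoint_count,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()
(checkpoint_id,) = conn.execute("SELECT MAX(id) FROM events").fetchone()

# New rows arrive after the checkpoint.
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("new",)] * 250)

# Cheap exact count: stored checkpoint + an index-ranged count of newer rows.
(delta,) = conn.execute("SELECT COUNT(*) FROM events WHERE id > ?",
                        (checkpoint_id,)).fetchone()
total = checkpoint_count + delta
print(total)  # 10250
```

The ranged count only touches the index entries beyond the checkpoint, which is what makes it fast; deletes would break the assumption, as noted above.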

烟若柳尘 2024-11-15 14:43:03


I am late to this question, but here is what you can do with MySQL (as I use MySQL). I am sharing my observations here:

1) SELECT COUNT(*) AS TOTAL_ROWS FROM <TABLE_NAME>

Result
Row Count: 508534
Console output: Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 0.125 sec.
It takes a while for a table with a large number of rows, but the row count is very exact.

2) SHOW TABLE STATUS or SHOW TABLE STATUS WHERE NAME="<TABLE_NAME>"

Result
Row count: 511235
Console output: Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 0.250 sec
Summary: Row count is not exact.

3) SELECT * FROM information_schema.tables WHERE table_schema = DATABASE();

Result
Row count: 507806
Console output: Affected rows: 0 Found rows: 48 Warnings: 0 Duration for 1 query: 1.701 sec.
Row count is not exact.

I am not a MySQL or database expert, but I have found that for very large tables, you can use option 2 or 3 and get a 'fair idea' of how many rows are present.

I needed to get these row counts for displaying some stats on the UI. With the above queries, I knew that the total rows were more than 500,000, so I came up with showing stats like "More than 500,000 rows" without showing the exact number of rows.

Maybe I have not really answered the OP's question, but I am sharing what I did in a situation where such statistics were needed. In my case, showing the approximate rows was acceptable and so the above worked for me.

冰火雁神 2024-11-15 14:43:03


In a very large table for me,

SELECT COUNT(1) FROM TableLarge 

takes 37 seconds whereas

SELECT COUNT_BIG(1) FROM TableLarge

takes 4 seconds.

零度° 2024-11-15 14:43:03


A literally insane answer, but if you have some kind of replication system set up (for a system with a billion rows, I hope you do), you can use a rough-estimator (like MAX(pk)), divide that value by the number of slaves you have, run several queries in parallel.

For the most part, you'd partition the queries across slaves based on the best key (or the primary key I guess), in such a way (we're going to use 250000000 as our Rows / Slaves):

-- First slave
SELECT COUNT(pk) FROM t WHERE pk < 250000000
-- Ith slave where 2 <= I <= N - 1
SELECT COUNT(pk) FROM t WHERE pk >= I*250000000 and pk < (I+1)*250000000
-- Last slave
SELECT COUNT(pk) FROM t WHERE pk > (N-1)*250000000

But you need SQL only. What a bust. Ok, so let's say you're a sadomasochist.
On the master (or closest slave) you'd most likely need to create a table for this:

CREATE TABLE counter_table (minpk integer, maxpk integer, cnt integer, slaveid integer)

So instead of only having the selects running in your slaves, you'd have to do an insert, akin to this:

INSERT INTO counter_table VALUES (I*250000000, (I+1)*250000000, (SELECT COUNT(pk) FROM ... ), @@SLAVE_ID)

You may run into issues with slaves writing to a table on master. You may need to get even more sadis- I mean, creative:

-- A table per slave!
INSERT INTO counter_table_slave_I VALUES (...)

You should in the end have a slave that exists last in the path traversed by the replication graph, relative to the first slave. That slave should now have all other counter values, and should have its own values. But by the time you've finished, there probably are rows added, so you'd have to insert another one compensating for the recorded max pk in your counter_table and the current max pk.

At that point, you'd have to do an aggregate function to figure out what the total rows are, but that's easier since you'd be running it on at most the "number of slaves you have and change" rows.

If you're in the situation where you have separate tables in the slaves, you can UNION to get all the rows you need.

SELECT SUM(cnt) FROM (
    SELECT * FROM counter_table_slave_1
      UNION
    SELECT * FROM counter_table_slave_2
      UNION
    ...
  )

Or you know, be a bit less insane and migrate your data to a distributed processing system, or maybe use a Data Warehousing solution (which will give you awesome data crunching in the future too).

Do note, this does depend on how well your replication is set up. Since the primary bottleneck will most likely be persistent storage, if you have cruddy storage or poorly segregated data stores with heavy neighbor noise, this will probably run you slower than just waiting for a single SELECT COUNT(*) ...

But if you have good replication, then your speed gains should be directly related to the number of slaves. In fact, if it takes 10 minutes to run the counting query alone, and you have 8 slaves, you'd cut your time to less than a couple of minutes. Maybe an hour to iron out the details of this solution.

Of course, you'd never really get an amazingly accurate answer since this distributed solving introduces a bit of time where rows can be deleted and inserted, but you can try to get a distributed lock of rows at the same instance and get a precise count of the rows in the table for a particular moment in time.

Actually, this seems impossible, since you're basically stuck with an SQL-only solution, and I don't think you're provided a mechanism to run a sharded and locked query across multiple slaves, instantly. Maybe if you had control of the replication log file... which means you'd literally be spinning up slaves for this purpose, which is no doubt slower than just running the count query on a single machine anyway.

So there's my two 2013 pennies.
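Setting the SQL-only constraint aside for a moment, the ranged partial counts can be sketched on one machine, with a single SQLite database standing in for all the slaves (toy table and sizes, everything here is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t (pk) VALUES (?)", [(i,) for i in range(1, 1001)])

n_slaves = 4
(max_pk,) = conn.execute("SELECT MAX(pk) FROM t").fetchone()
rows_per_slave = max_pk // n_slaves + 1  # rough estimator, as in the answer

# Each "slave" counts only its own pk range; the ranges tile [0, max_pk].
partials = []
for i in range(n_slaves):
    lo, hi = i * rows_per_slave, (i + 1) * rows_per_slave
    (c,) = conn.execute(
        "SELECT COUNT(pk) FROM t WHERE pk >= ? AND pk < ?", (lo, hi)
    ).fetchone()
    partials.append(c)

total = sum(partials)  # aggregate step, like SELECT SUM(cnt) FROM counter_table
print(total)  # 1000
```

In a real deployment each partial count would run on a different replica in parallel; the sum is exact only if no rows move between ranges while the partials run.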

花辞树 2024-11-15 14:43:03


If an insert trigger is too expensive to use, but a delete trigger can be afforded, and there is an auto-increment id, then count the entire table once, remembering the count as last-count and the last-counted-id.

Then each day you only need to count for id > last-counted-id, add that to last-count, and store the new last-counted-id.

The delete trigger would decrement last-count if the id of the deleted record is <= last-counted-id.

断肠人 2024-11-15 14:43:03


If you have a typical table structure with an auto-incrementing primary key column in which rows are never deleted, the following will be the fastest way to determine the record count and should work similarly across most ANSI compliant databases:

SELECT TOP(1) <primarykeyfield> FROM <table> ORDER BY <primarykeyfield> DESC;

I work with MS SQL tables containing billions of rows that require sub-second response times for data, including record counts. A similar SELECT COUNT(*) would take minutes to process by comparison.
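A sketch of the idea and its main caveat, using SQLite through Python (the table is hypothetical): the top key equals the row count only while nothing has ever been deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY AUTOINCREMENT, msg TEXT)")
conn.executemany("INSERT INTO log (msg) VALUES (?)", [("x",)] * 5000)

# Seek to the last key in the PK index: a single index seek, no scan.
(top_id,) = conn.execute("SELECT id FROM log ORDER BY id DESC LIMIT 1").fetchone()
(exact,) = conn.execute("SELECT COUNT(*) FROM log").fetchone()
print(top_id, exact)  # 5000 5000 -- equal only because nothing was ever deleted

# A single delete (or a rolled-back insert) breaks the equivalence:
conn.execute("DELETE FROM log WHERE id = 1")
(exact_after,) = conn.execute("SELECT COUNT(*) FROM log").fetchone()
print(top_id == exact_after)  # False
```

Note that gaps can also appear without deletes (e.g. failed or rolled-back inserts that consumed an id), so this only counts rows exactly under fairly strict assumptions.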

瘫痪情歌 2024-11-15 14:43:03


For SQL Server, try this:

SELECT T.name, 
       I.rows AS [ROWCOUNT] 
FROM   sys.tables AS T 
       INNER JOIN sys.sysindexes AS I 
               ON T.object_id = I.id AND I.indid < 2 
WHERE T.name = 'Your_Table_Name'
ORDER  BY I.rows DESC 
回忆躺在深渊里 2024-11-15 14:43:03
select rows from sysindexes
where id = Object_ID('TableName') and indid <2
翻身的咸鱼 2024-11-15 14:43:03


Use COUNT_BIG() for fetching the count of records in a very large sized file.

SELECT COUNT_BIG(*) FROM TABLENAME;
ぃ双果 2024-11-15 14:43:03


Put an index on some column. That should allow the optimizer to perform a full scan of the index blocks, instead of a full scan of the table. That will cut your IO costs way down. Look at the execution plan before and after. Then measure wall clock time both ways.

森林散布 2024-11-15 14:43:03


If you are using Oracle, how about this (assuming the table stats are updated):

select <TABLE_NAME>, num_rows, last_analyzed from user_tables

last_analyzed will show the time when stats were last gathered.

半夏半凉 2024-11-15 14:43:03


With PostgreSQL:

SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'table_name'
深海夜未眠 2024-11-15 14:43:03


With SQL Server 2019, you can use APPROX_COUNT_DISTINCT, which:

returns the approximate number of unique non-null values in a group

and from the docs:

APPROX_COUNT_DISTINCT is designed for use in big data scenarios and is
optimized for the following conditions:

  • Access of data sets that are millions of rows or higher and
  • Aggregation of a column or columns that have many distinct values

Also, the function

  • implementation guarantees up to a 2% error rate within a 97% probability
  • requires less memory than an exhaustive COUNT DISTINCT operation
  • given the smaller memory footprint, is less likely to spill memory to disk compared to a precise COUNT DISTINCT operation.

The algorithm behind the implementation is HyperLogLog.
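As a rough illustration of the idea only (a teaching sketch in Python, not SQL Server's actual implementation), a toy HyperLogLog estimates distinct counts from per-bucket maxima of leading-zero ranks:

```python
import hashlib
import math

def hll_estimate(values, p=12):
    """Toy HyperLogLog: 2**p registers, each keeping the max leading-zero rank."""
    m = 1 << p
    reg = [0] * m
    for v in values:
        # 64-bit hash of the value.
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                        # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64-p bits
        rank = (64 - p) - rest.bit_length() + 1    # leading zeros + 1
        reg[idx] = max(reg[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)               # bias-correction constant
    est = alpha * m * m / sum(2.0 ** -r for r in reg)
    # Small-range correction (linear counting) when many registers are empty.
    zeros = reg.count(0)
    if est <= 2.5 * m and zeros:
        est = m * math.log(m / zeros)
    return est

# 50,000 distinct values estimated from only 2**12 small registers.
approx = hll_estimate(range(50_000))
print(round(approx))
```

With p=12 the expected relative error is roughly 1.04/sqrt(4096), i.e. under 2%, which matches the error bound quoted in the docs above.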

厌倦 2024-11-15 14:43:03


In SQL server 2016, I can just check table properties and then select 'Storage' tab - this gives me row count, disk space used by the table, index space used etc.

無處可尋 2024-11-15 14:43:03


Maybe a bit late but this might help others for MSSQL

;WITH RecordCount AS (
    SELECT ROW_NUMBER() OVER (ORDER BY COLUMN_NAME) AS [RowNumber]
    FROM TABLE_NAME
)
SELECT MAX(RowNumber)
FROM RecordCount