I have come across articles that state that SELECT COUNT(*) FROM TABLE_NAME
will be slow when the table has lots of rows and lots of columns.
I have a table that might contain billions of rows [it has approximately 15 columns]. Is there a better way to get the EXACT count of the number of rows of a table?
Please consider the following before your answer:
- I am looking for a database-vendor-independent solution. It is OK if it covers MySQL, Oracle, and MS SQL Server. But if there is really no database-vendor-independent solution, then I will settle for different solutions for different database vendors.
- I cannot use any other external tool to do this. I am mainly looking for a SQL-based solution.
- I cannot normalize my database design any further. It is already in 3NF, and moreover a lot of code has already been written around it.
Simple answer:
COUNT(*)
Notes:
COUNT(1) = COUNT(*) = COUNT(PrimaryKey) just in case
Edit:
SQL Server example (1.4 billion rows, 12 columns)
First run: 5:46 minutes, count = 1,401,659,700
Two further runs: both under 1 second, count = 1,401,659,670
The second count has fewer rows = wrong. Depending on writes, it would be the same or more (deletes here are done out of hours)
The fastest way by far on MySQL is:
You will instantly get all your tables with the row count (which is the total) along with plenty of extra information if you want.
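The command itself is missing from this copy of the answer. Judging by the description (all tables at once, row counts, plenty of extra information), it is presumably MySQL's:

```sql
SHOW TABLE STATUS;
```

Note that for InnoDB tables the `Rows` column is only an estimate, not an exact count.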
I got this script from another StackOverflow question/answer:
My table has 500 million records and the above returns in less than 1ms.
Meanwhile,
takes a full 39 minutes, 52 seconds!
They yield the exact same number of rows (in my case, exactly 519326012).
I do not know if that would always be the case.
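The script referenced above was lost in this copy. A commonly cited SQL Server query matching the description (sub-millisecond, metadata-based) is the following sketch; the table name is a placeholder:

```sql
-- Sum row counts across the table's heap/clustered-index partitions
SELECT SUM(p.row_count) AS row_count
FROM   sys.dm_db_partition_stats AS p
WHERE  p.object_id = OBJECT_ID('dbo.MyBigTable')
  AND  p.index_id < 2;
```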
I found this good article, SQL Server – HOW-TO: quickly retrieve accurate row count for table, from
martijnh1
which gives a good recap for each scenario. I need this expanded to where I can provide a count based on a specific condition, and when I figure that part out, I'll update this answer further.
In the meantime, here are the details from the article:
Method 1:
Query:
Comments:
Performs a full table scan. Slow on large tables.
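The query was stripped from this copy; per the article it is the plain count ('Transactions' is the article's example table):

```sql
SELECT COUNT(*) FROM Transactions;
```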
Method 2:
Query:
Comments:
Fast way to retrieve row count. Depends on statistics and is inaccurate.
Run DBCC UPDATEUSAGE(Database) WITH COUNT_ROWS, which can take significant time for large tables.
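The query was stripped from this copy; per the article, method 2 reads the row count from the legacy sysindexes view ('Transactions' is the example table):

```sql
SELECT CONVERT(bigint, rows)
FROM   sysindexes
WHERE  id = OBJECT_ID('Transactions')
  AND  indid < 2;
```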
Method 3:
Query:
Comments:
The way the SQL management studio counts rows (look at table properties, storage, row count). Very fast, but still an approximate number of rows.
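The query was stripped from this copy; per the article, method 3 joins the catalog views the way Management Studio does ('Transactions' is the example table):

```sql
SELECT CAST(p.rows AS float)
FROM   sys.tables AS tbl
INNER  JOIN sys.indexes AS idx
       ON idx.object_id = tbl.object_id AND idx.index_id < 2
INNER  JOIN sys.partitions AS p
       ON p.object_id = tbl.object_id AND p.index_id = idx.index_id
WHERE  tbl.name = N'Transactions'
  AND  SCHEMA_NAME(tbl.schema_id) = 'dbo';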
Method 4:
Query:
Comments:
Quick (although not as fast as method 2) operation and equally important, reliable.
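The query was stripped from this copy; per the article, method 4 sums partition row counts from the DMV ('Transactions' is the example table):

```sql
SELECT SUM(row_count)
FROM   sys.dm_db_partition_stats
WHERE  object_id = OBJECT_ID('Transactions')
  AND  (index_id = 0 OR index_id = 1);
```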
You can try sp_spaceused (Transact-SQL):
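sp_spaceused reports the row count (and space usage) for a table from allocation metadata, so it returns quickly even on huge tables; the table name below is a placeholder:

```sql
EXEC sp_spaceused N'dbo.YourTable';
```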
That depends on the database. Some speed up counts, for instance by keeping track of whether rows are live or dead in the index, allowing for an index only scan to extract the number of rows. Others do not, and consequently require visiting the whole table and counting live rows one by one. Either will be slow for a huge table.
Note that you can generally extract a good estimate by using query optimization tools, table statistics, etc. In the case of PostgreSQL, for instance, you could parse the output of
explain count(*) from yourtable
and get a reasonably good estimate of the number of rows. Which brings me to your second question. Seriously? :-) You really mean the exact count from a table with billions of rows? Are you really sure? :-)
If you really do, you could keep a trace of the total using triggers, but mind concurrency and deadlocks if you do.
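For PostgreSQL specifically, the estimate mentioned above can also be read straight from the planner statistics, without parsing EXPLAIN output (the table name is a placeholder):

```sql
-- reltuples is the planner's row-count estimate, refreshed by ANALYZE/VACUUM
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  relname = 'yourtable';
```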
To answer your question simply, No.
If you need a DBMS independent way of doing this, the fastest way will always be:
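The elided query here is presumably just the standard count:

```sql
SELECT COUNT(*) FROM your_table;
```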
Some DBMS vendors may have quicker ways which will work for their systems only. Some of these options are already posted in other answers.
COUNT(*)
should be optimized by the DBMS (at least any PROD-worthy DB) anyway, so don't try to bypass their optimizations. On a side note:
I am sure many of your other queries also take a long time to finish because of your table size. Any performance concerns should probably be addressed by thinking about your schema design with speed in mind. I realize you said that it is not an option to change but it might turn out that 10+ minute queries aren't an option either. 3rd NF is not always the best approach when you need speed, and sometimes data can be partitioned in several tables if the records don't have to be stored together. Something to think about...
If SQL Server edition is 2005/2008, you can use DMVs to calculate the row count in a table:
For the SQL Server 2000 database engine, sysindexes will work, but it is strongly advised to avoid using it in future editions of SQL Server, as it may be removed in the near future.
Sample code taken from: How To Get Table Row Counts Quickly And Painlessly
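The sample code did not survive extraction. Based on the two approaches the answer names, it was presumably along these lines:

```sql
-- SQL Server 2005/2008: DMV-based row counts for all tables
SELECT OBJECT_NAME(object_id) AS table_name,
       SUM(row_count)         AS row_count
FROM   sys.dm_db_partition_stats
WHERE  index_id < 2
GROUP  BY object_id;

-- SQL Server 2000: sysindexes (deprecated in later versions)
SELECT OBJECT_NAME(id) AS table_name, rows
FROM   sysindexes
WHERE  indid < 2;
```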
I use
I'm nowhere near as expert as others who have answered, but I was having an issue with a procedure I was using to select a random row from a table (not overly relevant), and I needed to know the number of rows in my reference table to calculate the random index. Using the traditional Count(*) or Count(1) works, but I was occasionally seeing up to 2 seconds for my query to run. So instead (for my table named 'tbl_HighOrder') I am using:
It works great and query times in Management Studio are zero.
Well, late by 5 years and unsure if it helps:
I was trying to count the number of rows in a SQL Server table using MS SQL Server Management Studio and ran into an overflow error, so I used the following:
select count_big(1) FROM [dbname].[dbo].[FactSampleValue];
The result :
24296650578 rows
Not exactly a DBMS-agnostic solution, but at least your client code won't see the difference...
Create another table T with just one row and one integer field N [1], and create an INSERT TRIGGER that just executes:
Also create a DELETE TRIGGER that executes:
A DBMS worth its salt will guarantee the atomicity of the operations above [2], and N will contain the accurate count of rows at all times, which is then super-quick to get by simply:
While triggers are DBMS-specific, selecting from T isn't and your client code won't need to change for each supported DBMS.
However, this can have some scalability issues if the table is INSERT or DELETE-intensive, especially if you don't COMMIT immediately after INSERT/DELETE.
[1] These names are just placeholders - use something more meaningful in production.
[2] I.e. N cannot be changed by a concurrent transaction between reading and writing to N, as long as both reading and writing are done in a single SQL statement.
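A minimal sketch of this scheme (T and N are the answer's placeholder names; trigger-creation syntax varies by DBMS, so only the trigger bodies are shown):

```sql
-- one-row counter table
CREATE TABLE T (N integer NOT NULL);
INSERT INTO T VALUES (0);

-- body of the INSERT trigger on the big table:
UPDATE T SET N = N + 1;

-- body of the DELETE trigger on the big table:
UPDATE T SET N = N - 1;

-- getting the count is then instant:
SELECT N FROM T;
```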
I don't think there is a general always-fastest solution: some RDBMS/versions have a specific optimization for
SELECT COUNT(*)
that use faster options while others simply table-scan. You'd need to go to the documentation/support sites for the second set, which will probably need some more specific query to be written, usually one that hits an index in some way.
EDIT:
Here's a thought that might work, depending on your schema and distribution of data: do you have an indexed column that references an increasing value, a numeric increasing ID, say, or even a timestamp or date? Then, assuming deletes don't happen, it should be possible to store the count up to some recent value (yesterday's date, highest ID value at some recent sample point) and add the count beyond that, which should resolve very quickly in the index. Very dependent on values and indices, of course, but applicable to pretty much any version of any DBMS.
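A sketch of this incremental idea, assuming a hypothetical snapshot table holding the result of the last full count (all names are illustrative):

```sql
-- row_count_snapshot holds (saved_count, saved_max_id) from the last full count
SELECT s.saved_count + d.new_rows AS total
FROM   row_count_snapshot s
CROSS  JOIN (SELECT COUNT(*) AS new_rows
             FROM   big_table
             WHERE  id > (SELECT saved_max_id FROM row_count_snapshot)) d;
```

The inner count resolves quickly because the `id > saved_max_id` predicate is a narrow index-range scan.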
I am late to this question, but here is what you can do with MySQL (as I use MySQL). I am sharing my observations here:
Result
Row Count: 508534
Console output: Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 0.125 sec.
It takes a while for a table with a large number of rows, but the row count is exact.
Result
Row count: 511235
Console output: Affected rows: 0 Found rows: 1 Warnings: 0 Duration for 1 query: 0.250 sec
Summary: Row count is not exact.
Result
Row count: 507806
Console output: Affected rows: 0 Found rows: 48 Warnings: 0 Duration for 1 query: 1.701 sec.
Row count is not exact.
I am not a MySQL or database expert, but I have found that for very large tables, you can use option 2 or 3 and get a 'fair idea' of how many rows are present.
I needed to get these row counts for displaying some stats on the UI. With the above queries, I knew that the total rows were more than 500,000, so I came up with showing stats like "More than 500,000 rows" without showing exact number of rows.
Maybe I have not really answered the OP's question, but I am sharing what I did in a situation where such statistics were needed. In my case, showing the approximate rows was acceptable and so the above worked for me.
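The three queries this answer timed were stripped from this copy. Based on the timings and console output (and "Found rows: 48" suggesting one row per table in the schema), they were presumably along these lines; the database and table names are placeholders:

```sql
-- Option 1: exact, but slow on big tables
SELECT COUNT(*) FROM your_table;

-- Option 2: fast estimate from the data dictionary
SELECT table_rows
FROM   information_schema.tables
WHERE  table_schema = 'your_db' AND table_name = 'your_table';

-- Option 3: estimates plus extra metadata for every table in the schema
SHOW TABLE STATUS FROM your_db;
```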
On a very large table of mine,
takes 37 seconds whereas
takes 4 seconds.
A literally insane answer, but if you have some kind of replication system set up (for a system with a billion rows, I hope you do), you can use a rough-estimator (like
MAX(pk)
), divide that value by the number of slaves you have, and run several queries in parallel. For the most part, you'd partition the queries across slaves based on the best key (or the primary key, I guess), in such a way (we're going to use 250000000 as our Rows / Slaves):
But you need SQL only. What a bust. Ok, so let's say you're a sadomasochist.
On the master (or closest slave) you'd most likely need to create a table for this:
So instead of only having the selects running in your slaves, you'd have to do an insert, akin to this:
You may run into issues with slaves writing to a table on master. You may need to get even more sadis- I mean, creative:
You should in the end have a slave that exists last in the path traversed by the replication graph, relative to the first slave. That slave should now have all other counter values, and should have its own values. But by the time you've finished, there probably are rows added, so you'd have to insert another one compensating for the recorded max pk in your counter_table and the current max pk.
At that point, you'd have to do an aggregate function to figure out what the total rows are, but that's easier since you'd be running it on at most the "number of slaves you have and change" rows.
If you're in the situation where you have separate tables in the slaves, you can
UNION
to get all the rows you need. Or you know, be a bit less insane and migrate your data to a distributed processing system, or maybe use a Data Warehousing solution (which will give you awesome data crunching in the future too).
Do note, this does depend on how well your replication is set up. Since the primary bottleneck will most likely be persistent storage, if you have cruddy storage or poorly segregated data stores with heavy neighbor noise, this will probably run you slower than just waiting for a single
SELECT COUNT(*) ...
But if you have good replication, then your speed gains should be directly related to the number of slaves. In fact, if it takes 10 minutes to run the counting query alone, and you have 8 slaves, you'd cut your time to less than a couple minutes. Maybe an hour to iron out the details of this solution.
Of course, you'd never really get an amazingly accurate answer since this distributed solving introduces a bit of time where rows can be deleted and inserted, but you can try to get a distributed lock of rows at the same instance and get a precise count of the rows in the table for a particular moment in time.
Actually, this seems impossible, since you're basically stuck with an SQL-only solution, and I don't think you're provided a mechanism to run a sharded and locked query across multiple slaves instantly. Maybe if you had control of the replication log file... which means you'd literally be spinning up slaves for this purpose, which is no doubt slower than just running the count query on a single machine anyway.
So there's my two 2013 pennies.
If an insert trigger is too expensive to use, but a delete trigger can be afforded, and there is an auto-increment id, then after counting the entire table once and remembering the count as last-count along with the last-counted-id, each day you only need to count the rows with id > last-counted-id, add that to last-count, and store the new last-counted-id.
The delete trigger would decrement last-count if the id of the deleted record is <= last-counted-id.
If you have a typical table structure with an auto-incrementing primary key column in which rows are never deleted, the following will be the fastest way to determine the record count and should work similarly across most ANSI compliant databases:
I work with MS SQL tables containing billions of rows that require sub-second response times for data, including record counts. A similar SELECT COUNT(*) would take minutes to process by comparison.
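The query was stripped from this copy, but given the stated assumptions (auto-incrementing key, no deletes ever), it is presumably of this form; note it is only valid when ids start at 1, increment by 1, and no rows are ever deleted or rolled back:

```sql
SELECT MAX(id) FROM your_table;
```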
For SQL Server, try this:
Use
COUNT_BIG()
for fetching the count of records in a very large file.
Put an index on some column. That should allow the optimizer to perform a full scan of the index blocks, instead of a full scan of the table. That will cut your IO costs way down. Look at the execution plan before and after. Then measure wall clock time both ways.
If you are using Oracle, how about this (assuming the table stats are updated):
last_analyzed will show the time when stats were last gathered.
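The query itself is missing from this copy; the Oracle statistics view described here would be queried like so (the table name is a placeholder, and must be upper-case to match the dictionary):

```sql
SELECT num_rows, last_analyzed
FROM   user_tables
WHERE  table_name = 'YOUR_TABLE';
```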
With PostgreSQL:
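The snippet was stripped from this copy; the usual PostgreSQL statistics-based approximation is (table name is a placeholder):

```sql
-- fast estimate from planner statistics, refreshed by ANALYZE/VACUUM
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  relname = 'your_table';
```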
With SQL Server 2019, you can use APPROX_COUNT_DISTINCT, which:
and from the docs:
Also, the function
The algorithm behind the implementation is HyperLogLog.
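Usage looks like the following; note that APPROX_COUNT_DISTINCT approximates the number of distinct values of a column, not the raw row count, so the column and table names here are illustrative:

```sql
SELECT APPROX_COUNT_DISTINCT(order_id) AS approx_distinct_orders
FROM   dbo.Orders;
```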
In SQL Server 2016, I can just check the table properties and then select the 'Storage' tab - this gives me the row count, disk space used by the table, index space used, etc.
Maybe a bit late, but this might help others with MSSQL: