Fast way to discover the row count of a table in PostgreSQL
I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text, count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact, as seems to be the case here.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
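Something along these lines, with myschema.mytable standing in for your table:

SELECT count(*) AS exact_count FROM myschema.mytable;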
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
Estimate
Extremely fast:
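For example, reading the planner's estimate straight from the catalog ('mytable' is a placeholder):

SELECT reltuples AS estimate FROM pg_class WHERE relname = 'mytable';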
Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
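A sketch that pins down the schema by joining to pg_namespace ('myschema' and 'mytable' are placeholders):

SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';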
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
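Using a regclass cast instead, something like:

SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;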
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
In Postgres 9.4+, replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') to get nothing instead of an exception for invalid table names.
Better estimate yet (for very little added cost)
This does not work for partitioned tables, because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14. You have to add up estimates for all partitions instead.
We can do what the Postgres planner does, as described in the Row Estimation Examples of the manual: the planner fetches the actual current number of pages for the table (a cheap operation, not requiring a table scan) and, if that differs from relpages, scales reltuples accordingly to arrive at a current row estimate.
Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
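A sketch of the scaling just described, assuming the default block size of 8192 bytes:

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;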
Safe and explicit
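A version that spells out the corner cases might look like this:

SELECT (CASE WHEN c.reltuples < 0 THEN NULL       -- never vacuumed
             WHEN c.relpages = 0 THEN float8 '0'  -- empty table
             ELSE c.reltuples / c.relpages END
        * (pg_catalog.pg_relation_size(c.oid)
           / pg_catalog.current_setting('block_size')::int)
       )::bigint AS estimate
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass;       -- schema-qualified table here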
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. See the manual on pg_class for details.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a big INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
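The sampled count could look like this, where the factor 100 extrapolates from the 1 % sample ('mytable' is a placeholder):

SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);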
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error - your pick. Accuracy depends on more factors: dead rows or a FILLFACTOR occupy space per block, and if that is unevenly distributed across the table, the estimate may be off.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
And whether it ...
Yes. You can use a subquery with LIMIT:
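With the token table and the threshold from the question, roughly:

SELECT count(*) AS row_count
FROM (SELECT 1 FROM token LIMIT 500000) t;  -- returns at most 500000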
Postgres actually stops counting beyond the given limit, so you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
I did this once in a postgres app by running:
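Presumably something like this, with foo standing in for the real query:

EXPLAIN SELECT * FROM foo;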
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
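(Illustrative values only - the numbers depend on your table:)

Seq Scan on foo  (cost=0.00..35.50 rows=2550 width=4)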
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate that it was completely worthless even to know within a power of 100 how many rows we'd get back, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
Reference taken from this Blog.
You can use the queries below to find the row count.
Using pg_class:
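Perhaps along these lines ('table_name' is a placeholder):

SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE relname = 'table_name';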
Using pg_stat_user_tables:
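And something like this for the statistics view (again, 'table_name' is a placeholder):

SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'table_name';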
This is the function (not mine) that implements @Jonathan Hall's answer:
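A sketch in the spirit of the count_estimate() function found on the PostgreSQL wiki - it runs EXPLAIN on the given query and parses the planner's rows= figure out of the plan:

CREATE OR REPLACE FUNCTION count_estimate(query text)
  RETURNS integer
  LANGUAGE plpgsql AS
$func$
DECLARE
  rec  record;
  rows integer;
BEGIN
  -- EXPLAIN the query and scan the plan lines
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    -- take the row estimate from the first plan line that carries one
    rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN rows IS NOT NULL;
  END LOOP;
  RETURN rows;
END
$func$;

Usage would be along the lines of: SELECT count_estimate('SELECT 1 FROM token');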
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column (see the sketch below).
Again, this is to decrease the workload (scanning a narrow column index).
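A hypothetical sketch of the second option (assumes Postgres 12+ for generated columns; hashtextextended() is one built-in hash that fits):

-- store a compact hash of the wide text column and group by that instead
ALTER TABLE token
  ADD COLUMN text_hash bigint
  GENERATED ALWAYS AS (hashtextextended(text, 0)) STORED;

CREATE INDEX ON token (text_hash);

SELECT text_hash, count(*)
FROM token
GROUP BY text_hash;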
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQLs as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
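A sketch of that, reusing the token table from the question:

SELECT count(*) AS capped_count
FROM (SELECT 1 FROM token WHERE rownum <= 500001);  -- 500001 means "more than 500000"

If the result is 500001, the real count exceeds your constant; otherwise it is exact.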
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is.
For SQL Server (2005 or above) a quick and reliable method is:
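Something like this, with MyTable as a placeholder:

SELECT SUM(row_count) AS total_rows
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('MyTable')
  AND index_id IN (0, 1);  -- 0 = heap, 1 = clustered index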
Details about sys.dm_db_partition_stats are explained in MSDN.
The query adds rows from all parts of a (possibly) partitioned table.
index_id = 0 is an unordered table (heap) and index_id = 1 is an ordered table (clustered index).
Even faster (but unreliable) methods are detailed here.