I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
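The exact count is the plain aggregate (schema and table names are placeholders):

```sql
SELECT count(*) AS exact_count FROM myschema.mytable;
```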
Estimate
Extremely fast:
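A minimal sketch of such an estimate, reading the planner's statistics from the catalog (table name is a placeholder):

```sql
SELECT reltuples AS estimate FROM pg_class WHERE relname = 'mytable';
```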
Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.

Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
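A sketch that qualifies the table with its schema by joining to pg_namespace (names are placeholders):

```sql
SELECT c.reltuples::bigint AS estimate
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  c.relname = 'mytable'
AND    n.nspname = 'myschema';
```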
The cast to bigint formats the real number nicely, especially for big counts.

Better estimate
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
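Resolving the (optionally schema-qualified) name with a regclass cast might look like:

```sql
SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'myschema.mytable'::regclass;
```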
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:

Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14. You have to add up estimates for all partitions instead.
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:

Minimal form
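A minimal sketch that scales reltuples by the current physical table size, assuming the default block size of 8192 (names are placeholders):

```sql
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS estimate
FROM   pg_class
WHERE  oid = 'myschema.mytable'::regclass;
```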
Safe and explicit
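A sketch of the safe variant, spelling out the corner cases (never vacuumed, empty table) and reading the block size from the server instead of hard-coding it:

```sql
SELECT (CASE WHEN c.reltuples < 0 THEN NULL       -- never vacuumed
             WHEN c.relpages = 0 THEN float8 '0'  -- empty table
             ELSE c.reltuples / c.relpages END
        * (pg_catalog.pg_relation_size(c.oid)
           / pg_catalog.current_setting('block_size')::int)
       )::bigint AS estimate
FROM   pg_class c
WHERE  c.oid = 'myschema.mytable'::regclass;
```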
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)

If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)

Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.

Table and schema qualifications make it immune to any search_path and scope.

Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
TABLESAMPLE SYSTEM (n) in Postgres 9.5+

Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:

No autovacuum running.
Shortly after a big INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
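A sketch of such a sampled count with a 1 % sample (table name is a placeholder):

```sql
SELECT 100 * count(*) AS estimate
FROM   mytable TABLESAMPLE SYSTEM (1);
```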
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors: dead tuples and space reserved by FILLFACTOR occupy space per block, and if that is unevenly distributed across the table, the estimate may be off.

Typically, the estimate from pg_class will be faster and more accurate.

Answer to actual question
And whether it ...
Yes. You can use a subquery with LIMIT:

Postgres actually stops counting beyond the given limit; you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
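With the table and limit from the question, such a subquery might look like:

```sql
SELECT count(*) AS row_count
FROM   (SELECT 1 FROM token LIMIT 500000) sub;
```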
I did this once in a postgres app by running:
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
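A sketch of the idea; the table name is a placeholder and the cost numbers shown are made up:

```sql
EXPLAIN SELECT * FROM tablename;
-- First line of output, roughly:
-- Seq Scan on tablename  (cost=0.00..458.32 rows=23412 width=175)
```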
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).

Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate that it was completely worthless, even for knowing to within a power of 100 how many rows we'd have returned, so we had to abandon that strategy.

But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
Reference taken from this Blog.
You can use the queries below to find the row count.
Using pg_class:
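A sketch of the pg_class estimate (table name is a placeholder):

```sql
SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  relname = 'tablename';
```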
Using pg_stat_user_tables:
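A sketch using the statistics view, whose n_live_tup column holds the estimated number of live rows (table name is a placeholder):

```sql
SELECT schemaname, relname, n_live_tup
FROM   pg_stat_user_tables
WHERE  relname = 'tablename';
```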
This is the function (not mine) that implements @Jonathan Hall's answer:
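The widely circulated version of this function runs EXPLAIN on the given query and pulls the planner's row estimate out of the first plan line with a regex; a sketch:

```sql
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
  rec  record;
  rows integer;
BEGIN
  -- Walk the EXPLAIN output until a line containing "rows=" is found
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN rows IS NOT NULL;
  END LOOP;
  RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
```

Usage would be along the lines of SELECT count_estimate('SELECT 1 FROM token');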
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column.
Again, this is to decrease the workload (scan through a narrow column index)
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQLs as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is.
For SQL Server (2005 or above) a quick and reliable method is:
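A sketch of such a query, summing the per-partition row counts of the heap or clustered index (table name is a placeholder):

```sql
SELECT SUM(row_count) AS row_count
FROM   sys.dm_db_partition_stats
WHERE  object_id = OBJECT_ID('MyTable')
AND    index_id < 2;
```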
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds rows from all parts of a (possibly) partitioned table.
index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)
Even faster (but unreliable) methods are detailed here.