Fast way to discover the row count of a table in PostgreSQL

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.

I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.

Is it possible to stop counting as soon as my constant value is surpassed?

I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.

Something like this:

SELECT text,count(*), percentual_calculus()  
FROM token  
GROUP BY text  
ORDER BY count DESC;

居里长安 2025-02-01 22:40:44

Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.

(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)

Exact count

Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.

SELECT count(*) AS exact_count FROM myschema.mytable;

Estimate

Extremely fast:

SELECT reltuples::bigint AS estimate FROM pg_class WHERE relname = 'mytable';

Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.
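
If the estimate looks stale, refreshing the statistics manually is typically much cheaper than a full count(*); for example:

ANALYZE myschema.mytable;  -- updates reltuples (and relpages) in pg_class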

Safer estimate

The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:

SELECT c.reltuples::bigint AS estimate
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  c.relname = 'mytable'
AND    n.nspname = 'myschema';

The cast to bigint formats the real number nicely, especially for big counts.

Better estimate

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'myschema.mytable'::regclass;

Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.

Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names.
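
For example, the NULL-safe variant of the query above; with an invalid name, to_regclass() yields NULL and the query simply returns no row instead of raising an error:

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = to_regclass('myschema.mytable');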

Better estimate yet (for very little added cost)

This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
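
A simplified sketch summing the stored per-partition estimates via pg_inherits - assuming a single level of partitioning; the table name is a placeholder, and the page-scaling refinement below could be applied per partition instead:

SELECT sum(c.reltuples)::bigint AS estimate
FROM   pg_inherits i
JOIN   pg_class c ON c.oid = i.inhrelid
WHERE  i.inhparent = 'myschema.mytable'::regclass;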

We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:

These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.

Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:

Minimal form

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM   pg_class
WHERE  oid = 'mytable'::regclass;  -- your table here

Safe and explicit

SELECT (CASE WHEN c.reltuples < 0 THEN NULL       -- never vacuumed
             WHEN c.relpages = 0 THEN float8 '0'  -- empty table
             ELSE c.reltuples / c.relpages END
     * (pg_catalog.pg_relation_size(c.oid)
      / pg_catalog.current_setting('block_size')::int)
       )::bigint
FROM   pg_catalog.pg_class c
WHERE  c.oid = 'myschema.mytable'::regclass;      -- schema-qualified table here

Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:

If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.

If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)

If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)

Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
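
To check the block size of your own installation:

SHOW block_size;  -- '8192' on a default build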

Table and schema qualifications make it immune to any search_path and scope.

Either way, the query consistently takes < 0.1 ms for me.

TABLESAMPLE SYSTEM (n) in Postgres 9.5+

SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);

Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:

  • No autovacuum running.
  • Immediately after a large INSERT / UPDATE / DELETE.
  • TEMPORARY tables (which are not covered by autovacuum).

This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:

  • Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
  • Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
  • General rounding errors.

Typically, the estimate from pg_class will be faster and more accurate.

Answer to actual question

First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,

And whether it ...

... is possible at the moment the count pass my constant value, it will
stop the counting (and not wait to finish the counting to inform the
row count is greater).

Yes. You can use a subquery with LIMIT:

SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;

Postgres actually stops counting beyond the given limit; you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
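
As an illustration only, here is one way the capped count could feed the percentage from the question - percentual_calculus() is replaced by an explicit division, and 500000 stands in for the predefined constant (assumes token is not empty, else the division fails):

SELECT t.text,
       count(*) AS cnt,
       count(*) * 100.0 / cap.total AS pct  -- stand-in for percentual_calculus()
FROM   token t
CROSS  JOIN (SELECT count(*) AS total
             FROM  (SELECT 1 FROM token LIMIT 500000) sub) cap
GROUP  BY t.text, cap.total
ORDER  BY cnt DESC;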

迷荒 2025-02-01 22:40:44

I did this once in a Postgres app by running:

EXPLAIN SELECT * FROM foo;

Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:

Seq Scan on uids  (cost=0.00..1.21 rows=8 width=75)

You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).

Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate that it was completely worthless even for knowing within a power of 100 how many rows we'd have returned, so we had to abandon that strategy.

But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
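
A sketch of the same idea without regex parsing, using EXPLAIN (FORMAT JSON) - the json operators require Postgres 9.3+, and the function name here is made up for illustration:

CREATE OR REPLACE FUNCTION count_estimate_json(query text) RETURNS bigint AS $$
DECLARE
    plan json;
BEGIN
    -- run EXPLAIN and read the planner's row estimate from the JSON plan
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan->0->'Plan'->>'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;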

枉心 2025-02-01 22:40:44

Reference taken from this blog.

You can use the queries below to find the row count.

Using pg_class:

SELECT reltuples::bigint AS EstimatedCount
FROM   pg_class
WHERE  oid = 'public.TableName'::regclass;

Using pg_stat_user_tables:

SELECT 
    schemaname
    ,relname
    ,n_live_tup AS EstimatedCount 
FROM pg_stat_user_tables 
ORDER BY n_live_tup DESC;
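
To look up a single table instead of listing all of them (schema and table name are placeholders):

SELECT n_live_tup AS EstimatedCount
FROM   pg_stat_user_tables
WHERE  schemaname = 'public'
AND    relname = 'tablename';
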
℡寂寞咖啡 2025-02-01 22:40:44

This is the function (not mine) that implements @Jonathan Hall's answer:

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec   record;
    rows  integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
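
Example call - the planner's row estimate for a query against the token table from the question:

SELECT count_estimate('SELECT 1 FROM token');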

别把无礼当个性 2025-02-01 22:40:44

How wide is the text column?

With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).

I'd recommend:

  1. If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.

  2. Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column.
    Again, this is to decrease the workload (scan through a narrow column index); a sketch follows below.
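
A sketch of option 2, assuming Postgres 12+ generated columns and the built-in hashtextextended() function (hash collisions are possible but rare, so counts stay close enough for a percentage):

ALTER TABLE token
  ADD COLUMN text_hash bigint
  GENERATED ALWAYS AS (hashtextextended(text, 0)) STORED;

CREATE INDEX ON token (text_hash);

SELECT text_hash, count(*) AS cnt
FROM   token
GROUP  BY text_hash
ORDER  BY cnt DESC;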

Edit:

Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.

戏舞 2025-02-01 22:40:44

In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and then apply count(*):

SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
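
A rough Postgres translation of the same idea, using LIMIT in place of rownum and least() in place of the CASE expression (500000 is the assumed constant):

SELECT least(cnt, 500000) AS myCnt
FROM  (SELECT count(*) AS cnt
       FROM  (SELECT 1 FROM token LIMIT 500001) t) s;
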
心不设防 2025-02-01 22:40:44

You can also just use SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is. Note that this only approximates the row count when the PK is a gap-free sequence starting at 1 - deletes and rolled-back inserts leave gaps.

单身狗的梦 2025-02-01 22:40:44

For SQL Server (2005 or above) a quick and reliable method is:

SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')   
AND (index_id=0 or index_id=1);

Details about sys.dm_db_partition_stats are explained on MSDN.

The query adds up the rows from all parts of a (possibly) partitioned table.

index_id=0 is an unordered table (heap) and index_id=1 is an ordered table (clustered index).

Even faster (but unreliable) methods are detailed here.
