Fast way to discover the row count of a table in PostgreSQL

Published on 2024-12-12 12:13:26

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.

I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.

Is it possible to stop counting as soon as my constant value is surpassed?

I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.

Something like this:

SELECT text,count(*), percentual_calculus()  
FROM token  
GROUP BY text  
ORDER BY count DESC;
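
One hedged way to write the query the asker seems to intend, capping the denominator at the constant with a window function over the grouped counts (percentual_calculus is replaced by an inline expression; 500000 stands in for the predefined constant):

SELECT text,
       count(*) AS cnt,
       100.0 * count(*)
            / least(sum(count(*)) OVER (), 500000) AS pct  -- cap total at the constant
FROM   token
GROUP  BY text
ORDER  BY cnt DESC;

Note this still scans the whole table; the answers below discuss how to avoid that.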

Comments (8)

甜心小果奶 2024-12-19 12:13:26

Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.

(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)

Exact count

Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.

SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate

Extremely fast:

SELECT reltuples AS estimate FROM pg_class WHERE relname = 'mytable';

Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM is run often enough, where "enough" is defined by the level of write activity on your table.

Safer estimate

The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:

SELECT c.reltuples::bigint AS estimate
FROM   pg_class c
JOIN   pg_namespace n ON n.oid = c.relnamespace
WHERE  c.relname = 'mytable'
AND    n.nspname = 'myschema';

The cast to bigint formats the real number nicely, especially for big counts.

Better estimate

SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = 'myschema.mytable'::regclass;

Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.

In Postgres 9.4+, replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') to get NULL instead of an exception for invalid table names.

Better estimate yet (for very little added cost)

This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
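
Adding up per-partition estimates might be sketched like this for a single level of partitioning (a sketch, not from the original answer; pg_inherits links each partition to its parent):

SELECT sum(c.reltuples)::bigint AS estimate
FROM   pg_inherits i
JOIN   pg_class c ON c.oid = i.inhrelid
WHERE  i.inhparent = 'myschema.mytable'::regclass;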

We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:

These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.

Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:

Minimal form

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM   pg_class
WHERE  oid = 'mytable'::regclass;  -- your table here

Safe and explicit

SELECT (CASE WHEN c.reltuples < 0 THEN NULL       -- never vacuumed
             WHEN c.relpages = 0 THEN float8 '0'  -- empty table
             ELSE c.reltuples / c.relpages END
     * (pg_catalog.pg_relation_size(c.oid)
      / pg_catalog.current_setting('block_size')::int)
       )::bigint
FROM   pg_catalog.pg_class c
WHERE  c.oid = 'myschema.mytable'::regclass;      -- schema-qualified table here

Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:

If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.

If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)

If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)

Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.

Table and schema qualifications make it immune to any search_path and scope.

Either way, the query consistently takes < 0.1 ms for me.

TABLESAMPLE SYSTEM (n) in Postgres 9.5+

SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);

As @a_horse commented, the TABLESAMPLE clause added to the SELECT command can be useful if the statistics in pg_class are not current enough for some reason. For example:

  • No autovacuum running.
  • Immediately after a large INSERT / UPDATE / DELETE.
  • TEMPORARY tables (which are not covered by autovacuum).

This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error; it's your pick. Accuracy depends on several factors:

  • Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
  • Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
  • General rounding errors.

Typically, the estimate from pg_class will be faster and more accurate.

Answer to actual question

First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,

And whether it ...

... is possible at the moment the count pass my constant value, it will
stop the counting (and not wait to finish the counting to inform the
row count is greater).

Yes. You can use a subquery with LIMIT:

SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;

Postgres actually stops counting beyond the given limit, you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
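
A small refinement on the same idea, not from the original answer: set the limit to the constant plus one, so you can tell "exactly 500000" apart from "more than 500000":

SELECT count(*) AS capped_count
FROM  (SELECT 1 FROM token LIMIT 500001) t;
-- 500001 means "more than 500000 rows"; any smaller result is the exact count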

别在捏我脸啦 2024-12-19 12:13:26

I did this once in a postgres app by running:

EXPLAIN SELECT * FROM foo;

Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:

Seq Scan on uids  (cost=0.00..1.21 rows=8 width=75)

You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).

Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate that it was completely worthless; we couldn't even tell within a factor of 100 how many rows would be returned, so we had to abandon that strategy.

But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.

流绪微梦 2024-12-19 12:13:26

Reference taken from this Blog.

You can use the queries below to find the row count.

Using pg_class:

SELECT reltuples::bigint AS EstimatedCount
FROM   pg_class
WHERE  oid = 'public.TableName'::regclass;

Using pg_stat_user_tables:

SELECT 
    schemaname
    ,relname
    ,n_live_tup AS EstimatedCount 
FROM pg_stat_user_tables 
ORDER BY n_live_tup DESC;
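
To target a single table instead of listing all of them, the same view can be filtered (schema and table names here are placeholders):

SELECT n_live_tup AS EstimatedCount
FROM   pg_stat_user_tables
WHERE  schemaname = 'public'
AND    relname = 'mytable';
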
她比我温柔 2024-12-19 12:13:26

This is the function (not mine) that implements @Jonathan Hall's answer:

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec   record;
    rows  integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
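
A usage sketch, passing an arbitrary query as text (the number returned is the planner's estimate, so it depends on your statistics):

SELECT count_estimate('SELECT * FROM token WHERE text LIKE ''a%''');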

落在眉间の轻吻 2024-12-19 12:13:26

How wide is the text column?

With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).

I'd recommend:

  1. If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.

  2. Alternatively, creating a generated column with a hash of the text, then GROUP BY the hash column.
    Again, this is to decrease the workload (scanning through a narrow column index).
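
The second suggestion might be sketched as follows, assuming Postgres 12+ for stored generated columns; hashtextextended is a built-in immutable hash function, but any stable hash would do:

ALTER TABLE token
  ADD COLUMN text_hash bigint
  GENERATED ALWAYS AS (hashtextextended(text, 0)) STORED;

CREATE INDEX ON token (text_hash);

SELECT text_hash, count(*)   -- group by the narrow hash
FROM   token                 -- instead of the wide text column
GROUP  BY text_hash
ORDER  BY count(*) DESC;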

Edit:

Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.

梦里°也失望 2024-12-19 12:13:26

In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and then apply count(*):

SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
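
In PostgreSQL, which has no rownum, the same idea can be sketched with LIMIT plus least() (table name and threshold taken from the question):

SELECT least(cnt, 500000) AS myCnt
FROM  (SELECT count(*) AS cnt
       FROM  (SELECT 1 FROM token LIMIT 500001) t) s;
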
夏见 2024-12-19 12:13:26

You can also just run SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is. (Note this only approximates the row count, and only if the PK is a gap-free sequence starting at 1.)

无所的.畏惧 2024-12-19 12:13:26

For SQL Server (2005 or above) a quick and reliable method is:

SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')   
AND (index_id=0 or index_id=1);

Details about sys.dm_db_partition_stats are explained in MSDN.

The query adds rows from all parts of a (possibly) partitioned table.

index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)

Even faster (but unreliable) methods are detailed here.
