我正面临着一个突然弹出的PostgreSQL数据库,面临棘手的错误,我无法在其他地方复制。
该错误突然发生,没有任何已知的维护或升级,似乎与特定数据库上下文有关。
文档
似乎来回走动,以下是在网上搜索错误消息时发现的链接列表:
我检测到的错误是:
SELECT version();
-- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
我安装的唯一扩展名是:
SELECT extname, extversion FROM pg_extension;
-- "plpgsql" "1.0"
-- "postgis" "3.1.1"
症状
错误主要症状是 找不到变量在子计划列表中:
SELECT COUNT(*) FROM items;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000
不影响所有表,而只是一些特定的表。
有趣的是,它仅部分破裂了:
SELECT COUNT(id) FROM items; -- 213
SELECT COUNT(*) FROM items WHERE id > 0; -- 213
它仅影响 Count(*)
汇总最有可能是因为*
占位符。
此外,错误与查询计划有关,而不是与查询本身有关的问题,因为
EXPLAIN SELECT COUNT(*) FROM item;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000
如果没有实际执行查询,则失败也是如此。
Digging into the PostgreSQL code on GitHub the error message appears here and is related to the function
该指针应解释为什么使用*
占位符而不是显式列名时会发生这种情况。
可重复性
这是一个棘手的错误,仅仅是因为表现出来很困难,并且该错误在某种程度上是恶性的,因为到目前为止,我无法理解哪种情况使它实现。
看来此错误正在在特定的情况下增加(例如,带有等效消息和症状的错误,用 pgroonga
延长),但在我的情况下,我现在无法平行。
在不同的情况下,我可能面临同等问题,但我无法成功捕获一个简单的MCVE来发现它。
CREATE TABLE t AS SELECT CAST(c AS text) FROM generate_series(1, 10000) AS c;
-- SELECT 10000
CREATE INDEX t_c ON t(c);
-- CREATE INDEX
VACUUM t;
-- VACUUM
SELECT COUNT(*) FROM t;
-- 10000
按预期工作。问题的表依赖于 postgis
扩展索引,但是我不能再复制它:
CREATE EXTENSION postgis;
-- CREATE EXTENSION
CREATE TABLE test(
id serial,
geom geometry(Point, 4326)
);
-- CREATE TABLE
INSERT INTO test
SELECT x, ST_MakePoint(x/10000., x/10000.) FROM generate_series(1, 10000) AS x;
-- INSERT 0 10000
CREATE INDEX test_index ON test USING GIST(geom);
-- CREATE INDEX
VACUUM test;
-- VACUUM
SELECT COUNT(*) FROM test;
-- 10000
按预期工作。
当我倾倒并恢复错误的数据库时,问题就会消失。
寻找MCVE
在试图重现该错误以构建MCVE和单元测试以突出显示并将其报告给开发人员时, 。在倾倒数据库并重新创建新实例时,该错误只是消失。
因此,我可以重现此错误的独特方法是使用原始数据库,但是我无法成功准备数据库的转储以在其他地方重现该错误。
这就是全部内容,我正在寻找提示在我的上下文中重现该错误的提示。
在这一点上,我的分析是:
- 该错误与数据库状态或某些恢复转储时相同的元数据有关;
- 当没有过滤子句时,使用
*
通配符时,该错误与 count
函数有关;
- 该错误不是一般的,因为它仅影响具有特定索引的特定表;
- 该错误位于查询计划者方面。
似乎有些元或状态损坏阻止查询计划者找到应用 Count
方法的列名。
问题
我的问题是:如何更深入地调查此错误:
- 可再现(保留它的转储技术);
- 还是可以对开发人员理解(元查询以确定问题在数据库中的位置)?
这将是另一种用语:
- 如何重现使查询计划者崩溃的上下文?
- 有没有办法使计划者更多的详细信息以获取有关错误的更多详细信息?
- 我可以在目录中遇到哪些疑问以捕获错误的上下文?
I am facing a tricky error with a PostgreSQL Database that suddenly popped up and I cannot reproduce elsewhere.
The error happened suddenly without any known maintenance or upgrade and seems to be related to a specific database context.
Documentation
The bug seems to go back and forth, here is a list of links found when searching over the web for the error message:
The product version I have detected the error is:
SELECT version();
-- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
And the only extensions I have installed are:
SELECT extname, extversion FROM pg_extension;
-- "plpgsql" "1.0"
-- "postgis" "3.1.1"
Symptom
The error main symptom is variable not found in subplan target list:
SELECT COUNT(*) FROM items;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000
And does not affect all tables, just some specific ones.
What is interesting is that it is only partially broken:
SELECT COUNT(id) FROM items; -- 213
SELECT COUNT(*) FROM items WHERE id > 0; -- 213
And it only affect the COUNT(*)
aggregate most probably because of the *
placeholder.
Further more the error is related to the query plan not to the query itself as:
EXPLAIN SELECT COUNT(*) FROM item;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000
Fails as well without actually executing the query.
Digging into the PostgreSQL code on GitHub the error message appears here and is related to the function search_indexed_tlist_for_var
in case it returns nothing.
This pointer should explain why it happens when using *
placeholder instead of an explicit column name.
Reproducibility
It is a tricky bug, simply because showing it exists is difficult and the bug is somehow vicious as by now I cannot understand which conditions make it happen.
It seems this bug is raising in specific context (eg. bug with equivalent message and symptom reported with the PGroonga
extension) but in my case I cannot make a parallel by now.
It is likely I am facing an equivalent problem in a different context but I could not succeed in capturing a simple MCVE to spot it.
CREATE TABLE t AS SELECT CAST(c AS text) FROM generate_series(1, 10000) AS c;
-- SELECT 10000
CREATE INDEX t_c ON t(c);
-- CREATE INDEX
VACUUM t;
-- VACUUM
SELECT COUNT(*) FROM t;
-- 10000
Works as expected. Table having the issue relies on postgis
extension index, but again I cannot reproduce it:
CREATE EXTENSION postgis;
-- CREATE EXTENSION
CREATE TABLE test(
id serial,
geom geometry(Point, 4326)
);
-- CREATE TABLE
INSERT INTO test
SELECT x, ST_MakePoint(x/10000., x/10000.) FROM generate_series(1, 10000) AS x;
-- INSERT 0 10000
CREATE INDEX test_index ON test USING GIST(geom);
-- CREATE INDEX
VACUUM test;
-- VACUUM
SELECT COUNT(*) FROM test;
-- 10000
Works as expected.
And when I dump and restore the faulty database the problem vanishes.
Looking for a MCVE
When trying to reproduce the bug in order to build an MCVE and unit tests to highlight it and report it to developers I face a limitation. When dumping the database and recreating to a new instance, the bug simply vanishes.
So the unique way I can reproduce this bug is using the original database but I could not succeed to prepare a dump of the database to reproduce the bug elsewhere.
This what it is all about, I am looking for hints to reproduce the bug in my context.
At this point my analysis is:
- The bug is related to the database state or to some meta data that is not equally the same when the dump is restored;
- The bug is related to the
COUNT
function when using the *
wildcard when there is no filtering clause;
- The bug is not general as it affects only specific tables with specific index;
- The bug reside at the query planner side.
Seems like some meta or state corruption prevent the query planner to find a column name to apply the COUNT
method.
Question
My question is: How can I deeper investigate this bug to make it:
- either reproducible (a dump technique preserving it);
- or understandable to a developer (meta queries to identify where the problem resides in the database)?
Another way to phrase it would be:
- How can I reproduce the context which is making the query planner crashes?
- Is there a way to make the planner more verbose in order to get more details on the error?
- What queries can I run against the catalog to capture the faulty context?
发布评论