对看起来简单的 postgresql 查询的算法改进
高级:我可以根据sum
执行order by
、group by
吗 更快吗? (PG 8.4,fwiw.,在一个非小表上......想想O(百万行))
假设我有一个像这样的表:
Table "public.summary"
Column | Type | Modifiers
-------------+-------------------+------------------------------------------------------
ts | integer | not null default nextval('summary_ts_seq'::regclass)
field1 | character varying | not null
otherfield | character varying | not null
country | character varying | not null
lookups | integer | not null
Indexes:
"summary_pk" PRIMARY KEY, btree (ts, field1, otherfield, country)
"ix_summary_country" btree (country)
"ix_summary_field1" btree (field1)
"ix_summary_otherfield" btree (otherfield)
"ix_summary_ts" btree (ts)
并且我想要的查询是:(
select summary.field1,
summary.country,
summary.ts,
sum(summary.lookups) as lookups,
from summary
where summary.country = 'za' and
summary.ts = 1275177600
group by summary.field1, summary.country, summary.ts
order by summary.ts, lookups desc, summary.field1
limit 100;
英语:在特定(的前100个field1) ts,country) 其中“topness”是总和 任何匹配行的查找次数,无论其他字段的值如何)
我真的可以做些什么来加快速度吗?算法上 这似乎是全表扫描之类的事情,但我可能会遗漏一些东西。
High-level: Can I do this order by
, group by
based on sum
any faster? (PG 8.4, fwiw., on a non-tiny table .... think O(millions of rows) )
Suppose I had a table like this:
Table "public.summary"
Column | Type | Modifiers
-------------+-------------------+------------------------------------------------------
ts | integer | not null default nextval('summary_ts_seq'::regclass)
field1 | character varying | not null
otherfield | character varying | not null
country | character varying | not null
lookups | integer | not null
Indexes:
"summary_pk" PRIMARY KEY, btree (ts, field1, otherfield, country)
"ix_summary_country" btree (country)
"ix_summary_field1" btree (field1)
"ix_summary_otherfield" btree (otherfield)
"ix_summary_ts" btree (ts)
And the query I want is:
select summary.field1,
summary.country,
summary.ts,
sum(summary.lookups) as lookups,
from summary
where summary.country = 'za' and
summary.ts = 1275177600
group by summary.field1, summary.country, summary.ts
order by summary.ts, lookups desc, summary.field1
limit 100;
(English: top 100 field1's at a particular (ts,country) where 'topness' is the sum
of lookups for any matching row, regardless of value of otherfield)
Is there anything I can really do to speed this up? Algorithmically
this seems to be a full table scan kind of thing, but I might be missing something.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
此查询的任何查询计划都必须扫描与 WHERE 条件匹配的每一行,并按分组条件将它们汇总 - 也就是说,工作量与分组依据的输入行数成正比,而不是与分组依据的输入行数成正比。结果行。
对于这样的查询,最有效的查询计划是单个索引扫描。如果您按照 (country, ts) 的顺序建立索引,这应该是可能的;使用该索引,这种形式的每个可能的查询都会解析为索引上的连续范围。不过,这仍然需要内存中排序 - 可以使用不同的索引来避免这种情况。
不过,正如其他人所说,发布执行计划是最好的选择。
Any query plan for this query will have to scan every row that matches the WHERE conditions, rolling them up by the grouping conditions - that is, the amount of work is proportional to the number of input rows to the group by, not the number of result rows.
The most efficient query plan possible for a query like this is a single index scan. This ought to be possible if you build an index on (country, ts) in that order; with that index, every possible query of this form resolves to a contiguous range over the index. This will still require an in-memory sort, though - it may be possible to avoid this with a different index.
As others have said, though, posting an execution plan is your best option.
为了能够提出任何建议,您应该发布查询的执行计划。
“OMG Ponies”是对的:limit 100 会将整体结果限制为 100 行,它不适用于个别组!
Postgres Wiki 中有一篇很好的文章,解释了如何发布与慢速查询相关的问题:
http: //wiki.postgresql.org/wiki/SlowQueryQuestions
In order to be able to suggest anything, you should post the execution plan of the query.
And "OMG Ponies" is right: limit 100 will limit the overall result to 100 rows, it will not work on individual groups!
There is a nice article in the Postgres Wiki that explains how to post a question related to a slow query:
http://wiki.postgresql.org/wiki/SlowQueryQuestions
(country, ts) 上的索引是最好的选择(就像 Nick Johnson 所建议的那样),此外您可能还想提高 work_mem 如果它设置得不是很高。如果需要,您可以在运行时设置它(如果设置得非常高,则推荐)。它将有助于将您的排序保留在内存中,并且不会溢出到磁盘(如果发生这种情况)。
要获得真正的帮助,我们需要查看 EXPLAIN ANALYZE,将其发布在explain.depesz.com 上可以使其非常具有可读性。
Index on (country, ts) is a best bet (like Nick Johnson suggests), and additionally you may want to raise work_mem if its not set very high. You can SET this at runtime if needed (and if making it very high, then recommended). It will help keep your sorts in memory and not spill to disk (if thats happening).
For real help, we'll need to see an EXPLAIN ANALYZE, posting it on explain.depesz.com can make it very readable.