Is there anything faster than "count" for large tables?
Here is my query:
select word_id, count(sentence_id)
from sentence_word
group by word_id
having count(sentence_id) > 100;
The table sentence_word contains 3 fields: word_id, sentence_id and a primary key id.
It has 350k+ rows.
This query takes a whopping 85 seconds and I'm wondering (hoping, praying?) there is a faster way to find all the word_ids that have more than 100 sentence_ids.
I've tried taking out the select count part, and just doing 'having count(1)' but neither speeds it up.
I'd appreciate any help you can lend. Thanks!
Comments (5)
If you don't already have one, create a composite index on sentence_id, word_id.
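A minimal sketch of such an index, assuming the table and column names from the question. The index name is made up, and the column order is my assumption rather than something stated above: an index that leads with word_id is usually the one a GROUP BY word_id can use directly, so it may be worth trying both orders and comparing the plans.

```sql
-- Hypothetical index name; table and columns are from the question.
-- Leading with word_id keeps each word's rows together in the index.
-- Having sentence_id in the index as well can let the engine answer
-- the query from the index without touching the table.
CREATE INDEX ix_sentence_word_word_sentence
    ON sentence_word (word_id, sentence_id);
```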
There's a problem with this... Either the table has duplicate word/sentence pairs, or it doesn't.
If it does have duplicate word/sentence pairs, you should be using this code to get the correct answer:
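The code for this case did not survive in this copy of the answer. A likely form, assuming the intent is to count each sentence once per word, uses COUNT(DISTINCT ...); treat it as a sketch rather than the answerer's exact query.

```sql
-- Count each sentence only once per word, even if the same
-- word/sentence pair is stored more than once.
SELECT word_id, COUNT(DISTINCT sentence_id) AS sentence_count
FROM sentence_word
GROUP BY word_id
HAVING COUNT(DISTINCT sentence_id) > 100;
```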
If the table does not have duplicate word/sentence pairs... then you shouldn't count sentence_ids, you should just count rows.
In which case, you can create an index on word_id only, for optimum performance.
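For the no-duplicates case, counting rows might look like the sketch below. It assumes every row is one distinct word/sentence pair, so the row count per word_id equals the number of sentences.

```sql
-- With no duplicate pairs, rows per word_id = sentences per word_id,
-- so COUNT(*) gives the same answer without inspecting sentence_id.
SELECT word_id, COUNT(*) AS sentence_count
FROM sentence_word
GROUP BY word_id
HAVING COUNT(*) > 100;
```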
If that query is often performed, and the table rarely updated, you could keep an auxiliary table with word ids and corresponding sentence counts -- hard to think of any further optimization beyond that!
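A rough sketch of that auxiliary table, using generic SQL. The table name word_sentence_count, its column names, and the INT types are assumptions for illustration; the question does not give the column types.

```sql
-- Hypothetical summary table: one row per word with its sentence count.
CREATE TABLE word_sentence_count (
    word_id        INT NOT NULL PRIMARY KEY,
    sentence_count INT NOT NULL
);

-- Populate it once (and again whenever sentence_word changes).
INSERT INTO word_sentence_count (word_id, sentence_count)
SELECT word_id, COUNT(sentence_id)
FROM sentence_word
GROUP BY word_id;

-- The original question then becomes a simple filter on a small table.
SELECT word_id, sentence_count
FROM word_sentence_count
WHERE sentence_count > 100;
```

The cost is that the counts go stale whenever sentence_word is modified, so this only pays off if updates are rare or a refresh step is acceptable.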
Your query is fine, but it needs a bit of help (indexes) to get faster results.
I don't have my resources at hand (or access to SQL), but I'll try to help you from memory.
Conceptually, the only way to answer that query is to count all the records that share the same word_id. That means that the query engine needs a fast way to find those records. Without an index on word_id, the only thing the database can do is go through the table one record at a time and keep running totals of every single distinct word_id it finds. That would usually require a temporary table and no results can be dispatched until the whole table is scanned. Not good.
With an index on word_id, it still has to go through the table, so you would think it wouldn't help much. However, the SQL engine can now read the rows in word_id order and compute the count for each word_id without waiting until the end of the table: as soon as it finishes a word_id it can dispatch that row and its count (if it passes your HAVING clause), or discard it (if it doesn't). That results in lower memory load on the server, possibly partial responses, and the temporary table is no longer needed. A second aspect is parallelism: with an index on word_id, the engine can split the job into chunks and use separate processor cores to run the query in parallel (depending on hardware capabilities and existing workload).
That might be enough to help your query, but you will have to try it to see:
(T-SQL syntax; you didn't specify which SQL product you are using)
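The index statement itself is missing from this copy of the answer. A plausible T-SQL form, assuming SQL Server and a made-up index name, would be:

```sql
-- T-SQL sketch; the index name is hypothetical.
-- A nonclustered index on word_id lets the engine read rows grouped by word.
CREATE NONCLUSTERED INDEX IX_sentence_word_word_id
    ON sentence_word (word_id);
```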
If that's not enough (or doesn't help at all), there are two other solutions.
First, SQL allows you to precompute the COUNT(*) by using indexed views and other mechanisms. I don't have the details at hand (and I don't do this often). If your data doesn't change often, that would give you faster results but with a cost in complexity and a bit of storage.
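For SQL Server specifically, an indexed view along these lines is one way to precompute the counts. This is a sketch under assumptions: the table is assumed to live in the dbo schema, the view and index names are made up, and indexed views have further requirements (certain SET options, for example) that are not shown here.

```sql
-- Indexed views must be schema-bound, use two-part table names,
-- and use COUNT_BIG(*) rather than COUNT(*) when they contain GROUP BY.
CREATE VIEW dbo.word_sentence_counts
WITH SCHEMABINDING
AS
SELECT word_id, COUNT_BIG(*) AS sentence_count
FROM dbo.sentence_word
GROUP BY word_id;
GO  -- batch separator for SSMS/sqlcmd; CREATE VIEW must start its own batch

-- The unique clustered index is what materializes the view's rows.
CREATE UNIQUE CLUSTERED INDEX IX_word_sentence_counts
    ON dbo.word_sentence_counts (word_id);
GO

-- Query the precomputed counts; NOEXPAND asks the optimizer to use
-- the view's index directly instead of expanding the view definition.
SELECT word_id, sentence_count
FROM dbo.word_sentence_counts WITH (NOEXPAND)
WHERE sentence_count > 100;
```

The engine keeps the view in sync on every insert, update, or delete against sentence_word, which is where the extra write cost and storage come from.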
Also, you might want to consider storing the results of the query in a separate table. That is practical only if the data never changes, or changes on a precise schedule (say, during a data refresh at 2 in the morning), or if it changes very little and you can live with imperfect results for a few hours (you would have to schedule a periodic data refresh); that's the moral equivalent of a poor man's data warehouse.
The best way to find out for sure what works for you is to run the query and look at the query plan with and without some candidate indexes like the one above.
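One way to make that comparison in T-SQL (again assuming SQL Server, since the product was not specified) is to turn on the I/O and timing statistics, run the query, then add or drop a candidate index and run it again. Other products expose similar information through some form of EXPLAIN.

```sql
-- Report logical reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- The query from the question; run it before and after creating an index
-- and compare the numbers shown in the messages output.
SELECT word_id, COUNT(sentence_id)
FROM sentence_word
GROUP BY word_id
HAVING COUNT(sentence_id) > 100;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```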
There is, surprisingly, an even faster way to accomplish that on large data sets: