加速内连接和子查询,同时限制行大小和表成员身份
我正在开发一个 RSS feed 阅读器,它使用贝叶斯过滤器来过滤掉无聊的博客文章。
Stream 表旨在充当 FIFO 缓冲区,Web 应用程序将从中消耗“条目”。我用它来存储条目、用户和贝叶斯过滤器分类之间的临时关系。
用户将条目标记为已读后,它将被添加到元数据表中(以便用户不会看到他们已经阅读过的材料),并从流表中删除。每三分钟,后台进程将用新条目重新填充流表(即每当守护程序在检查 rss feed 更新后添加新条目时)。
问题:我提出的查询非常慢。更重要的是,Stream表一次只需要保存一百个未读条目;它将减少重复,使处理速度更快,并为我显示条目的方式提供一定的灵活性。
查询(在没有索引的 3600 个项目上花费大约 9 秒):
insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id
where subscriptions_users.user_id = 1
and entries.id not in (select entry_id
from metadata
where metadata.user_id = 1)
and entries.id not in (select entry_id
from stream where user_id = 1);
查询解释:将用户订阅列表 (subscriptions_users) 中用户尚未读取的(即不存在于元数据中)且其中的所有条目插入到流中流中尚不存在。
尝试的解决方案:在末尾添加 limit 100 可以显着加快查询速度,但是在重复执行时,将继续添加表中尚不存在的不同的 100 个条目集(每个成功的查询花费的时间越来越长)。
这很接近,但不完全是我想做的。
有谁有任何建议(nosql?)或知道更有效的编写查询的方法?
I'm developing an rss feed reader that uses a bayesian filter to filter out boring blog posts.
The Stream table is meant to act as a FIFO buffer from which the webapp will consume 'entries'. I use it to store the temporary relationship between entries, users and bayesian filter classifications.
After a user marks an entry as read, it will be added to the metadata table (so that a user isn't presented with material they have already read), and deleted from the stream table. Every three minutes, a background process will repopulate the Stream table with new entries (i.e. whenever the daemon adds new entries after the checks the rss feeds for updates).
Problem: The query I came up with is hella slow. More importantly, the Stream table only needs to hold one hundred unread entries at a time; it'll reduce duplication, make processing faster and give me some flexibility with how I display the entries.
The query (takes about 9 seconds on 3600 items with no indexes):
insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id
where subscriptions_users.user_id = 1
and entries.id not in (select entry_id
from metadata
where metadata.user_id = 1)
and entries.id not in (select entry_id
from stream where user_id = 1);
The query explained: insert into stream all of the entries from a user's subscription list (subscriptions_users) that the user has not read (i.e. do not exist in metadata) and which do not already exist in the stream.
Attempted solution: adding limit 100 to the end speeds up the query considerably, but upon repeated executions will keep on adding a different set of 100 entries that do not already exist in the table (with each successful query taking longer and longer).
This is close but not quite what I wanted to do.
Does anyone have any advice (nosql?) or know a more efficient way of composing the query?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
使用:
在 MySQL 中,
LEFT JOIN/IS NULL
是获取一个表中存在的数据而不是另一个表中存在的数据的最有效方法。参考文献link在查看索引之前检查查询性能。
在 Postgres 中:
NOT IN
NOT EXISTS
LEFT JOIN / IS NULL
...等效。
Use:
In MySQL, the
LEFT JOIN/IS NULL
is the most efficient means of getting data that exists in one table, but not another. Reference linkCheck the query performance before looking at indexes.
In Postgres:
NOT IN
NOT EXISTS
LEFT JOIN / IS NULL
...are equivalent.
然后我会尝试从一些索引开始...
或 LEFT JOIN NULL (和索引)
Then I would try to start off with some indexes...
OR LEFT JOIN NULL (And Indexes)
优化选择的一种方法是用联接替换子查询。
类似于:
您必须确保左连接的连接条件正确。我不确定你的确切架构是什么,所以我不能。
此外,添加索引也会有所帮助。
One way to optimize the select is to replace the subqueries with joins.
Something like:
You would have to make sure that the join conditions for the left join are correct. I am not sure what your exact schema is, so I can't.
Also, adding indexes would also help.