加速内连接和子查询，同时限制行大小和表成员身份

发布于 2024-08-24 16:04:22 字数 1108 浏览 4 评论 0原文

我正在开发一个 RSS feed 阅读器，它使用贝叶斯过滤器来过滤掉无聊的博客文章。

Stream 表旨在充当 FIFO 缓冲区，Web 应用程序将从中消耗“条目”。我用它来存储条目、用户和贝叶斯过滤器分类之间的临时关系。

用户将条目标记为已读后，它将被添加到元数据表中（以便用户不会看到他们已经阅读过的材料），并从流表中删除。每三分钟，后台进程将用新条目重新填充流表（即每当守护程序在检查 rss feed 更新后添加新条目时）。

问题：我提出的查询非常慢。更重要的是，Stream表一次只需要保存一百个未读条目；它将减少重复，使处理速度更快，并为我显示条目的方式提供一定的灵活性。

查询（在没有索引的 3600 个项目上花费大约 9 秒）：

insert into stream (entry_id, user_id) 
select entries.id, subscriptions_users.user_id 
 from entries 
inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id 
where subscriptions_users.user_id = 1 
  and entries.id not in (select entry_id 
                           from metadata 
                          where metadata.user_id = 1) 
  and entries.id not in (select entry_id 
                          from stream where user_id = 1);

查询解释：将用户订阅列表 (subscriptions_users) 中用户尚未读取的（即不存在于元数据中）且其中的所有条目插入到流中流中尚不存在。

尝试的解决方案：在末尾添加 limit 100 可以显着加快查询速度，但是在重复执行时，将继续添加表中尚不存在的不同的 100 个条目集（每个成功的查询花费的时间越来越长）。

这很接近，但不完全是我想做的。

有谁有任何建议（nosql？）或知道更有效的编写查询的方法？

原文

I'm developing an rss feed reader that uses a bayesian filter to filter out boring blog posts.

The Stream table is meant to act as a FIFO buffer from which the webapp will consume 'entries'. I use it to store the temporary relationship between entries, users and bayesian filter classifications.

After a user marks an entry as read, it will be added to the metadata table (so that a user isn't presented with material they have already read), and deleted from the stream table. Every three minutes, a background process will repopulate the Stream table with new entries (i.e. whenever the daemon adds new entries after the checks the rss feeds for updates).

Problem: The query I came up with is hella slow. More importantly, the Stream table only needs to hold one hundred unread entries at a time; it'll reduce duplication, make processing faster and give me some flexibility with how I display the entries.

The query (takes about 9 seconds on 3600 items with no indexes):

insert into stream (entry_id, user_id) 
select entries.id, subscriptions_users.user_id 
 from entries 
inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id 
where subscriptions_users.user_id = 1 
  and entries.id not in (select entry_id 
                           from metadata 
                          where metadata.user_id = 1) 
  and entries.id not in (select entry_id 
                          from stream where user_id = 1);

The query explained: insert into stream all of the entries from a user's subscription list (subscriptions_users) that the user has not read (i.e. do not exist in metadata) and which do not already exist in the stream.

Attempted solution: adding limit 100 to the end speeds up the query considerably, but upon repeated executions will keep on adding a different set of 100 entries that do not already exist in the table (with each successful query taking longer and longer).

This is close but not quite what I wanted to do.

Does anyone have any advice (nosql?) or know a more efficient way of composing the query?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

迷途知返 2024-08-31 16:04:22

使用：

INSERT INTO STREAM 
  (entry_id, user_id) 
   SELECT e.id, 
          su.user_id 
     FROM ENTRIES e
     JOIN SUBSCRIPTIONS_USERS su ON su.subscription_id = e.subscription_id 
                                AND su.user_id = 1 
LEFT JOIN METADATA md ON md.entry_id = e.id
                     AND md.user_id = 1
LEFT JOIN STREAM s ON s.entry_id = e.id
                  AND s.user_id = 1
    WHERE md.entry_id IS NULL
      AND s.entry_id IS NULL

在 MySQL 中，LEFT JOIN/IS NULL 是获取一个表中存在的数据而不是另一个表中存在的数据的最有效方法。参考文献link

在查看索引之前检查查询性能。

在 Postgres 中：

NOT IN
NOT EXISTS
LEFT JOIN / IS NULL

...等效。

Use:

INSERT INTO STREAM 
  (entry_id, user_id) 
   SELECT e.id, 
          su.user_id 
     FROM ENTRIES e
     JOIN SUBSCRIPTIONS_USERS su ON su.subscription_id = e.subscription_id 
                                AND su.user_id = 1 
LEFT JOIN METADATA md ON md.entry_id = e.id
                     AND md.user_id = 1
LEFT JOIN STREAM s ON s.entry_id = e.id
                  AND s.user_id = 1
    WHERE md.entry_id IS NULL
      AND s.entry_id IS NULL

In MySQL, the LEFT JOIN/IS NULL is the most efficient means of getting data that exists in one table, but not another. Reference link

Check the query performance before looking at indexes.

In Postgres:

NOT IN
NOT EXISTS
LEFT JOIN / IS NULL

...are equivalent.

回复收藏 0 原文

好倦 2024-08-31 16:04:22

查询（大约需要 9 秒
3600 个没有索引的项目）：

然后我会尝试从一些索引开始...

或 LEFT JOIN NULL （和索引）

SELECT *
FROM TABLEA A LEFT JOIN
    TABLEB B ON A.ID = B. ID
WHERE B.ID IS NULL

The query (takes about 9 seconds on
3600 items with no indexes):

Then I would try to start off with some indexes...

OR LEFT JOIN NULL (And Indexes)

SELECT *
FROM TABLEA A LEFT JOIN
    TABLEB B ON A.ID = B. ID
WHERE B.ID IS NULL

回复收藏 0 原文

你在我安 2024-08-31 16:04:22

优化选择的一种方法是用联接替换子查询。

类似于：

select entries.id, subscriptions_users.user_id
from entries 
inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id 
left join metadata  md on (user_id,entry_id)
left join stream  str on (user_id, entry_id) 
where subscriptions_users.user_id = 1 and where md.user_id is null and str.user_id is null;

您必须确保左连接的连接条件正确。我不确定你的确切架构是什么，所以我不能。

此外，添加索引也会有所帮助。

One way to optimize the select is to replace the subqueries with joins.

Something like:

select entries.id, subscriptions_users.user_id
from entries 
inner join subscriptions_users on subscriptions_users.subscription_id = entries.subscription_id 
left join metadata  md on (user_id,entry_id)
left join stream  str on (user_id, entry_id) 
where subscriptions_users.user_id = 1 and where md.user_id is null and str.user_id is null;

You would have to make sure that the join conditions for the left join are correct. I am not sure what your exact schema is, so I can't.

Also, adding indexes would also help.

回复收藏 0 原文

~没有更多了~