Scoring algorithm in multiple joins

Posted 2024-12-04 03:24:45

I have a list of publications stored in a publications table. Each publication has a many-to-many relation with categories and another many-to-many relation with keywords.

Given a publication, I'd like to find related ones based on a score computed with the following algorithm:

each shared category with other publications counts as one point
each shared keyword with other publications counts as one point
the score value is the sum of the points computed with previous steps
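
As a minimal sketch of the scoring rule above, here it is in plain Python on made-up data. The `categories` and `keywords` dictionaries and the `related_scores` helper are hypothetical names used only for illustration; they are not part of the question's schema.

```python
# Hypothetical in-memory data: publication id -> set of category / keyword ids.
categories = {
    1: {10, 11, 12},
    2: {10, 11},
    3: {12},
    4: {13},
}
keywords = {
    1: {100, 101},
    2: {101},
    3: {100, 101},
    4: {102},
}


def related_scores(current_id):
    """Return {publication_id: score} for every other publication with score > 0."""
    scores = {}
    for pub_id in categories.keys() | keywords.keys():
        if pub_id == current_id:
            continue
        # One point per shared category, one point per shared keyword.
        shared_categories = categories.get(current_id, set()) & categories.get(pub_id, set())
        shared_keywords = keywords.get(current_id, set()) & keywords.get(pub_id, set())
        score = len(shared_categories) + len(shared_keywords)
        if score > 0:
            scores[pub_id] = score
    return scores


scores = related_scores(1)
# Sort by score descending, then by id so ties come out in a stable order.
ranking = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
print(ranking)
```

Publication 4 shares nothing with publication 1, so it does not appear in the ranking at all.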

I want to retrieve the list of related publications, ordered by this score, with a single query.

Right now I have these two queries, which compute the score for categories and for keywords. For the category score:

SELECT c.publication_id, (COUNT(c.category_id)) AS cscore
FROM cat_pub c
WHERE c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC

and for the keyword score

SELECT k.publication_id, (COUNT(k.keyword_id)) AS kscore
FROM key_pub k
WHERE k.keyword_id IN <list of keyword ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC

Finally, I need to JOIN these results with a SELECT query that retrieves the publication data (title, intro, etc.), ordered by score and with a LIMIT clause, to get the publications most relevant to the selected one.

So far I have tried to use these two queries as derived tables in a join:

SELECT p.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5

EXPLAIN shows me that a couple of temporary tables will be used. Could that be a performance problem? Is there a better way to implement this?
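
One possible alternative, sketched here under stated assumptions rather than offered as a measured improvement: stack the category matches and keyword matches with UNION ALL and count the rows per publication, so that a publication matching on only one side is still scored. The example runs on an in-memory SQLite database as a stand-in for the real server, uses made-up rows, and replaces the `<list of ... ids>` placeholders with subqueries on a hypothetical current publication id of 1. Whether this plans better on the real data would still need checking with EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publications (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE cat_pub (publication_id INTEGER, category_id INTEGER);
    CREATE TABLE key_pub (publication_id INTEGER, keyword_id INTEGER);
    -- Made-up sample rows; publication 1 is the "current" one.
    INSERT INTO publications VALUES (1, 'current'), (2, 'both'), (3, 'cat only'), (4, 'unrelated');
    INSERT INTO cat_pub VALUES (1, 10), (1, 11), (2, 10), (3, 10), (3, 11), (4, 12);
    INSERT INTO key_pub VALUES (1, 100), (1, 101), (2, 100), (2, 101), (4, 102);
""")

# Each row of "matches" is one shared category or one shared keyword,
# so COUNT(*) per publication is cscore + kscore.
RELATED_SQL = """
    SELECT p.id, p.title, COUNT(*) AS score
    FROM (
        SELECT c.publication_id
        FROM cat_pub c
        WHERE c.category_id IN (SELECT category_id FROM cat_pub WHERE publication_id = :current)
        UNION ALL
        SELECT k.publication_id
        FROM key_pub k
        WHERE k.keyword_id IN (SELECT keyword_id FROM key_pub WHERE publication_id = :current)
    ) AS matches
    INNER JOIN publications p ON p.id = matches.publication_id
    WHERE p.id <> :current
    GROUP BY p.id, p.title
    ORDER BY score DESC, p.id
    LIMIT 5
"""
rows = conn.execute(RELATED_SQL, {"current": 1}).fetchall()
print(rows)
```

In this sample, publication 3 shares two categories but no keywords with publication 1. Joining the two score subqueries with INNER JOIN would drop it, while the UNION ALL form keeps it with a score of 2.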

Update

In reply to Johan's comment:

Your solution is wrong. Using a LIMIT clause in the subqueries can give inconsistent results, whatever value you pick for the limit. Suppose the subqueries return the following rows (I show 12 of them, but your query would fetch only the first ten):

+-------+--------+ +-------+--------+
| p.id  | cscore | | p.id  | kscore |
+-------+--------+ +-------+--------+
| 27854 | 100    | | 27865 | 100    |
| 27853 | 100    | | 27864 | 100    |
| 27852 | 100    | | 27863 | 100    |
| 27851 | 100    | | 27862 | 100    |
| 27850 | 100    | | 27861 | 100    |
| 27849 | 100    | | 27860 | 100    |
| 27848 | 100    | | 27859 | 100    |
| 27847 | 100    | | 27858 | 100    |
| 27846 | 100    | | 27857 | 100    |
| 27845 | 100    | | 27856 | 100    |
| 27844 | 100    | | 27855 | 100    |
| 1000  | 99     | | 1000  | 99     |
+-------+--------+ +-------+--------+

If the first ten rows all have a cscore of 100 and ten different publications have a kscore of 100, the join produces an empty set. So I get no result at all, while the publication with id 1000 should be the answer and is left out of the result set.

Furthermore, I could try your solution with a LEFT JOIN. In that case only records from the left table are fetched, and each one gets a total score of at most 100, because the missing kscore from the second table is NULL (strictly, 100 + NULL is NULL unless the NULL is replaced with 0, for example via COALESCE). Again the result is wrong, because the highest-scored record should be publication 1000 with a total score of 198 (= 99 + 99).

Your solution cannot produce reliable results.
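
The scenario in the table can be reproduced with a small sketch. It assumes an in-memory SQLite database and two hypothetical tables, `cscores` and `kscores`, that stand in for the results of the two scoring subqueries; the ids and scores are the ones shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cscores (publication_id INTEGER, cscore INTEGER);
    CREATE TABLE kscores (publication_id INTEGER, kscore INTEGER);
""")
# Eleven publications score 100 on categories only, eleven others on keywords only.
conn.executemany("INSERT INTO cscores VALUES (?, 100)", [(i,) for i in range(27844, 27855)])
conn.executemany("INSERT INTO kscores VALUES (?, 100)", [(i,) for i in range(27855, 27866)])
# Publication 1000 scores 99 on both sides.
conn.execute("INSERT INTO cscores VALUES (1000, 99)")
conn.execute("INSERT INTO kscores VALUES (1000, 99)")

JOIN_SQL = """
    SELECT q1.publication_id, q1.cscore + q2.kscore AS score
    FROM (SELECT * FROM cscores ORDER BY cscore DESC {limit}) q1
    INNER JOIN (SELECT * FROM kscores ORDER BY kscore DESC {limit}) q2
        ON q1.publication_id = q2.publication_id
    ORDER BY score DESC
    LIMIT 5
"""
# With LIMIT 10 in each subquery, the two top-ten lists never overlap.
with_limit = conn.execute(JOIN_SQL.format(limit="LIMIT 10")).fetchall()
# Without the inner LIMIT, publication 1000 is found on both sides.
without_limit = conn.execute(JOIN_SQL.format(limit="")).fetchall()

# LEFT JOIN variant: rows survive, but 100 + NULL evaluates to NULL (None in Python).
LEFT_SQL = JOIN_SQL.replace("INNER JOIN", "LEFT JOIN")
left_with_limit = conn.execute(LEFT_SQL.format(limit="LIMIT 10")).fetchall()

print(with_limit)
print(without_limit)
print(left_with_limit)
```

Which ten of the eleven tied rows each limited subquery returns is arbitrary, but publication 1000 can never be among them, so the limited inner join is empty whichever rows are picked.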

醉梦枕江山 2024-12-11 03:24:45

You only need about 5 results from each of the subqueries.
I think it is best to select only the top few rows from each (the rewrites below use LIMIT 10) and use those in the outer query.

Rewrite q1 as:

-- A derived table cannot reference the outer alias p (without LATERAL),
-- so the link to p is left to the join's ON clause.
SELECT c.publication_id, COUNT(*) AS cscore
FROM cat_pub c
WHERE c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC
LIMIT 10

Rewrite q2 as:

-- As with q1, the link to p is left to the join's ON clause.
SELECT k.publication_id, COUNT(*) AS kscore
FROM key_pub k
WHERE k.keyword_id IN <list of keyword ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC
LIMIT 10

Leave the join as is:

SELECT p.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5

Note that COUNT(*) is usually the faster choice, because it does not test for NULL. If the column can contain NULL values and you don't want to include them in the count, name the column explicitly with COUNT(field).
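
The difference in what the two forms count can be seen in a small sketch. It assumes an in-memory SQLite database and a made-up `key_pub` row with a NULL `keyword_id`, which a real junction table would probably not allow; it shows only the counting behavior, not the speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE key_pub (publication_id INTEGER, keyword_id INTEGER);
    -- The third row has a NULL keyword_id purely for illustration.
    INSERT INTO key_pub VALUES (1, 100), (1, 101), (1, NULL);
""")
# COUNT(*) counts rows; COUNT(keyword_id) skips rows where keyword_id is NULL.
count_all, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(keyword_id) FROM key_pub WHERE publication_id = 1"
).fetchone()
print(count_all, count_col)
```

Here COUNT(*) returns 3 and COUNT(keyword_id) returns 2, so the two forms are interchangeable for the score only when the counted column is never NULL.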
