Scoring algorithm in multiple joins
I have a list of publications stored in the publications table. Each publication has a many-to-many relation with categories and also a many-to-many relation with keywords.
Given a publication I'd like to find related ones based on a score value computed with the following algorithm:
- each shared category with other publications counts as one point
- each shared keyword with other publications counts as one point
- the score value is the sum of the points computed with previous steps
I want to retrieve with a single query the list of related publications ordered by this score.
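To make the scoring rule concrete, here is a minimal sketch in plain Python rather than SQL. The function name `related_scores` and the sample ids are hypothetical and chosen only for illustration; publications with no shared category or keyword are simply left out.

```python
def related_scores(current, categories, keywords):
    """Score every other publication: shared categories + shared keywords."""
    scores = {}
    for pub in categories.keys() | keywords.keys():
        if pub == current:
            continue
        shared_cats = categories.get(current, set()) & categories.get(pub, set())
        shared_keys = keywords.get(current, set()) & keywords.get(pub, set())
        score = len(shared_cats) + len(shared_keys)
        if score > 0:
            scores[pub] = score
    # Highest score first; ties are broken by publication id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data: publication id -> set of category / keyword ids.
categories = {1: {10, 11}, 2: {10, 11}, 3: {11}, 4: {12}}
keywords = {1: {100, 101}, 2: {100, 101}, 3: {100, 101}, 4: {102}}

ranking = related_scores(1, categories, keywords)
print(ranking)
```

Publication 2 shares two categories and two keywords with publication 1, publication 3 shares one category and two keywords, and publication 4 shares nothing.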
Now I have these two queries, which compute the score for categories and for keywords. For the category score:
SELECT c.publication_id, (COUNT(c.category_id)) AS cscore
FROM cat_pub c
WHERE c.category_id IN <list of category ids obtained from the current publication>
GROUP BY c.publication_id
ORDER BY cscore DESC
and for the keyword score
SELECT k.publication_id, (COUNT(k.keyword_id)) AS kscore
FROM key_pub k
WHERE k.keyword_id IN <list of keyword ids obtained from the current publication>
GROUP BY k.publication_id
ORDER BY kscore DESC
Finally, I need to JOIN the resulting queries with a SELECT query that retrieves the publication data (title, intro, etc.), orders it by score, and applies a LIMIT clause to get the publications most relevant to the selected one.
So far I have tried using these two queries as derived tables in a join:
SELECT mydata.*, (q1.cscore + q2.kscore) AS score
FROM publications p
INNER JOIN (<cscore query>) q1 ON p.id = q1.publication_id
INNER JOIN (<kscore query>) q2 ON p.id = q2.publication_id
ORDER BY score DESC
LIMIT 5
EXPLAIN shows me that a couple of temporary tables will be used. Could this be a performance problem? Is there a better way to implement this?
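One possible alternative, sketched here and not taken from the thread, is to stack the category hits and the keyword hits with UNION ALL and count them in a single aggregation, so a publication that matches on only one side is still scored. The sketch runs against an in-memory SQLite database through Python's sqlite3 module. The sample rows are invented, the table and column names follow the question, and each (publication_id, category_id) or (publication_id, keyword_id) pair is assumed to be unique. Whether this plans better than the two-subquery join on a given MySQL version would still need to be checked with EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publications (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE cat_pub (publication_id INTEGER, category_id INTEGER);
CREATE TABLE key_pub (publication_id INTEGER, keyword_id INTEGER);
-- Invented sample data, for illustration only.
INSERT INTO publications VALUES
    (1, 'current'), (2, 'both'), (3, 'cats only'), (4, 'unrelated');
INSERT INTO cat_pub VALUES (1, 10), (1, 11), (2, 10), (3, 10), (3, 11), (4, 12);
INSERT INTO key_pub VALUES (1, 100), (1, 101), (2, 100), (2, 101), (4, 102);
""")

RELATED_SQL = """
SELECT p.id, p.title, COUNT(*) AS score
FROM (
    -- one row per category shared with the current publication
    SELECT c.publication_id
    FROM cat_pub c
    JOIN cat_pub mine ON mine.category_id = c.category_id
    WHERE mine.publication_id = :current
    UNION ALL
    -- one row per keyword shared with the current publication
    SELECT k.publication_id
    FROM key_pub k
    JOIN key_pub mine ON mine.keyword_id = k.keyword_id
    WHERE mine.publication_id = :current
) AS hits
JOIN publications p ON p.id = hits.publication_id
WHERE p.id <> :current
GROUP BY p.id, p.title
ORDER BY score DESC, p.id
LIMIT 5
"""

related = conn.execute(RELATED_SQL, {"current": 1}).fetchall()
for row in related:
    print(row)
```

Note that publication 3 has no shared keyword and would be dropped by the INNER JOIN of the two subqueries, while here it keeps its category points.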
Update
In reply to Johan's comment:
Your solution is wrong. Using a LIMIT clause in the subqueries can lead to inconsistent results, whatever value you choose for the limit. Suppose the subqueries return the following results (I show the top rows of each; your query would fetch only the first ten):
+-------+--------+ +-------+--------+
| p.id | cscore | | p.id | kscore |
+-------+--------+ +-------+--------+
| 27854 | 100 | | 27865 | 100 |
| 27853 | 100 | | 27864 | 100 |
| 27852 | 100 | | 27863 | 100 |
| 27851 | 100 | | 27862 | 100 |
| 27850 | 100 | | 27861 | 100 |
| 27849 | 100 | | 27860 | 100 |
| 27848 | 100 | | 27859 | 100 |
| 27847 | 100 | | 27858 | 100 |
| 27846 | 100 | | 27857 | 100 |
| 27845 | 100 | | 27856 | 100 |
| 27844 | 100 | | 27855 | 100 |
| 1000 | 99 | | 1000 | 99 |
+-------+--------+ +-------+--------+
If I have ten records with 100 as cscore and ten different records with 100 as kscore, the join will produce an empty set. So I get no results at all, while the publication with id 1000 should be the answer and it is left out of the result set.
Furthermore, I could consider your solution with a LEFT JOIN. In that case only records from the left table will be fetched, and each record will get a total score of at most 100 (the kscore field from the second table is NULL for them, so the sum itself is NULL unless the NULL is replaced with 0, for example with COALESCE). Again, the result is wrong, because the highest scored record should be p1000 with a total score of 198 (= 99 + 99).
Your solution cannot produce reliable results.
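The failure described above can be reproduced at a smaller scale. The following sketch uses Python's sqlite3 module with two hypothetical tables, cscores and kscores, standing in for the results of the two subqueries, and a subquery limit of 2 instead of 10. The ids other than 1000 are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cscores (publication_id INTEGER, cscore INTEGER);
CREATE TABLE kscores (publication_id INTEGER, kscore INTEGER);
-- Two publications lead on categories only, two others on keywords only,
-- and publication 1000 is just behind the leaders on both.
INSERT INTO cscores VALUES (11, 100), (12, 100), (1000, 99);
INSERT INTO kscores VALUES (21, 100), (22, 100), (1000, 99);
""")


def top_related(sub_limit):
    """Join the two score lists, optionally truncating each one first."""
    limit_clause = "" if sub_limit is None else f"LIMIT {int(sub_limit)}"
    sql = f"""
    SELECT q1.publication_id, q1.cscore + q2.kscore AS score
    FROM (SELECT * FROM cscores ORDER BY cscore DESC {limit_clause}) q1
    INNER JOIN (SELECT * FROM kscores ORDER BY kscore DESC {limit_clause}) q2
        ON q1.publication_id = q2.publication_id
    ORDER BY score DESC
    """
    return conn.execute(sql).fetchall()


truncated = top_related(2)    # each subquery keeps only its own top 2
complete = top_related(None)  # subqueries return every scored publication
print(truncated, complete)
```

With the limit inside the subqueries, the two truncated lists have no id in common and the join is empty. Without it, publication 1000 comes back with a score of 198.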
Comments (1)
You only want 5 results from each of the subqueries.
I think it is best to select only 5 rows in each subquery and use those in the outer query.
Rewrite q1 as:
Rewrite q2 as:
Leave the join as is:
Note that count(*) is usually a faster choice, because it does not test for null. If the field can contain null values and you don't want to include those in the count, then name the field explicitly with count(field).
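The difference in what the two forms count can be seen in a small sketch, again using Python's sqlite3 module with invented rows. It only illustrates the counting semantics. Any speed difference depends on the database engine and is not measured here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE key_pub (publication_id INTEGER, keyword_id INTEGER);
-- Publication 1 has one row whose keyword_id is NULL.
INSERT INTO key_pub VALUES (1, 100), (1, 101), (1, NULL), (2, 100);
""")

rows = conn.execute("""
    SELECT publication_id,
           COUNT(*) AS all_rows,                    -- counts every row in the group
           COUNT(keyword_id) AS non_null_keywords   -- skips rows where keyword_id is NULL
    FROM key_pub
    GROUP BY publication_id
    ORDER BY publication_id
""").fetchall()
print(rows)
```

For publication 1, COUNT(*) sees three rows while COUNT(keyword_id) sees two. For publication 2 both return one.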