How can I implement an SO-like tagging system in PHP/MySQL?

Posted on 2024-08-06 16:50:57

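I'm coding a website in PHP/MySQL and I'd like to implement a tagging engine similar to Stack Overflow's. I have 3 relevant tables in the database: 1. Items 2. Tags 3. ItemTagMap (maps tags to items, n:n mapping)

Now, on the search page I'd like to show a distinct list of all tags for the entire search result (not just the current page), so that users can "refine" their search by adding or removing tags from that list.

The problem is that this is a pretty heavy query on the database, and there can be a large number of search requests that produce different result sets and therefore different tag sets.

Does anyone know how to implement this efficiently?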

情话墙 2024-08-13 16:50:57

Before we go into premature-optimization mode, it is worth looking at the following query template. If nothing else, it can serve as a baseline against which the effectiveness of possible optimizations can be measured.

SELECT T.TagId, TagInfo.TagName, COUNT(*)
FROM Items I
JOIN ItemTagMap T ON T.ItemId = I.ItemId
JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
-- JOIN ItemTagMap T1 ON T1.ItemId = I.ItemId
WHERE I.ItemId IN
  (
      SELECT ItemId
      FROM Items
      WHERE   -- Some typical initial search criteria
         Title LIKE 'Bug Report%'   -- Or some fulltext filter instead...
         AND  ItemDate > '2008-02-22'
         AND  Status = 'C'
  )
-- AND T1.TagId = 'MySql'
GROUP BY T.TagId, TagInfo.TagName
ORDER BY COUNT(*) DESC

The subquery is the "driving query", i.e. the one corresponding to the end user's initial criteria. (See below for how this query, which is needed more than once, may fit into an overall optimized flow.)
The JOIN on T1 (and possibly T2, T3, when several tags are selected) is commented out, as is the associated condition in the WHERE clause. These are needed when the user selects a particular tag, whether as part of the initial search or by refinement. (It may be more efficient to place these joins and WHERE conditions within the subquery; more on this below.)
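As a rough illustration of how the commented-out T1, T2, ... joins could be generated, here is a minimal PHP sketch. The helper name buildTagCountQuery is hypothetical, the driving criteria are reduced to a single Status filter, and an in-memory SQLite database (via PDO) stands in for MySQL so the example runs without a server; none of this comes from the original answer.

```php
<?php
// Hypothetical helper: builds the tag-count query, adding one ItemTagMap
// join (T1, T2, ...) per tag the user has already selected.
function buildTagCountQuery(array $selectedTagIds): array
{
    $joins = '';
    $params = [];
    foreach (array_values($selectedTagIds) as $i => $tagId) {
        $alias = 'T' . ($i + 1);
        $joins .= "\nJOIN ItemTagMap $alias ON $alias.ItemId = I.ItemId AND $alias.TagId = ?";
        $params[] = (int) $tagId;
    }
    $sql = "SELECT TagInfo.TagName, COUNT(*) AS ItemCount
FROM Items I
JOIN ItemTagMap T ON T.ItemId = I.ItemId
JOIN Tags TagInfo ON TagInfo.TagId = T.TagId$joins
WHERE I.Status = ?
GROUP BY T.TagId, TagInfo.TagName
ORDER BY ItemCount DESC, TagInfo.TagName";
    $params[] = 'C';   // stand-in for the user's initial search criteria
    return [$sql, $params];
}

// Assumption: in-memory SQLite replaces MySQL for the sake of a runnable demo.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("CREATE TABLE Items (ItemId INTEGER PRIMARY KEY, Status TEXT)");
$pdo->exec("CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, TagName TEXT)");
$pdo->exec("CREATE TABLE ItemTagMap (ItemId INTEGER, TagId INTEGER)");
$pdo->exec("INSERT INTO Items VALUES (1, 'C'), (2, 'C'), (3, 'C')");
$pdo->exec("INSERT INTO Tags VALUES (1, 'php'), (2, 'mysql'), (3, 'sql')");
$pdo->exec("INSERT INTO ItemTagMap VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 1)");

// The user has refined the search with tag 2 ('mysql').
[$sql, $params] = buildTagCountQuery([2]);
$stmt = $pdo->prepare($sql);
$stmt->execute($params);
$refined = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);   // TagName => ItemCount
```

Note that the selected tag itself still shows up in the counts; whether to hide it from the refinement list is an application decision.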

Discussion...
The "driving query", or a variation thereof, is needed for two distinct purposes:

1 to provide the complete list of ItemId values, which is needed to enumerate all associated tags.
2 to provide the first N ItemId values (N being the display page size), for looking up the Item details in the Items table.

Note that the complete list doesn't need to be sorted (or it may benefit from sorting in a different order), whereas the second list needs to be sorted according to the user's choice (say by date, descending, or by title, alphabetically ascending). Also note that if any sort order is required, the cost of the query implies dealing with the complete list: short of some unusual optimization by the SQL engine itself, and/or some denormalization, the engine needs to "see" the last records on that list, in case they belong at the top, sort-wise.

This latter fact favors using the very same query for both purposes; the resulting list can be stored in a temporary table. The general flow would be to quickly look up the top N Item records with their details and return them to the application at once. The application can then fetch the list of tags for refinement, Ajax-fashion. That list would be produced with a query akin to the one above, where the subquery is replaced by "SELECT ItemId FROM temporaryTable". The odds are good that the SQL optimizer will decide to sort this list (in some cases); let it do so, rather than second-guessing it and sorting explicitly.
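The flow described above might look roughly like the following PHP sketch. The table name SearchHits, the sample rows, and the page size are made up for illustration, and in-memory SQLite (via PDO) again stands in for MySQL so the example is runnable as-is.

```php
<?php
// Assumption: in-memory SQLite replaces MySQL for the sake of a runnable demo.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("CREATE TABLE Items (ItemId INTEGER PRIMARY KEY, Title TEXT, ItemDate TEXT, Status TEXT)");
$pdo->exec("CREATE TABLE Tags (TagId INTEGER PRIMARY KEY, TagName TEXT)");
$pdo->exec("CREATE TABLE ItemTagMap (ItemId INTEGER, TagId INTEGER)");
$pdo->exec("INSERT INTO Items VALUES
    (1, 'Bug Report: login',  '2008-03-01', 'C'),
    (2, 'Bug Report: search', '2008-04-01', 'C'),
    (3, 'Bug Report: old',    '2007-01-01', 'C'),
    (4, 'Feature: tags',      '2008-05-01', 'C')");
$pdo->exec("INSERT INTO Tags VALUES (1, 'php'), (2, 'mysql'), (3, 'ajax')");
$pdo->exec("INSERT INTO ItemTagMap VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 1), (4, 3)");

// Step 1: run the driving query once and keep only the matching ids.
$pdo->exec("CREATE TEMPORARY TABLE SearchHits AS
    SELECT ItemId FROM Items
    WHERE Title LIKE 'Bug Report%' AND ItemDate > '2008-02-22' AND Status = 'C'");

// Step 2: fetch the first page of items, sorted by the user's choice.
$pageSize = 1;
$stmt = $pdo->prepare("SELECT I.ItemId, I.Title
    FROM Items I
    JOIN SearchHits H ON H.ItemId = I.ItemId
    ORDER BY I.ItemDate DESC
    LIMIT ?");
$stmt->bindValue(1, $pageSize, PDO::PARAM_INT);
$stmt->execute();
$page = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Step 3: count tags over the whole result set, not just the current page.
$tagCounts = $pdo->query("SELECT TagInfo.TagName, COUNT(*) AS ItemCount
    FROM SearchHits H
    JOIN ItemTagMap T ON T.ItemId = H.ItemId
    JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
    GROUP BY T.TagId, TagInfo.TagName
    ORDER BY ItemCount DESC, TagInfo.TagName")->fetchAll(PDO::FETCH_KEY_PAIR);
```

One caveat with this sketch: a temporary table is visible only to the connection that created it, so if the tag list is fetched in a separate Ajax request, the ids would have to be kept somewhere that request can reach (for example an ordinary table keyed by a search id, or a cache). That part is not shown here.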

One other point to consider is bringing the join(s) on the ItemTagMap table inside the "driving query" rather than leaving them outside as shown above. It is probably best to do so, both for performance and because it produces the right list for purpose 2 (displaying a page of items).

The query/flow described above will likely scale rather well, even on relatively modest hardware; tentatively into the range of half a million or more Items, with sustained user searches of perhaps up to 10 per second. One of the key factors would be the selectivity of the initial search criteria.

Optimization ideas

[Depending on the typical search cases and on the data statistics] it may make sense to denormalize by bringing (indeed duplicating) some of the Items fields into the ItemTagMap table. Short fields in particular may be 'welcome' there.
As the data grows past a million Items, we could exploit the typically strong correlation between some tags (e.g. on SO, PHP often comes with MySql, by the way often for no good reason...) with various tricks. For example, introducing "multi-tag" TagIds could make the input logic a bit more complicated, but could also reduce the size of the map table significantly.

-- 'nough said! --
Appropriate architecture and optimizations should be selected in light of the actual requirements and of the actual statistical profile of the data...

恋你朝朝暮暮 2024-08-13 16:50:57

You'll want to try to minimize the number of DB calls, putting the heavy work into PHP.

First, select all your Items from the DB:

select * from items where (conditions);

Then, create an array of all ids from the result set.

$ids = array();
foreach ($items as $item) {
    $ids[] = (int) $item['id'];   // cast so the list is safe to embed in SQL
}
$ids = implode(',', $ids);

Then select all ItemTagMap rows and the associated tag data for the Item IDs you previously retrieved.

select map.item_id, t.id, t.name from tags t, item_tag_maps map where t.id = map.tag_id and map.item_id in ($ids);

Now, when you loop through your $items array, you can look up each item's tags in the result of the second SQL query by matching on item_id.
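The lookup in that last step can be sketched in plain PHP as follows. The helper names groupTagsByItem and countTags and the sample rows are hypothetical; the row shape (item_id, id, name) follows the second query above, and each (item, tag) pair is assumed to appear only once.

```php
<?php
// Hypothetical helper: turns the rows of the second query into
// [item_id => [tag name, ...]] so each item's tags are one array lookup away.
function groupTagsByItem(array $tagRows): array
{
    $tagsByItem = [];
    foreach ($tagRows as $row) {
        $tagsByItem[(int) $row['item_id']][] = $row['name'];
    }
    return $tagsByItem;
}

// Hypothetical helper: counts how many items carry each tag, most used first.
// This gives the distinct tag list for the whole result set.
function countTags(array $tagRows): array
{
    $counts = array_count_values(array_column($tagRows, 'name'));
    arsort($counts);
    return $counts;
}

// Sample data standing in for the two result sets.
$items = [
    ['id' => 1, 'title' => 'Login bug'],
    ['id' => 2, 'title' => 'Slow query'],
    ['id' => 3, 'title' => 'Untagged item'],
];
$tagRows = [
    ['item_id' => 1, 'id' => 10, 'name' => 'php'],
    ['item_id' => 1, 'id' => 11, 'name' => 'mysql'],
    ['item_id' => 2, 'id' => 11, 'name' => 'mysql'],
];

$tagsByItem = groupTagsByItem($tagRows);
$lines = [];
foreach ($items as $item) {
    $tags = $tagsByItem[$item['id']] ?? [];   // items without tags get an empty list
    $lines[] = $item['title'] . ': ' . implode(', ', $tags);
}
$refineList = countTags($tagRows);
```

Grouping once up front avoids rescanning the tag rows for every item in the loop.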

爱你是孤单的心事 2024-08-13 16:50:57

Assuming:

Item (id);
Tag (id, name) with index on name;
ItemTag (item_id, tag_id).

then:

SELECT t.name
FROM Tag t
WHERE EXISTS (SELECT 1 FROM ItemTag it WHERE it.tag_id = t.id AND it.item_id = 1234)
ORDER BY t.name

Nothing intensive about that. This is similar, but my guess is that it would be slower:

SELECT t.name
FROM Tag t
WHERE t.id IN (SELECT tag_id FROM ItemTag WHERE item_id = 1234)
ORDER BY t.name

This can be done as a join as well:

SELECT DISTINCT t.name
FROM Tag t
JOIN ItemTag i ON i.tag_id = t.id
WHERE i.item_id = 1234
ORDER BY t.name

I think the first one will be faster, but as is always the case with SQL, it's worth testing (on a sufficiently large data set).

The queries above list the tags for a single item. You want a composite set of tags for the search results. That's not difficult to derive from the above, but it depends on how you get your search results.
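One possible way to extend the single-item query to a whole result set, assuming the search has already produced a list of item ids, is sketched below in PHP. The function name fetchTagsForItems and the sample data are hypothetical, and in-memory SQLite (via PDO) stands in for MySQL so the example runs without a server.

```php
<?php
// Hypothetical helper: returns the distinct tag names used by any of the
// given items, using one placeholder per id so the values are bound, not
// concatenated into the SQL.
function fetchTagsForItems(PDO $pdo, array $itemIds): array
{
    if ($itemIds === []) {
        return [];   // "IN ()" is not valid SQL, so short-circuit
    }
    $placeholders = implode(',', array_fill(0, count($itemIds), '?'));
    $stmt = $pdo->prepare(
        "SELECT DISTINCT t.name
         FROM Tag t
         JOIN ItemTag it ON it.tag_id = t.id
         WHERE it.item_id IN ($placeholders)
         ORDER BY t.name"
    );
    $stmt->execute(array_values($itemIds));
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Assumption: in-memory SQLite replaces MySQL for the sake of a runnable demo.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec("CREATE TABLE Tag (id INTEGER PRIMARY KEY, name TEXT)");
$pdo->exec("CREATE TABLE ItemTag (item_id INTEGER, tag_id INTEGER)");
$pdo->exec("INSERT INTO Tag VALUES (1, 'ajax'), (2, 'mysql'), (3, 'php')");
$pdo->exec("INSERT INTO ItemTag VALUES (1234, 3), (1234, 2), (1235, 2), (1236, 1)");

$tags = fetchTagsForItems($pdo, [1234, 1235]);
```

For very large result sets an id list of this kind becomes unwieldy, and joining against the search query itself (or a table holding its ids) is likely the better fit.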
