在标记博客文章时我真的应该使用关系表吗?
在尝试弄清楚如何使用单个 sql 语句标记博客文章此处,我想到了以下想法:使用关系表 tag2post 通过 id 引用标签,如下所示是没有必要的:
tags
+-------+-----------+
| tagid | tag |
+-------+-----------+
| 1 | news |
| 2 | top-story |
+-------+-----------+
tag2post
+----+--------+-------+
| id | postid | tagid |
+----+--------+-------+
| 0 | 322 | 1 |
+----+--------+-------+
为什么不只需使用以下模型,在其中索引标签本身如下? 考虑到标签永远不会重命名,而是添加和删除,这可能是有道理的,对吧? 你怎么认为?
tag2post
+----+--------+-------+
| id | postid | tag |
+----+--------+-------+
| 1 | 322 | sun |
+----+--------+-------+
| 2 | 322 | moon |
+----+--------+-------+
| 3 | 4443 | sun |
+----+--------+-------+
| 4 | 2567 | love |
+----+--------+-------+
PS:我保留了一个id,我是为了轻松显示最后添加的n个标签...
while trying to figure out how to tag a blog post with a single sql statement here, the following thought crossed my mind: using a relation table tag2post that references tags by id as follows just isn't necessary:
tags
+-------+-----------+
| tagid | tag |
+-------+-----------+
| 1 | news |
| 2 | top-story |
+-------+-----------+
tag2post
+----+--------+-------+
| id | postid | tagid |
+----+--------+-------+
| 0 | 322 | 1 |
+----+--------+-------+
why not just using the following model, where you index the tag itself as follows? taken that tags are never renamed, but added and removed, this could make sense, right? what do you think?
tag2post
+----+--------+-------+
| id | postid | tag |
+----+--------+-------+
| 1 | 322 | sun |
+----+--------+-------+
| 2 | 322 | moon |
+----+--------+-------+
| 3 | 4443 | sun |
+----+--------+-------+
| 4 | 2567 | love |
+----+--------+-------+
PS: i keep an id, i order to easily display the last n tags added...
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
它有效,但没有标准化,因为标签中有冗余。 您还无法使用“相同”标签来标记帖子以外的内容。 对于小N来说,优化并不重要,所以我用它跑就没问题。
实际上,您的索引会更大(假设您要在标签上建立索引以进行搜索,您现在正在对重复项和字符串进行索引)。 在规范化版本中,tags表上的索引会更小,不会有重复,tag2post表上tagid上的索引也会更小。 此外,固定大小的 int 列对于索引非常有效,并且根据您的集群选择,您还可以避免一些碎片。
我知道您说过不重命名,但一般来说,在这两种情况下,您可能仍然需要考虑重命名(甚至删除)标签意味着什么的语义 - 是否所有条目都需要更改,或者标签是否需要更改以某种方式分裂。 因为这是最坏情况下事务中的批量操作(所有 tag2post 都必须重命名),所以从设计角度来看,我并没有真正将其归类为重要的。
It works, but it is not normalized, because you have redundancy in the tags. You also lose the ability to use the "same" tags to tag things besides posts. For small N, optimization doesn't matter, so I have no problems if you run with it.
As a practical matter, your indexes will be larger (assuming you are going to index on tag for searching, you are now indexing duplicates and indexing strings). In the normalized version, the index on the tags table will be smaller, will not have duplicates, and the index on the tag2post table on tagid will be smaller. In addition, the fixed size int columns are very efficient for indexing and you might also avoid some fragmentation depending on your clustering choices.
I know you said no renaming, but in general, in both cases, you might still need to think about the semantics of what it means to rename (or even delete) a tag - do all entries need to be changed, or does the tag get split in some way. Because this is a batch operation in a transaction in the worst case (all the tag2post have to be renamed), I don't really classify it as significant from a design point of view.
这对我来说听起来不错,当您有不同的内容(例如用户名或其他内容)时,使用 ID 来引用委托给另一个表的内容是有意义的,因为您不想在数据库中的每个位置更改它的名称当他改变它的时候。 然而,在这种情况下,标签名称本身不会变化,因此我看到的唯一潜在缺点是文本索引的搜索速度可能比数字索引稍慢。
This sounds fine to me, using an ID to reference something that you delegated into another table makes sense when you have things that vary, say a user's name or whatever, because you don't want to change it's name in every place in your database when he changes it. However in this case the tag names themselves will not vary, so the only potential downside I see is that a text index might be slightly slower than a numeric index to search through.
与包含 ID 的关系表相比,您的提案的真正优势在哪里?
从技术上讲,它们解决了相同的问题,但您提出的解决方案以冗余的、非规范化的方式实现,这似乎只能满足能够直接从关系表读取数据的本能冲动。
数据库服务器非常擅长连接表,如果连接是通过带有索引的 INT 字段进行的,则更是如此。 我认为当您将另一个表(例如:
INT id, VARCHAR(50) TagName
)加入查询时,您不会遇到毁灭性的性能问题。但是您失去了轻松重命名标签的能力(即使您不打算这样做),并且不必要地用冗余数据膨胀了关系表。 随着时间的推移,这可能会比标准化解决方案花费更多的性能。
Where is the real advantage of your proposal over a relation table containing IDs?
Technically they solve the same problem, but your proposed solution does it in a redundant, de-normalized way that only seems to satisfy the instinctive urge to be able to read the data directly from the relation table.
The DB server is pretty good at joining tables, and even more so if the join is over an INT field with an index on it. I don't think you will be facing devastating performance issues when you join another table (like:
INT id, VARCHAR(50) TagName
) to your query.But you lose the ability to easily rename a tag (even if you don't plan on doing so), and you needlessly inflate your relation table with redundant data. Over time, this may cost you more performance than the normalized solution.
根据您的应用程序,非标准化方法可能会很好。
您可能会发现,由于搜索大量 VARCHAR 数据,它会导致性能下降。
当搜索标记为“sun*”(例如太阳、阳光明媚、日出)的内容时
您不需要进行连接。 但是,您需要对更大的 VARCHAR 数据集进行类似比较。 正确的索引可以缓解这个问题,但只有测试才能告诉您哪种方法对您的数据集更快。
您还可以选择添加预连接规范化表的 VIEW。 这为您提供了更简单的查询,同时仍然允许您拥有高度规范化的数据。
我的建议是采用规范化结构(并添加非规范化视图,以方便使用),直到遇到非规范化数据模式修复的问题。
The de-normalised method may be fine depending on your application.
You may find that it causes a performance hit due to searching a large set of VARCHAR data.
When doing a search for things tagged like "sun*" (e.g. sun, sunny, sunrise)
you will not need to do a join. However, you will need to do a like comparison on a MUCH larger set of VARCHAR data. Proper indexing may alleviate this issue but only testing will tell you which method is faster with your dataset.
You also have the option of adding a VIEW that pre-joins the normalised tables. This gives you simpler queries while still allowing you to have highly normalised data.
My recommendation is to go with a normalised structure (and add de-normalised views a necessary for ease of use) until you encounter an issue that de-normalising the data schema fixes.
我也在考虑这一点。 想要数据库中的标签列表,只需从 tag2post 中选择不同的标签即可。 有人告诉我,由于我想优化 select 语句,因此最好使用整数键,因为它比使用字符串快得多。
I was considering that too. Want a list of tags in the database, just select distinct tag from tag2post. I was told that since I wanted to optimize for select statements, it would be better to use an integer key because it was much faster than using a string.