DB 记录属性的整数与字符与 Wordpress 架构
我之前问过类似的问题(integer-vs-char-for-db-record -property),但偶然发现了一些与我在上一篇文章中收到的所有建议相悖的东西。在最流行、最成熟的开源博客脚本 Wordpress 3 中,帖子状态在数据库中存储为 VARCHAR(20)
- 'publish'、'auto-draft'、'inherit'、'pending'等,而不是作为具有查找表或映射字符串常量的 INT
,或 CHAR
或类似的东西。这也适用于字段 post_type
('post'、'attachment'、'revision' 等)和其他一些字段。 因此,要查找所有已发布的帖子,我需要运行类似 SELECT * FROM posts WHERE post_status = 'published' AND post_type = 'post'
的命令。此外,post_status、post_type 和其他一些列上有一个多列索引,这肯定会加快此类搜索的速度。有人可以解释为什么他们这样做而不是其他,以及这种方法的优点和缺点是什么?
I asked a similar question before (integer-vs-char-for-db-record-property) but stumbled upon something that goes against all recommendations that I received in my previous post. In Wordpress 3, the most popular and mature open source blog script, post status is stored as VARCHAR(20)
in db - 'publish', 'auto-draft', 'inherit', 'pending', etc. and not as INT
with lookup table or mapped string constants, or CHAR
, or anything like that. This also applies to the field post_type
('post', 'attachment', 'revision', etc.) and some other fields.
So to find all published posts I need to run something like SELECT * FROM posts WHERE post_status = 'published' AND post_type = 'post'
. Also, there is a multiple column index on post_status, post_type and some other columns which certainly speeds up this kind of search. Can someone explain why they made it this way and not the other, and what are the benefits and drawbacks of this approach?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
仅仅因为某些应用程序众所周知并不意味着它们具有良好的数据库设计。这往往违反规范化规则。也许他们获得了更好的表现,也许他们在选择这个时没有考虑其他可能性,因为他们不知道更好。也许他们是应用程序程序员,在没有很好地理解数据库理论的情况下设计数据库,或者可能是故意用性能统计数据来支持它的去规范化。或者,当我们决定将值从“已发布”更改为其他内容时,他们可能没有想到必须更新 1 亿条记录的可能性。也许他们只测试了选择的性能,而不测试更新的性能。也许这些值本质上不太可能改变,所以非规范化并不是什么大问题。我们无法从这里知道。
Just because some application is well known doesn't mean they had a good database design. This tends to violate the normalization rules. Maybe they get better performance and maybe they didn't look at the other possibilities when they chose this one becasue they didn;t know better. Maybe they were aplication programmers designing a database without understanding database theory very well or maybe it was a deliberate denormailzation with performance stats to back it up. Or maybe they didn't think the chances of having to update 100 million records when we decided we wanted to change the value from 'published' to something else. Maybe they only tested performance on selects but not on updates. Maybe the values genuniely are unlikey to change, so it's not such a big deal to denormalize. We can't know from here.
规范化并不是用数字替换字符串或仅仅因为它们具有相同的字母而“共享”字符串。
我不知道他们的设计,但以下场景是完全规范化的,即使它使用字符串作为标识符。
与代理键相比,使用自然键的主要好处是,它减少了所需的联接数量,并且还减少了仅从索引即可回答整个类查询的可能性。主要缺点是增加了存储空间,并且如果我们需要更改值,则可能会遇到麻烦。
Normalization is not about replacing strings with numbers or "sharing" strings just because they have the same letters.
I don't know their design, but the following scenario is perfectly normalized even though it uses strings as identifiers.
The main benefit of using natural keys over surrogate keys is that it reduces the number of joins needed and also the likelyhood that entire classes of queries can be answered from index only. The main drawbacks are increased storage and the possibility of having a hell if we need to change the values.
我猜想 WP 开发人员只是避免了他们认为过早的优化,而是选择了更好的可读性。
比
当新的 WP 开发人员运行
select * from ...
查询时,数据库表列出“已发布”而不是 3 或 5,这更容易理解并调试。从磁盘存储空间的角度来看,我认为这两种方法都相当不错 - 与博客文章文本相比,更多的
post_status
字节应该没有多大影响。其他栏目。一个整数是 8 个字节(除非它是一个 tinyint),而“已发布”可能是 10 个字节,所以并不重要?I'd guess that the WP devs simply avoided what they felt was premature optimization, and opted for better readability instead.
is a little little bit easier to read than
And when a new WP developer runs a
select * from ...
query, the database table lists 'published' rather than 3 or 5, which is easier to understand and debug.From the point of view of disk storage space, either approach is fairly okay I think — some more
post_status
bytes shouldn't matter much in comparison to the blog post text and all other columns. An integer is 8 bytes (well unless it's a tinyint) and 'published' is perhaps 10 bytes so doesn't matter much?