Database - Designing an "events" table
After reading the tips from this great Nettuts+ article, I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads, and at the same time lower the number of tables needed in the whole database schema. However, I'm not sure this is a good idea, since it doesn't follow the rules of normalization, and I would like to hear your advice. Here is the general idea:
I have four types of users modeled in a Class Table Inheritance structure. In the main "user" table I store data common to all the users (id, username, password, several flags, ...) along with some TIMESTAMP fields (date_created, date_updated, date_activated, date_lastLogin, ...).
To quote tip #16 from the Nettuts+ article mentioned above:

Example 2: You have a "last_login" field in your table. It updates every time a user logs in to the website. But every update on a table causes the query cache for that table to be flushed. You can put that field into another table to keep updates to your users table to a minimum.
Now it gets even trickier: I need to keep track of some user statistics, like
- how many unique times a user profile was seen
- how many unique times an ad from a specific type of user was clicked
- how many unique times a post from a specific type of user was seen
- and so on...
In my fully normalized database this adds up to about 8 to 10 additional tables. It's not a lot, but I would like to keep things simple if I could, so I've come up with the following "events" table:
|------|----------------|----------------|---------------------|-----------|
| ID | TABLE | EVENT | DATE | IP |
|------|----------------|----------------|---------------------|-----------|
| 1 | user | login | 2010-04-19 00:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 1 | user | login | 2010-04-19 02:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | created | 2010-04-19 00:31:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | activated | 2010-04-19 02:34:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | approved | 2010-04-19 09:30:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | login | 2010-04-19 12:00:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | created | 2010-04-19 12:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | impressed | 2010-04-19 12:31:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:01 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:02 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:03 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:04 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:05 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | blocked | 2010-04-20 03:19:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | deleted | 2010-04-20 03:20:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
Basically, the ID refers to the primary key (id) field in the TABLE table; I believe the rest should be pretty straightforward. One thing that I've come to like in this design is that I can keep track of all the user logins instead of just the last one, and thus generate some interesting metrics with that data.
Due to the growing nature of the events table, I also thought of making some optimizations, such as:
- #9: Since there is only a finite number of tables and a finite (and predetermined) number of events, the TABLE and EVENT columns could be set up as ENUMs instead of VARCHARs to save some space.
- #14: Store IPs as UNSIGNED INTs with INET_ATON() instead of VARCHARs.
- Store DATEs as TIMESTAMPs instead of DATETIMEs.
- Use the ARCHIVE (or the CSV?) engine instead of InnoDB/MyISAM. Only INSERTs and SELECTs are supported, and data is compressed on the fly.
Overall, each event would only consume 14 (uncompressed) bytes which is okay for my traffic I guess.
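For what it's worth, the 14 bytes break down as 4 (INT id) + 1 (ENUM table) + 1 (ENUM event) + 4 (TIMESTAMP) + 4 (UNSIGNED INT ip). A sketch with those optimizations applied (the ENUM value lists are illustrative, not exhaustive):

```sql
CREATE TABLE events (
  id      INT UNSIGNED NOT NULL,                            -- 4 bytes
  `table` ENUM('user','user_ads','user_posts') NOT NULL,    -- 1 byte (up to 255 values)
  event   ENUM('created','activated','approved','login',
               'impressed','clicked','blocked','deleted') NOT NULL,  -- 1 byte
  `date`  TIMESTAMP NOT NULL,                               -- 4 bytes
  ip      INT UNSIGNED NOT NULL                             -- 4 bytes
) ENGINE=ARCHIVE;

-- IPs go in through INET_ATON() and come back out through INET_NTOA():
INSERT INTO events VALUES (1, 'user', 'login', NOW(), INET_ATON('127.0.0.1'));
SELECT INET_NTOA(ip) FROM events WHERE event = 'login';
```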
Pros:
- Ability to store more detailed data (such as logins).
- No need to design (and code for) almost a dozen additional tables (dates and statistics).
- Reduces a few columns per table and keeps volatile data separated.
Cons:
- Non-relational (still not as bad as EAV):
SELECT * FROM events WHERE id = 2 AND `table` = 'user' ORDER BY `date` DESC;
- 6 bytes overhead per event (ID, TABLE and EVENT).
I'm more inclined to go with this approach since the pros seem to far outweigh the cons, but I'm still a little bit reluctant... Am I missing something? What are your thoughts on this?
Thanks!
@coolgeek:
One thing that I do slightly
differently is to maintain an
entity_type table, and use its ID in
the object_type column (in your case,
the 'TABLE' column). You would want to
do the same thing with an event_type
table.
Just to be clear, you mean I should add an additional table that maps which events are allowed in a table, and use the PK of that table in the events table instead of having a TABLE / EVENT pair?
@ben:
These are all statistics derived from
existing data, aren't they?
The additional tables are mostly related to statistics, but the data doesn't already exist; some examples:
user_ad_stats user_post_stats
------------- ---------------
user_ad_id (FK) user_post_id (FK)
ip ip
date date
type (impressed, clicked)
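For reference, one of those tables might be declared as follows (a sketch; the column types are my assumption):

```sql
-- One of the fully-normalized statistics tables the events table would replace.
CREATE TABLE user_ad_stats (
  user_ad_id INT UNSIGNED NOT NULL,
  ip         INT UNSIGNED NOT NULL,
  `date`     TIMESTAMP NOT NULL,
  type       ENUM('impressed','clicked') NOT NULL,
  FOREIGN KEY (user_ad_id) REFERENCES user_ads (id)  -- enforceable in this design
) ENGINE=InnoDB;
```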
If I drop these tables I have no way to keep track of who, what or when; I'm not sure how views can help here.
I agree that it ought to be separate,
but more because it's fundamentally
different data. What someone is and
what someone does are two different
things. I don't think volatility is so
important.
I've heard it both ways and I couldn't find anything in the MySQL manual that states that either one is right. Anyway, I agree with you that they should be separate tables because they represent different kinds of data (with the added benefit of being more descriptive than a regular approach).
I think you're missing the forest for the trees, so to speak.

The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues.
What I meant by "not as bad as EAV" is that all records follow a linear structure and they are pretty easy to query; there is no hierarchical structure, so all queries can be done with a simple SELECT.
Regarding your second statement, I think you understood me wrong here; the IP address is not necessarily associated with the user. The table structure should read something like this:

IP address (IP) did something (EVENT) to the PK (ID) of the table (TABLE) on date (DATE).
For instance, the last row of my example above should read: IP 217.0.0.1 (some admin) deleted user #2 (whose last known IP is 127.0.0.2) at 2010-04-20 03:20:00.
You can still join, say, user events
to users, but you can't implement a
foreign key constraint.
Indeed, that's my main concern. However, I'm not totally sure what can go wrong with this design that couldn't go wrong with a traditional relational design. I can spot some caveats, but as long as the app messing with the database knows what it is doing, I guess there shouldn't be any problems.
One other thing that counts in this argument is that I will be storing many more events, and each event will more than double in size compared to the original design; it makes perfect sense to use the ARCHIVE storage engine here. The only thing is that it doesn't support FKs (nor UPDATEs or DELETEs).
I highly recommend this approach. Since you're presumably using the same database for OLTP and OLAP, you can gain significant performance benefits by adding in some stars and snowflakes.
I have a social networking app that is currently at 65 tables. I maintain a single table to track object (blog/post, forum/thread, gallery/album/image, etc) views, another for object recommends, and a third table to summarize insert/update activity in a dozen other tables.
One thing that I do slightly differently is to maintain an entity_type table, and use its ID in the object_type column (in your case, the 'TABLE' column). You would want to do the same thing with an event_type table.
Clarifying for Alix - Yes, you maintain a reference table for objects, and a reference table for events (these would be your dimension tables). Your fact table would have the following fields:
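The field list itself seems to have been lost from the answer above; based on the surrounding description, the star layout would presumably look something like this (all names hypothetical):

```sql
-- Dimension tables (reference tables for objects and events):
CREATE TABLE entity_type (
  id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL          -- 'user', 'user_ads', ...
);

CREATE TABLE event_type (
  id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL          -- 'login', 'clicked', ...
);

-- Fact table: surrogate IDs replace the ENUM/VARCHAR pair.
CREATE TABLE events (
  object_id      INT UNSIGNED NOT NULL,      -- PK value in the source table
  object_type_id TINYINT UNSIGNED NOT NULL,  -- references entity_type.id
  event_type_id  TINYINT UNSIGNED NOT NULL,  -- references event_type.id
  `date`         TIMESTAMP NOT NULL,
  ip             INT UNSIGNED NOT NULL
);
```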
It looks like a pretty reasonable design, so I just wanted to challenge a few of your assumptions to make sure you had concrete reasons for what you're doing.
These are all statistics derived from existing data, aren't they? (Update: okay, they're not, so disregard the following.) Why wouldn't these simply be views, or even materialized views?
It may seem like a slow operation to gather those statistics, however:
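To illustrate the view idea (assuming the raw data lived in a table like the user_ad_stats layout shown in the question; MySQL has no materialized views, so this is a plain view):

```sql
-- Per-ad unique click counts computed on demand from existing raw data.
CREATE VIEW user_ad_click_stats AS
SELECT user_ad_id,
       COUNT(DISTINCT ip) AS unique_clicks
FROM user_ad_stats
WHERE type = 'clicked'
GROUP BY user_ad_id;
```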
I guess you're talking about how the user events (just to pick one table), which would be pretty volatile, are separated from the user data. I agree that it ought to be separate, but more because it's fundamentally different data. What someone is and what someone does are two different things.
I don't think volatility is so important. The DBMS should already allow you to put the log file and database file on separate devices, which accomplishes the same thing, and contention shouldn't be an issue with row-level locking.
I think you're missing the forest for the trees, so to speak.
The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues. (Update: Okay, so it's sort of kinda like that.)
You can still join, say, user events to users, but you can't implement a foreign key constraint. That's why EAV is generally problematic; whether or not something is exactly EAV doesn't really matter. It's generally one or two lines of code to implement a constraint in your schema, but in your app it could be dozens of lines of code, and if the same data is accessed in multiple places by multiple apps, it can easily multiply to thousands of lines of code. So, generally, if you can prevent bad data with a foreign key constraint, you're guaranteed that no app will do that.
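To make that concrete: in a conventional per-entity design the DBMS can enforce the reference, whereas in the generic events table it cannot, because events.id points into a different table depending on the TABLE column. A sketch:

```sql
-- Enforceable in a per-entity design:
CREATE TABLE user_events (
  user_id INT UNSIGNED NOT NULL,
  event   ENUM('created','activated','login') NOT NULL,
  `date`  TIMESTAMP NOT NULL,
  FOREIGN KEY (user_id) REFERENCES user (id)  -- bad rows are rejected by the DBMS
) ENGINE=InnoDB;

-- Not expressible in the generic design: no FOREIGN KEY clause can say
-- "id references user.id when `table` = 'user', but user_ads.id when
-- `table` = 'user_ads'".
```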
You might think that events aren't so important, but, as an example, ad impressions are money. I would definitely want to catch any bugs relating to ad impressions as early in the design process as possible.
Further comment
And with some caveats you can make a very successful system. With a proper system of constraints, you get to say, "if any app messing with the database doesn't know what it's doing, the DBMS will flag an error." That may require more time and money than you've got, so something simpler that you can have is probably better than something more perfect that you can't. C'est la vie.
I can't add a comment to Ben's answer, so two things...
First, it would be one thing to use views in a standalone OLAP/DSS database; it's quite another to use them in your transaction database. The High Performance MySQL people recommend against using views where performance matters.

WRT data integrity, I agree, and that's another advantage of using a star or snowflake with 'events' as the central fact table (as well as using multiple event tables, like I do). But you cannot design a referential integrity scheme around IP addresses.