Database - Designing an "events" table
After reading the tips from this great Nettuts+ article, I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads, and at the same time lower the number of tables needed in the whole database schema. However, I'm not sure this is a good idea, since it doesn't follow the rules of normalization, and I would like to hear your advice. Here is the general idea:
I have four types of users modeled in a Class Table Inheritance structure. In the main "user" table I store data common to all the users (id, username, password, several flags, ...) along with some TIMESTAMP fields (date_created, date_updated, date_activated, date_lastLogin, ...).
To quote tip #16 from the Nettuts+ article mentioned above:

Example 2: You have a "last_login" field in your table. It updates every time a user logs in to the website. But every update on a table causes the query cache for that table to be flushed. You can put that field into another table to keep updates to your users table to a minimum.
Now it gets even trickier: I need to keep track of some user statistics, like
- how many unique times a user profile was seen
- how many unique times an ad from a specific type of user was clicked
- how many unique times a post from a specific type of user was seen
- and so on...
In my fully normalized database this adds up to about 8 to 10 additional tables. It's not a lot, but I would like to keep things simple if I could, so I've come up with the following "events" table:
|------|----------------|----------------|---------------------|-----------|
| ID | TABLE | EVENT | DATE | IP |
|------|----------------|----------------|---------------------|-----------|
| 1 | user | login | 2010-04-19 00:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 1 | user | login | 2010-04-19 02:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | created | 2010-04-19 00:31:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | activated | 2010-04-19 02:34:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | approved | 2010-04-19 09:30:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | login | 2010-04-19 12:00:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | created | 2010-04-19 12:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | impressed | 2010-04-19 12:31:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:01 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:02 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:03 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:04 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:05 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | blocked | 2010-04-20 03:19:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | deleted | 2010-04-20 03:20:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
Basically, the ID refers to the primary key (id) field in the TABLE table; I believe the rest should be pretty straightforward. One thing that I've come to like in this design is that I can keep track of all the user logins instead of just the last one, and thus generate some interesting metrics with that data.
Due to the growing nature of the events table, I also thought of making some optimizations, such as:
- #9: Since there is only a finite number of tables and a finite (and predetermined) number of events, the TABLE and EVENT columns could be set up as ENUMs instead of VARCHARs to save some space.
- #14: Store IPs as UNSIGNED INTs with INET_ATON() instead of VARCHARs.
- Store DATEs as TIMESTAMPs instead of DATETIMEs.
- Use the ARCHIVE (or the CSV?) engine instead of InnoDB/MyISAM. Only INSERTs and SELECTs are supported, and data is compressed on the fly.
Overall, each event would only consume 14 (uncompressed) bytes which is okay for my traffic I guess.
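For what it's worth, the 14 bytes break down as 4 (INT id) + 1 (ENUM table) + 1 (ENUM event) + 4 (TIMESTAMP) + 4 (UNSIGNED INT ip). A sketch with those optimizations applied (the ENUM value lists are illustrative, not exhaustive):

```sql
CREATE TABLE events (
  id      INT UNSIGNED NOT NULL,                            -- 4 bytes
  `table` ENUM('user','user_ads','user_posts') NOT NULL,    -- 1 byte (up to 255 values)
  event   ENUM('created','activated','approved','login',
               'impressed','clicked','blocked','deleted') NOT NULL,  -- 1 byte
  `date`  TIMESTAMP NOT NULL,                               -- 4 bytes
  ip      INT UNSIGNED NOT NULL                             -- 4 bytes
) ENGINE=ARCHIVE;

-- IPs go in through INET_ATON() and come back out through INET_NTOA():
INSERT INTO events VALUES (1, 'user', 'login', NOW(), INET_ATON('127.0.0.1'));
SELECT INET_NTOA(ip) FROM events WHERE event = 'login';
```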
Pros:
- Ability to store more detailed data (such as logins).
- No need to design (and code for) almost a dozen additional tables (dates and statistics).
- Reduces a few columns per table and keeps volatile data separated.
Cons:
- Non-relational (still not as bad as EAV):
SELECT * FROM events WHERE id = 2 AND `table` = 'user' ORDER BY `date` DESC;
- 6 bytes overhead per event (ID, TABLE and EVENT).
I'm more inclined to go with this approach since the pros seem to far outweigh the cons, but I'm still a little bit reluctant... Am I missing something? What are your thoughts on this?
Thanks!
@coolgeek:
One thing that I do slightly
differently is to maintain an
entity_type table, and use its ID in
the object_type column (in your case,
the 'TABLE' column). You would want to
do the same thing with an event_type
table.
Just to be clear, you mean I should add an additional table that maps which events are allowed in a table, and use the PK of that table in the events table instead of having a TABLE / EVENT pair?
@ben:
These are all statistics derived from
existing data, aren't they?
The additional tables are mostly related to statistics, but the data doesn't already exist; some examples:
user_ad_stats user_post_stats
------------- ---------------
user_ad_id (FK) user_post_id (FK)
ip ip
date date
type (impressed, clicked)
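For reference, one of those tables might be declared as follows (a sketch; the column types are my assumption):

```sql
-- One of the fully-normalized statistics tables the events table would replace.
CREATE TABLE user_ad_stats (
  user_ad_id INT UNSIGNED NOT NULL,
  ip         INT UNSIGNED NOT NULL,
  `date`     TIMESTAMP NOT NULL,
  type       ENUM('impressed','clicked') NOT NULL,
  FOREIGN KEY (user_ad_id) REFERENCES user_ads (id)  -- enforceable in this design
) ENGINE=InnoDB;
```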
If I drop these tables I have no way to keep track of who, what or when; I'm not sure how views can help here.
I agree that it ought to be separate,
but more because it's fundamentally
different data. What someone is and
what someone does are two different
things. I don't think volatility is so
important.
I've heard it both ways and I couldn't find anything in the MySQL manual that states that either one is right. Anyway, I agree with you that they should be separate tables because they represent different kinds of data (with the added benefit of being more descriptive than a regular approach).
I think you're missing the forest for the trees, so to speak.

The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues.
What I meant by "not as bad as EAV" is that all records follow a linear structure and they are pretty easy to query; there is no hierarchical structure, so all queries can be done with a simple SELECT.
Regarding your second statement, I think you understood me wrong here; the IP address is not necessarily associated with the user. The table structure should read something like this:

IP address (IP) did something (EVENT) to the PK (ID) of the table (TABLE) on date (DATE).
For instance, the last row of my example above should read: IP 217.0.0.1 (some admin) deleted user #2 (whose last known IP is 127.0.0.2) at 2010-04-20 03:20:00.
You can still join, say, user events
to users, but you can't implement a
foreign key constraint.
Indeed, that's my main concern. However, I'm not totally sure what can go wrong with this design that couldn't go wrong with a traditional relational design. I can spot some caveats, but as long as the app messing with the database knows what it is doing, I guess there shouldn't be any problems.
One other thing that counts in this argument is that I will be storing many more events, and each event will more than double in size compared to the original design; it makes perfect sense to use the ARCHIVE storage engine here. The only thing is that it doesn't support FKs (nor UPDATEs or DELETEs).
I highly recommend this approach. Since you're presumably using the same database for OLTP and OLAP, you can gain significant performance benefits by adding in some stars and snowflakes.
I have a social networking app that is currently at 65 tables. I maintain a single table to track object (blog/post, forum/thread, gallery/album/image, etc) views, another for object recommends, and a third table to summarize insert/update activity in a dozen other tables.
One thing that I do slightly differently is to maintain an entity_type table, and use its ID in the object_type column (in your case, the 'TABLE' column). You would want to do the same thing with an event_type table.
Clarifying for Alix - Yes, you maintain a reference table for objects, and a reference table for events (these would be your dimension tables). Your fact table would have the following fields:
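The field list itself seems to have been lost from the answer above; based on the surrounding description, the star layout would presumably look something like this (all names hypothetical):

```sql
-- Dimension tables (reference tables for objects and events):
CREATE TABLE entity_type (
  id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL          -- 'user', 'user_ads', ...
);

CREATE TABLE event_type (
  id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL          -- 'login', 'clicked', ...
);

-- Fact table: surrogate IDs replace the ENUM/VARCHAR pair.
CREATE TABLE events (
  object_id      INT UNSIGNED NOT NULL,      -- PK value in the source table
  object_type_id TINYINT UNSIGNED NOT NULL,  -- references entity_type.id
  event_type_id  TINYINT UNSIGNED NOT NULL,  -- references event_type.id
  `date`         TIMESTAMP NOT NULL,
  ip             INT UNSIGNED NOT NULL
);
```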
It looks like a pretty reasonable design, so I just wanted to challenge a few of your assumptions to make sure you had concrete reasons for what you're doing.
These are all statistics derived from existing data, aren't they? (Update: okay, they're not, so disregard the following.) Why wouldn't these simply be views, or even materialized views?
It may seem like a slow operation to gather those statistics, however:
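To illustrate the view idea (assuming the raw data lived in a table like the user_ad_stats layout shown in the question; MySQL has no materialized views, so this is a plain view):

```sql
-- Per-ad unique click counts computed on demand from existing raw data.
CREATE VIEW user_ad_click_stats AS
SELECT user_ad_id,
       COUNT(DISTINCT ip) AS unique_clicks
FROM user_ad_stats
WHERE type = 'clicked'
GROUP BY user_ad_id;
```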
I guess you're talking about how the user events (just to pick one table), which would be pretty volatile, are separated from the user data. I agree that it ought to be separate, but more because it's fundamentally different data. What someone is and what someone does are two different things.
I don't think volatility is so important. The DBMS should already allow you to put the log file and database file on separate devices, which accomplishes the same thing, and contention shouldn't be an issue with row-level locking.
I think you're missing the forest for the trees, so to speak.
The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues. (Update: Okay, so it's sort of kinda like that.)
You can still join, say, user events to users, but you can't implement a foreign key constraint. That's why EAV is generally problematic; whether or not something is exactly EAV doesn't really matter. It's generally one or two lines of code to implement a constraint in your schema, but in your app it could be dozens of lines of code, and if the same data is accessed in multiple places by multiple apps, it can easily multiply to thousands of lines of code. So, generally, if you can prevent bad data with a foreign key constraint, you're guaranteed that no app will do that.
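To make that concrete: in a conventional per-entity design the DBMS can enforce the reference, whereas in the generic events table it cannot, because events.id points into a different table depending on the TABLE column. A sketch:

```sql
-- Enforceable in a per-entity design:
CREATE TABLE user_events (
  user_id INT UNSIGNED NOT NULL,
  event   ENUM('created','activated','login') NOT NULL,
  `date`  TIMESTAMP NOT NULL,
  FOREIGN KEY (user_id) REFERENCES user (id)  -- bad rows are rejected by the DBMS
) ENGINE=InnoDB;

-- Not expressible in the generic design: no FOREIGN KEY clause can say
-- "id references user.id when `table` = 'user', but user_ads.id when
-- `table` = 'user_ads'".
```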
You might think that events aren't so important, but, as an example, ad impressions are money. I would definitely want to catch any bugs relating to ad impressions as early in the design process as possible.
Further comment
And with some caveats you can make a very successful system. With a proper system of constraints, you get to say, "if any app messing with the database doesn't know what it's doing, the DBMS will flag an error." That may require more time and money than you've got, so something simpler that you can have is probably better than something more perfect that you can't. C'est la vie.
I can't add a comment to Ben's answer, so two things...
First, it would be one thing to use views in a standalone OLAP/DSS database; it's quite another to use them in your transaction database. The High Performance MySQL people recommend against using views where performance matters.

WRT data integrity, I agree, and that's another advantage of using a star or snowflake with 'events' as the central fact table (as well as using multiple event tables, like I do). But you cannot design a referential integrity scheme around IP addresses.