SQL: joins vs. denormalization (lots of data)
I know variations of this question have been asked before. But my case may be a little different :-)
So, I am building a site that tracks events. Each event has id and value. It is also performed by a user, which has id, age, gender, city, country and rank. (these attributes are all integers, if it matters)
I need to be able to quickly get answers to two queries:
- get the number of events from users with a certain profile (for example, males aged 18-25 from Moscow, Russia)
- get the sum (and maybe also the average) of the values of events from users with a certain profile
Also, data is generated by multiple customers, which, in turn, can have multiple source_ids.
Access pattern: data will be mostly written by collector processes, but when queried (infrequently, by web ui) it has to respond quickly.
I expect LOTS of data, certainly more than one table or single server can handle.
I am thinking about grouping events in separate tables per day (that is, 'events_20111011'). Also I want to prefix table name with customer id and source id, so that data is isolated and can be trivially discarded (purge old data) and relatively easily moved around (distribute load to other machines).
This way, every such table will have a limited number of rows, let's say 10M tops.
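The naming scheme above can be sketched in a few lines. This is a minimal illustration using SQLite; the `c{customer}_s{source}_events_{day}` naming convention and the `event_table_name` helper are my own assumptions, not something fixed by the question.

```python
import sqlite3
from datetime import date

def event_table_name(customer_id: int, source_id: int, day: date) -> str:
    # Hypothetical convention: customer/source prefix plus day suffix,
    # e.g. "c42_s7_events_20111011".
    return f"c{customer_id}_s{source_id}_events_{day.strftime('%Y%m%d')}"

conn = sqlite3.connect(":memory:")
name = event_table_name(42, 7, date(2011, 10, 11))
conn.execute(f"""CREATE TABLE IF NOT EXISTS {name} (
    event_id INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL,
    value    INTEGER NOT NULL
)""")

# Purging a day of old data is then a cheap DROP TABLE
# instead of a slow DELETE over millions of rows:
conn.execute(f"DROP TABLE {name}")
```

The payoff of per-day, per-customer tables is exactly that last line: dropping a whole table is a metadata operation, while deleting rows from a shared table rewrites pages and bloats the log.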
So, the question is: what to do with user's attributes?
Option 1, normalized: store them in a separate table and reference them from the event tables.
- (pro) No repetition of data.
- (con) joins, which are expensive (or so I heard)
- (con) this requires the user table and the event tables to be on the same server
Option 2, redundant: store user attributes in event tables and index them.
- (pro) easier load balancing (self-contained tables can be moved around)
- (pro) simpler (faster?) queries
- (con) lots of disk space and memory used for repeating user attributes and corresponding indexes
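The two options can be put side by side in a small SQLite sketch. All table and column names here are illustrative (the question only names the attributes), and the sample data is made up; the point is the shape of the query each option requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Option 1 (normalized): user attributes live once, events reference them.
CREATE TABLE users (
    id INTEGER PRIMARY KEY, age INTEGER, gender INTEGER,
    city INTEGER, country INTEGER, rank INTEGER
);
CREATE TABLE events_norm (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    value INTEGER
);

-- Option 2 (redundant): attributes copied into each event row and indexed.
CREATE TABLE events_denorm (
    id INTEGER PRIMARY KEY, user_id INTEGER, value INTEGER,
    age INTEGER, gender INTEGER, city INTEGER, country INTEGER
);
CREATE INDEX ix_profile ON events_denorm (gender, country, city, age);
""")

# One sample user (age 20, male=1, Moscow=100, Russia=10) and one event:
conn.execute("INSERT INTO users VALUES (1, 20, 1, 100, 10, 5)")
conn.execute("INSERT INTO events_norm VALUES (1, 1, 30)")
conn.execute("INSERT INTO events_denorm VALUES (1, 1, 30, 20, 1, 100, 10)")

# Option 1: the profile filter forces a join.
n1, s1 = conn.execute("""
    SELECT COUNT(*), SUM(e.value) FROM events_norm e
    JOIN users u ON u.id = e.user_id
    WHERE u.gender = 1 AND u.country = 10 AND u.city = 100
      AND u.age BETWEEN 18 AND 25
""").fetchone()

# Option 2: a single index-backed scan of one self-contained table.
n2, s2 = conn.execute("""
    SELECT COUNT(*), SUM(value) FROM events_denorm
    WHERE gender = 1 AND country = 10 AND city = 100
      AND age BETWEEN 18 AND 25
""").fetchone()
```

Both queries return the same answer; the trade is that option 2 widens every event row (and its index) by the full profile, while option 1 keeps rows narrow but ties the two tables to the same server.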
Your design should be normalized, but your physical schema may end up denormalized for performance reasons.
Is it possible to do both? There is a reason why SQL Server ships with Analysis Services. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for data entry and day-to-day processing, while a reporting system handles the kinds of queries that would put a heavy load on the transactional system.
Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.
In most cases nightly updates are fine for reporting systems, but what works best depends on your hours of operation and other factors. I find most 8-5 businesses have more than enough time in the evening to update a reporting system.
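A nightly refresh of the kind this answer describes can be sketched as a single rebuild query from the normalized tables into a denormalized reporting table. This is a minimal SQLite sketch under assumed names (`report_profile`, `nightly_refresh`); a real system would likely refresh incrementally rather than truncate-and-rebuild.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER, gender INTEGER,
                    city INTEGER, country INTEGER);
CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, value INTEGER);
-- Reporting table: one row per profile bucket, refreshed off-hours.
CREATE TABLE report_profile (
    gender INTEGER, country INTEGER, city INTEGER, age INTEGER,
    event_count INTEGER, value_sum INTEGER
);
""")
conn.execute("INSERT INTO users VALUES (1, 20, 1, 100, 10)")
conn.executemany("INSERT INTO events VALUES (?, 1, ?)", [(1, 30), (2, 12)])

def nightly_refresh(conn):
    # Rebuild the aggregates from the normalized tables. Runs after hours,
    # so the daytime transactional load never competes with this query.
    conn.execute("DELETE FROM report_profile")
    conn.execute("""
        INSERT INTO report_profile
        SELECT u.gender, u.country, u.city, u.age, COUNT(*), SUM(e.value)
        FROM events e JOIN users u ON u.id = e.user_id
        GROUP BY u.gender, u.country, u.city, u.age
    """)

nightly_refresh(conn)
# The web UI now reads the small rollup table, not the raw events:
row = conn.execute("""SELECT event_count, value_sum FROM report_profile
                      WHERE gender = 1 AND country = 10""").fetchone()
```

This is the "best of both worlds" shape: the join cost is paid once per night in batch, and the infrequent UI queries hit a tiny pre-joined table.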
Use an OLAP/Data Warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the data that will be queried frequently in separate fact tables. The user queries won't be on real-time data, but it is usually worth it for the performance trade off.
Also, if you are using SQL Server Enterprise, I wouldn't roll my own horizontal partitioning scheme (breaking the data into days). SQL Server has tools built in to do that for you automatically.
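The fact-table idea from this answer can also be maintained incrementally as events arrive, instead of in a nightly batch. Here is a small SQLite sketch under assumed names (`fact_events`, `record_event`); it uses an upsert so each incoming event bumps the matching aggregate row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_events (
    day TEXT, gender INTEGER, country INTEGER, city INTEGER, age INTEGER,
    event_count INTEGER, value_sum INTEGER,
    PRIMARY KEY (day, gender, country, city, age)
)""")

def record_event(conn, day, gender, country, city, age, value):
    # Maintain the aggregate as events arrive; queries never touch raw rows.
    conn.execute("""
        INSERT INTO fact_events VALUES (?, ?, ?, ?, ?, 1, ?)
        ON CONFLICT(day, gender, country, city, age)
        DO UPDATE SET event_count = event_count + 1,
                      value_sum   = value_sum + excluded.value_sum
    """, (day, gender, country, city, age, value))

record_event(conn, "2011-10-11", 1, 10, 100, 20, 30)
record_event(conn, "2011-10-11", 1, 10, 100, 20, 12)

# Count and sum over a profile range now read only fact rows:
cnt, total = conn.execute(
    "SELECT SUM(event_count), SUM(value_sum) FROM fact_events "
    "WHERE gender = 1 AND country = 10 AND age BETWEEN 18 AND 25"
).fetchone()
```

As the answer notes, the aggregates lag the raw data slightly (or, with upserts, add a little write cost), but both of the question's queries become sums over a far smaller table.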
Please normalize.
Use partitioning and indexing to balance the load.
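To make this terse answer concrete: the indexing half means a composite index that covers the profile filter, so the query planner searches the index instead of scanning the whole table (or partition). A minimal SQLite sketch, with an illustrative index name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    id INTEGER PRIMARY KEY, user_id INTEGER, value INTEGER,
    gender INTEGER, country INTEGER, city INTEGER, age INTEGER)""")

# Composite index matching the profile filter: equality columns first,
# the range column (age) last.
conn.execute("CREATE INDEX ix_profile ON events (country, city, gender, age)")

plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT COUNT(*) FROM events
    WHERE country = 10 AND city = 100 AND gender = 1
      AND age BETWEEN 18 AND 25""").fetchall()
# The plan reports a search using ix_profile rather than a full table scan.
```

Column order matters: with `age` anywhere but last, the range predicate would stop the index from narrowing on the columns after it.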