How to handle the large amount of data in a web statistics module
I'm developing a statistics module for my website that will help me measure conversion rates and other interesting data.
The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
- Website - a general zone used to count unique users, as I've stopped trusting Google Analytics lately.
- Category - self descriptive.
- Minisite - self descriptive.
- Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
- How do large-scale data-parsing applications like Google Analytics load the data so fast?
- What is the best way for me to do it?
- Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
3 Answers
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page starts up, you compute them offline, either via a nightly batch process or incrementally as each log record is written.
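A minimal sketch of the nightly-batch variant, assuming SQL Server (implied by ASP.NET); the raw table StatHits(ZoneType, ZoneId, UserId, HitDate) and all other names here are hypothetical:

```sql
-- Small summary table that the report pages read instead of the raw hits.
CREATE TABLE DailyZoneStats (
    StatDate    date        NOT NULL,
    ZoneType    varchar(20) NOT NULL,  -- 'Website', 'Category', 'Minisite', 'ProductImage'
    ZoneId      int         NOT NULL,
    Hits        int         NOT NULL,
    UniqueUsers int         NOT NULL,
    PRIMARY KEY (StatDate, ZoneType, ZoneId)
);

-- Nightly job: roll up yesterday's raw hits into one row per zone per day.
INSERT INTO DailyZoneStats (StatDate, ZoneType, ZoneId, Hits, UniqueUsers)
SELECT CAST(HitDate AS date),
       ZoneType,
       ZoneId,
       COUNT(*),
       COUNT(DISTINCT UserId)
FROM   StatHits
WHERE  HitDate >= CAST(DATEADD(day, -1, GETDATE()) AS date)
  AND  HitDate <  CAST(GETDATE() AS date)
GROUP BY CAST(HitDate AS date), ZoneType, ZoneId;
```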
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor on the order of the hits per session. Of course, it would increase processing costs when inserting log entries.
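What per-session counting could look like, sketched as a MERGE upsert (SQL Server 2008+ syntax; other engines have their own upsert forms). @SessionId, @ZoneType, and @ZoneId are parameters the logging code would pass in; the table is hypothetical:

```sql
-- One row per (session, zone) with a counter, instead of one row per hit.
CREATE TABLE SessionZoneCounts (
    SessionId uniqueidentifier NOT NULL,
    ZoneType  varchar(20)      NOT NULL,
    ZoneId    int              NOT NULL,
    Hits      int              NOT NULL,
    PRIMARY KEY (SessionId, ZoneType, ZoneId)
);

-- Run on every hit: bump the counter, or create the row on first sight.
MERGE SessionZoneCounts AS t
USING (SELECT @SessionId AS SessionId, @ZoneType AS ZoneType, @ZoneId AS ZoneId) AS s
ON  t.SessionId = s.SessionId AND t.ZoneType = s.ZoneType AND t.ZoneId = s.ZoneId
WHEN MATCHED THEN
    UPDATE SET Hits = t.Hits + 1
WHEN NOT MATCHED THEN
    INSERT (SessionId, ZoneType, ZoneId, Hits)
    VALUES (s.SessionId, s.ZoneType, s.ZoneId, 1);
```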
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
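One relational flavor of that idea is GROUP BY CUBE (also SQL Server 2008+), which pre-aggregates along every combination of the listed dimensions in one pass; this sketch reuses the hypothetical DailyZoneStats summary table from above:

```sql
-- Produces per-(ZoneType, StatDate) rows plus subtotal rows for each
-- dimension alone and a grand total (the extra rows have NULL dimensions).
SELECT ZoneType, StatDate, SUM(Hits) AS TotalHits
FROM   DailyZoneStats
GROUP BY CUBE (ZoneType, StatDate);
```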
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. Together with the aggregation ideas mentioned earlier, this will improve reporting response times.
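A sketch of such a periodic export, assuming both databases sit on the same SQL Server instance (otherwise a linked server, an SSIS package, or replication would do the copying); the database, table, and column names are made up, and @LastExportedAt is a watermark you'd persist between runs:

```sql
-- Copy new rows from the write-optimized DB to the reporting DB.
INSERT INTO ReportingDb.dbo.StatHits (ZoneType, ZoneId, UserId, HitDate)
SELECT ZoneType, ZoneId, UserId, HitDate
FROM   TransactionalDb.dbo.StatHits
WHERE  HitDate >= @LastExportedAt;

-- The reporting copy can be indexed heavily without slowing down the
-- transactional side's inserts.
CREATE INDEX IX_StatHits_Date_Zone
    ON ReportingDb.dbo.StatHits (HitDate, ZoneType, ZoneId);
```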
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to cover every row, so it grows with your data; a partition holds only, say, a single day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
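For illustration, this is roughly what monthly range partitioning looks like in SQL Server (note it was an Enterprise-edition feature before SQL Server 2016 SP1); boundary dates and names are only examples:

```sql
-- Map date ranges to partitions: one partition per month.
CREATE PARTITION FUNCTION pfStatsByMonth (date)
AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');

-- Place all partitions on the default filegroup for simplicity.
CREATE PARTITION SCHEME psStatsByMonth
AS PARTITION pfStatsByMonth ALL TO ([PRIMARY]);

-- The statistics table is created on the partition scheme, keyed by date.
CREATE TABLE StatHits (
    HitDate  date        NOT NULL,
    ZoneType varchar(20) NOT NULL,
    ZoneId   int         NOT NULL,
    UserId   int         NOT NULL
) ON psStatsByMonth (HitDate);

-- A query with a date predicate touches only the matching partition(s).
SELECT COUNT(*)
FROM   StatHits
WHERE  HitDate >= '2010-02-01' AND HitDate < '2010-03-01';
```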