How do you organize time-based events (e.g. visits) in a database for aggregation?
I have a need to give my users time-based statistics about different kinds of visitors to their pages, such as how many visitors in the last week / last month / overall, and perhaps where those visitors are coming from. Ideally, the data would be rich enough to allow for expansion/customization/re-aggregation if possible.
My question is how best to organize this in the database. One option would be to create a new record for each visit, containing the visitor's id and the date.
I suspect that such an approach, while giving me the functionality, would be storage-heavy. Is there a better way to approach this?
Comments (2)
The accepted industry practice is to have a trace-level table and summary-level tables. Trace data means a detailed log of events. Even if storage were not an issue, the time it takes to count would be. It is much quicker to query visit counts by the hour than to count the records of the trace table.
Let us say your trace-level table records one row per visit event.
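A rough sketch of such a row; the column names are only illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical trace-level row: one record per raw visit event.
@dataclass
class TraceVisit:
    visited_at: datetime            # when the event happened
    session_id: str                 # browser/session identifier
    page_id: str                    # which page was requested
    client_ip: str                  # visitor's IP address
    user_id: Optional[str] = None   # logged-in user, if any
    referrer: Optional[str] = None  # referring URL, if known
```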
As your process inserts into the trace table, it would also create/update records for the summary table.
Determine the finest granularity of time at which you wish to accumulate events. It could be minute, hour, day, etc.
You also need to determine if you wish to count by session or by page request. Would you count a repeated visit on a page as more than one visit? You have to determine all such denominators.
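For example, the same raw events yield different numbers depending on which denominator you pick (field names are illustrative):

```python
# Four raw page-request events across two sessions.
events = [
    {"session_id": "s1", "page_id": "/home"},
    {"session_id": "s1", "page_id": "/home"},     # repeat view in the same session
    {"session_id": "s1", "page_id": "/pricing"},
    {"session_id": "s2", "page_id": "/home"},
]

page_requests = len(events)                                          # 4: every hit counts
sessions = len({e["session_id"] for e in events})                    # 2: one per visitor session
unique_page_visits = len({(e["session_id"], e["page_id"]) for e in events})  # 3: repeats collapsed

print(page_requests, sessions, unique_page_visits)
```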
Define a counter:
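A minimal sketch of such a counter; the answer only requires a start datetime and a running count, so the exact fields and method names are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime

# In-memory counter for the current accumulation window.
@dataclass
class Counter:
    start: datetime = field(default_factory=datetime.now)
    count: int = 0

    def increment(self) -> None:
        self.count += 1

    def reset(self) -> None:
        self.start = datetime.now()
        self.count = 0
```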
Let us say that you have determined that by-hour granularity is the most appropriate for the traffic on the site.
Cumulating by session over the whole site
At the start of the hour, the counter is reset and its start datetime is set to the current datetime. The count is incremented for each event that occurs; in your case, that would be a new session event.
When the next hour arrives, the Counter data is written to a new record in the ByHour summary table, and again the counter is reset and the start datetime is set to the current datetime.
What you have now is a table that cumulates new-visit events by the hour, so you can select the sum of count over all ByHour records that fall within a given day, week, month, or year of a specified range.
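A sketch of the ByHour row, the hourly roll-over, and the range query; only the ByHour name comes from the description above, the rest is illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# One summary row per hour (the ByHour table; columns are illustrative).
@dataclass
class ByHourCount:
    start: datetime   # beginning of the hour this row covers
    count: int        # new-session events seen during that hour

by_hour: list[ByHourCount] = []   # stands in for the database table here

def roll_over(counter_start: datetime, counter_count: int) -> None:
    """Called when the next hour arrives: persist the counter's data as a
    new ByHour row (the caller then resets the in-memory counter)."""
    by_hour.append(ByHourCount(start=counter_start, count=counter_count))

def visits_between(range_start: datetime, range_end: datetime) -> int:
    """Sum the hourly counts falling inside a day, week, month or year."""
    return sum(r.count for r in by_hour if range_start <= r.start < range_end)

# e.g. total visits over the last 7 days:
now = datetime.now()
last_week = visits_between(now - timedelta(days=7), now)
```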
In order not to affect the response of the web service, the web service should write those events into a bus or queue, which would be picked up by a separate data loading process.
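For illustration, here is that decoupling sketched with an in-process queue standing in for the bus; a real deployment would more likely use an external message queue:

```python
import queue
import threading

events: "queue.Queue[dict]" = queue.Queue()

def handle_request(session_id: str, page_id: str) -> None:
    # Web handler: enqueue the event and return to the client immediately.
    events.put({"session_id": session_id, "page_id": page_id})

def loader() -> None:
    # Separate data-loading worker: drain the queue, write the trace row
    # and bump the counters (persistence calls are stubbed out here).
    while True:
        event = events.get()
        if event is None:   # sentinel used to shut the loader down
            break
        # insert_trace_row(event); counter.increment()

threading.Thread(target=loader, daemon=True).start()
```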
Cumulating per page visited
However, you might want to cumulate by page. The cumulator process would then need to maintain an instance of the cumulator class per pageId, and the count record would carry the pageId as well.
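A sketch of both, with illustrative names:

```python
from dataclasses import dataclass
from datetime import datetime

# Counter kept per pageId (restated minimally for this sketch).
@dataclass
class PageCounter:
    start: datetime
    count: int = 0

counters: dict[str, PageCounter] = {}

def record_page_visit(page_id: str) -> None:
    counter = counters.setdefault(page_id, PageCounter(start=datetime.now()))
    counter.count += 1

# Per-page count record written at each hourly roll-over.
@dataclass
class ByHourByPage:
    start: datetime   # beginning of the hour
    page_id: str      # the page being counted
    count: int        # visits to that page in the hour
```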
Cumulating per userId, pageId, sessionId, clientIp, anything-goes
Then, you further determine that you need to cumulate visits per user id. In that case, you might decide that it is better to cumulate by day rather than by hour. If you cumulated by hour, you might as well depend on the trace table, because a user starting a new session or visiting is rare enough per hour that the hourly rollup would be nearly as fine-grained as the trace table itself.
The cumulator process would need to maintain an instance of the cumulator class per paramId, and the count record becomes generic.
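A sketch, where only paramName, paramId and the ByDay table come from the description; the rest is illustrative:

```python
from dataclasses import dataclass
from datetime import date

# One running count per (paramName, paramId) pair, e.g. ("userId", "u42").
counters: dict[tuple[str, str], int] = {}

def record_event(param_name: str, param_id: str) -> None:
    key = (param_name, param_id)
    counters[key] = counters.get(key, 0) + 1

# Generic count record written to the ByDay table.
@dataclass
class ByDayCount:
    day: date          # the day this row covers
    param_name: str    # "pageId", "userId", "clientIp", ...
    param_id: str      # the specific page / user / client being counted
    count: int
```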
Where paramName could be "pageId", "userId", "clientIp", etc.
To prevent losing data, you could flush any Counter instances that had changes to the ByDay table every ten minutes, without resetting the counter instances. Reset the start and count only when the day crosses over to a new day.
The "modified" field is not written to the database table. It is a flag to determine if a Counter instance needs to update the ByDay table. If an update occurs, the "modified" field is reset to "false".
I would go for "a new record for each visit"; programming would be complicated otherwise.
You can estimate how much storage you need based on the traffic of your site. Do this, and you might decide not to worry about storage, because storage is not usually an issue for modern-day DBMSs.
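For example, a back-of-envelope estimate; the row size and traffic figures below are invented, so substitute your own:

```python
# Assumed figures, purely illustrative.
bytes_per_visit_row = 100      # id, timestamp, visitor id, page id, ip
visits_per_day = 50_000

per_day = bytes_per_visit_row * visits_per_day   # ~5 MB per day
per_year = per_day * 365                          # ~1.8 GB per year

print(f"{per_day / 1e6:.1f} MB per day, {per_year / 1e9:.2f} GB per year")
```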
If storage is really a problem, then you need to decide (and fix) what statistics you need beforehand. Then you can record each visit for just a single month. At the end of each month, you summarize that month and add this summary to the existing summary compiled from the previous months' statistics.