Real-time data warehouse for web access logs



We're thinking about putting up a data warehouse system to load with the web access logs that our web servers generate. The idea is to load the data in real time.

We want to present the user with a line graph of the data and enable them to drill down using the dimensions.

The question is how to balance and design the system so that:

(1) the data can be fetched and presented to the user in real time (<2 seconds),

(2) the data can be aggregated on a per-hour and per-day basis, and

(3) large amounts of data can still be stored in the warehouse.

Our current data rate is roughly ~10 accesses per second, which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema show that my queries start to take longer than 2 seconds once we have more than 8 million rows.

Is it possible to get real-time query performance from a "simple" data warehouse like this, and still have it store a lot of data (it would be nice to never have to throw away any data)?

Are there ways to roll the data up into coarser-resolution aggregate tables?

I have a feeling this isn't really a new question (I've googled quite a lot, though). Could someone maybe give pointers to data warehouse solutions like this? One that comes to mind is Splunk.

Maybe I'm asking for too much.

UPDATE

My schema looks like this (a rough MySQL sketch follows the list):

  • dimensions:

    • client (ip-address)
    • server
    • url
  • facts:

    • timestamp (in seconds)
    • bytes transmitted
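
For concreteness, here is a minimal sketch of what this star schema might look like in MySQL; all table and column names below are illustrative, not settled design:

```sql
-- Illustrative star schema for the access log.
CREATE TABLE dim_client (
    client_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ip_address VARCHAR(45) NOT NULL,      -- 45 chars also fits IPv6
    UNIQUE KEY uk_ip (ip_address)
);

CREATE TABLE dim_server (
    server_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    hostname  VARCHAR(255) NOT NULL,
    UNIQUE KEY uk_host (hostname)
);

CREATE TABLE dim_url (
    url_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url    VARCHAR(2048) NOT NULL,
    KEY k_url (url(255))                  -- prefix index; URLs can be long
);

CREATE TABLE fact_access (
    ts        INT UNSIGNED NOT NULL,      -- unix timestamp, in seconds
    client_id INT UNSIGNED NOT NULL,
    server_id INT UNSIGNED NOT NULL,
    url_id    INT UNSIGNED NOT NULL,
    bytes     INT UNSIGNED NOT NULL,      -- bytes transmitted
    KEY k_ts (ts)
);

-- A typical drill-down query: traffic per server over the last hour.
SELECT s.hostname, COUNT(*) AS hits, SUM(f.bytes) AS bytes
FROM fact_access f
JOIN dim_server s ON s.server_id = f.server_id
WHERE f.ts >= UNIX_TIMESTAMP() - 3600
GROUP BY s.hostname;
```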


Comments (4)

牵你的手,一向走下去 2024-08-23 14:21:20


Seth's answer above is a very reasonable one, and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.

Mozilla does a lot of web service analytics. We keep track of details on an hourly basis, and we use a commercial DB product, Vertica. It works very well for this kind of approach, but since it is a proprietary commercial product, it has a different set of associated costs.

Another technology you might want to investigate is MongoDB. It is a document-store database with a few features that make it potentially a great fit for this use case: namely, capped collections (search for "mongodb capped collections" for more info), and its fast increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics

知足的幸福 2024-08-23 14:21:20


Doesn't sound like it would be a problem. MySQL is very fast.

For storing the logging data, use MyISAM tables -- they're much faster and well suited to web server logs. (I think InnoDB is the default for new installations these days; foreign keys and all the other InnoDB features aren't necessary for log tables.) You might also consider using MERGE tables -- you can keep the individual tables at a manageable size while still being able to access them all as one big table; a sketch follows below.
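
A minimal sketch of the MERGE-table idea, with illustrative names (one MyISAM table per day, exposed as a single logical table):

```sql
-- Identical per-day MyISAM tables (names are illustrative).
CREATE TABLE access_20240815 (
    ts    INT UNSIGNED NOT NULL,
    bytes INT UNSIGNED NOT NULL,
    KEY k_ts (ts)
) ENGINE=MyISAM;

CREATE TABLE access_20240816 LIKE access_20240815;

-- The MERGE table presents them as one big table; inserts go to the last one.
CREATE TABLE access_all (
    ts    INT UNSIGNED NOT NULL,
    bytes INT UNSIGNED NOT NULL,
    KEY k_ts (ts)
) ENGINE=MERGE UNION=(access_20240815, access_20240816) INSERT_METHOD=LAST;

-- Each day, create the next day's table and redefine the union, e.g.:
-- ALTER TABLE access_all
--     UNION=(access_20240815, access_20240816, access_20240817);
```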

If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.

Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
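
(Rough arithmetic, using the numbers from the question: ~800k rows/day × 365 days × 200 bytes ≈ 58 GB of raw log data per year, so 50 GB is a conservative lower bound.)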

You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep it around. Even then, it probably won't be for more than 3-4 years.)

衣神在巴黎 2024-08-23 14:21:20


Also, look into partitioning, especially if your queries mostly access the latest data; you could, for example, set up weekly partitions of ~5.5M rows.

If aggregating per day and per hour, consider having date and time dimensions -- you did not list them, so I assume you do not use them. The idea is not to have any functions in the query, such as HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as the fact tables.

With this in place, the query optimizer can use partition pruning, so the total size of the tables does not influence query response time the way it did before.
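
A rough sketch of how this could look in MySQL, assuming a variant of the fact table keyed by date and time dimensions (all names and partition boundaries are illustrative):

```sql
-- Date dimension keyed on an integer such as 20240816, so queries filter on
-- date_key directly instead of wrapping a timestamp in DATE() or HOUR().
CREATE TABLE dim_date (
    date_key    INT UNSIGNED NOT NULL PRIMARY KEY,  -- e.g. 20240816
    full_date   DATE NOT NULL,
    day_of_week TINYINT NOT NULL,
    month       TINYINT NOT NULL,
    year        SMALLINT NOT NULL
);

-- Fact table partitioned by date_key range, roughly one week per partition.
CREATE TABLE fact_access_by_date (
    date_key  INT UNSIGNED NOT NULL,
    hour_key  TINYINT UNSIGNED NOT NULL,   -- hour of day, 0-23
    client_id INT UNSIGNED NOT NULL,
    bytes     INT UNSIGNED NOT NULL
)
PARTITION BY RANGE (date_key) (
    PARTITION p2024w33 VALUES LESS THAN (20240819),
    PARTITION p2024w34 VALUES LESS THAN (20240826),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- A range filter on date_key lets the optimizer prune all other partitions.
SELECT hour_key, SUM(bytes) AS bytes
FROM fact_access_by_date
WHERE date_key BETWEEN 20240812 AND 20240818
GROUP BY hour_key;
```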

可爱暴击 2024-08-23 14:21:20


This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1-second response time (from the database), and a bit over a second from the web server. This isn't even on a huge server.

Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.

But you will want to mostly hit aggregate data rather than detail data, especially for time-series graphing over days, months, etc. Aggregate data can either be created periodically on your database through an asynchronous process, or, in cases like this, it typically works best if the ETL process that transforms your data also creates the aggregates. Note that an aggregate is typically just a group-by of your fact table.
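
As an illustration, an hourly aggregate is just such a group-by, materialized into its own table (names here are assumptions, matching the sketch in the question's update):

```sql
-- Hourly rollup of the detail fact table: just a GROUP BY, materialized.
CREATE TABLE agg_access_hourly (
    hour_ts   INT UNSIGNED NOT NULL,   -- unix timestamp truncated to the hour
    server_id INT UNSIGNED NOT NULL,
    url_id    INT UNSIGNED NOT NULL,
    hits      INT UNSIGNED NOT NULL,
    bytes     BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (hour_ts, server_id, url_id)
);

-- Run as the last ETL step (or from a scheduler) for the hour just finished.
SET @hour_start := UNIX_TIMESTAMP() - (UNIX_TIMESTAMP() % 3600) - 3600;

INSERT INTO agg_access_hourly (hour_ts, server_id, url_id, hits, bytes)
SELECT @hour_start, server_id, url_id, COUNT(*), SUM(bytes)
FROM fact_access
WHERE ts >= @hour_start AND ts < @hour_start + 3600
GROUP BY server_id, url_id;
```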

As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.

Regarding the database: if it were me, I'd try PostgreSQL rather than MySQL. The reason is primarily optimizer maturity: PostgreSQL can better handle the kinds of queries you're likely to run. MySQL is more likely to get confused on five-way joins, to go bottom-up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like DB2, Oracle, or SQL Server. Then you'd get additional features such as query parallelism, automatic query rewrite against aggregate tables, and additional optimizer sophistication.
