Strategy for gathering analytics from a large application

Published 2024-08-11 20:34:52


Superfeedr is a feed-parsing on demand service. We want to provide analytics to our users and we're investigating what would be the best strategy to do so.

In a nutshell, we want to track the number of operations in our system (events, like: a new entry in a given feed) as well as aggregated data (the number of subscribers for a feed).

Of course, the aggregated data can be "computed" from the events (the number of subscribers to a feed is the sum of subscriptions minus the sum of unsubscriptions). Yet, since we want to study that over time (number of subscribers on a daily basis), the evented approach may be sub-optimal, since we would re-compute the same thing over and over.
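The trade-off above can be made concrete with a small sketch. The event format and feed ids here are hypothetical; the point is that deriving a subscriber count from raw events means re-folding the same prefix of the log for every day you ask about, while a daily snapshot is computed once:

```ruby
require 'date'

# Hypothetical event log: [day, feed_id, event_type]
EVENTS = [
  [Date.new(2010, 1, 1), "feed-1", :subscribe],
  [Date.new(2010, 1, 1), "feed-1", :subscribe],
  [Date.new(2010, 1, 2), "feed-1", :unsubscribe],
  [Date.new(2010, 1, 3), "feed-1", :subscribe],
]

# Evented approach: the count on a given day is subscriptions minus
# unsubscriptions over all events up to that day. Asking for N days
# means scanning the same events N times.
def subscribers_on(feed_id, day)
  EVENTS.count { |d, f, t| f == feed_id && d <= day && t == :subscribe } -
    EVENTS.count { |d, f, t| f == feed_id && d <= day && t == :unsubscribe }
end

# Pre-aggregated approach: fold once, keep one snapshot per day.
def daily_snapshots(feed_id)
  total = 0
  EVENTS.select { |_, f, _| f == feed_id }
        .group_by { |d, _, _| d }
        .sort
        .map do |day, evts|
    total += evts.count { |_, _, t| t == :subscribe } -
             evts.count { |_, _, t| t == :unsubscribe }
    [day, total]
  end
end

p subscribers_on("feed-1", Date.new(2010, 1, 2))  # => 1
p daily_snapshots("feed-1")
```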

How would one build such a component in your app? What information flow? What data-stores? What graphing solution? etc...

I know this is quite an open question, but I am sure we're not the first ones with such a need!

[UPDATE]:
Infrastructure: We have a set of workers that are XMPP clients and interact with each other. They are built on EventMachine, which means they do not block on IO.
Desired target : we must be able to collect massive amounts of data. Currently, we are already at about 200-300 msg/sec and we aim at 10x-100x that.
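At those rates you probably don't want one store write per event. A common pattern (sketched here without an EventMachine dependency; in an EM worker the flush would be scheduled with `EM.add_periodic_timer` rather than called by hand) is to accumulate counters in memory and flush the deltas periodically:

```ruby
# In-memory counter that batches increments and periodically flushes
# deltas to a store. The key scheme ("feed-1:new_entry") is hypothetical.
class StatsBuffer
  def initialize(&sink)
    @counts = Hash.new(0)  # key => pending delta
    @sink = sink           # receives { key => delta } on each flush
  end

  def incr(key, by = 1)
    @counts[key] += by
  end

  def flush
    return if @counts.empty?
    batch, @counts = @counts, Hash.new(0)
    @sink.call(batch)      # one write per flush interval, not per event
  end
end

flushed = []
stats = StatsBuffer.new { |batch| flushed << batch }
300.times { stats.incr("feed-1:new_entry") }
stats.incr("feed-1:subscribe", 5)
stats.flush
p flushed.first  # => {"feed-1:new_entry"=>300, "feed-1:subscribe"=>5}
```

This turns 300 msg/sec into one aggregate write per flush interval, at the cost of losing at most one interval's counts on a crash.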


Comments (1)

冷︶言冷语的世界 2024-08-18 20:34:52


It's tough to say without more information about your infrastructure and desired scaling targets. You may find this slide deck about How Twitter Uses Hadoop to be instructional. It was presented by Kevin Weil at the recent NoSQL East conference.


Borrowing ideas from what Twitter is doing you could consider an architecture split into collection, analysis and render phases.

Collection Phase: Super low latency. Very scalable. Lots of binding choices. Scribe was developed at Facebook.

Processing Node Log Event -> Scribe -> HDFS

Analysis Phase: SQL-like query language that will allow you to do exploratory ad-hoc queries as well.

HDFS -> Pig -> MySQL
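The Pig job in this phase essentially boils down to a GROUP BY (day, feed) over raw log lines. A rough Ruby equivalent of that aggregation step, assuming a hypothetical tab-separated log format of `day<TAB>feed<TAB>event_type`:

```ruby
# Rough stand-in for the Pig aggregation: parse raw log lines,
# group by (day, feed), and count events per group.
LOG = <<~LINES
  2010-01-01\tfeed-1\tnew_entry
  2010-01-01\tfeed-1\tnew_entry
  2010-01-01\tfeed-2\tnew_entry
  2010-01-02\tfeed-1\tsubscribe
LINES

rows = LOG.each_line.map { |l| l.chomp.split("\t") }
summary = rows.group_by { |day, feed, _| [day, feed] }
              .map { |(day, feed), evts| [day, feed, evts.size] }

summary.each { |day, feed, n| puts [day, feed, n].join("\t") }
```

The resulting per-day rows are what would be loaded into MySQL for the render phase.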

Render Phase: Implemented in your current web framework

MySQL -> JSON -> Memcached -> Flash Charting
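The render phase is then mostly a cache lookup: serialize the aggregated rows to JSON once, keep the payload in Memcached keyed by feed, and let the charting component fetch the JSON. A sketch with a plain Hash standing in for Memcached and a made-up payload shape:

```ruby
require 'json'

# Sketch of the render phase: one cached JSON payload per feed.
# A plain Hash stands in for Memcached here.
CACHE = {}

def chart_json(feed_id, rows)
  CACHE[feed_id] ||= JSON.generate(
    feed: feed_id,
    series: rows.map { |day, count| { date: day, subscribers: count } }
  )
end

rows = [["2010-01-01", 2], ["2010-01-02", 1]]
puts chart_json("feed-1", rows)
```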

There have been some posts here on SO regarding the choice of Flash charting components for the web. I personally have had good success with AmCharts.
