Real-time statistics: MySQL(/Drizzle) or MongoDB?

Posted on 2024-11-04 05:38:00

We are working on a project that will feature real-time statistics of some actions (e.g. clicks).
On every click, we will log information like date, age and gender (these come from Facebook), location, etc.

We are discussing the best place to store this information and use it for real-time stats. We will display aggregate statistics: for example, the total number of clicks, the number of clicks by gender, and the number of clicks by age group (e.g. 18-24, 24-30...).

Since on the site we are using MongoDB everywhere, my colleague thought we should store statistics inside it as well.
I, however, would prefer a SQL-based database for this task, like MySQL (or maybe Drizzle), because I believe SQL is better at operations like data aggregation. Although there's the overhead of parsing the SQL, I think MySQL/Drizzle may actually be faster than NoSQL databases here. And inserts are not slow either when using INSERT DELAYED queries.
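
To make the kind of aggregation I have in mind concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for MySQL/Drizzle. The `clicks` table, its column names, the sample rows and the exact age-group boundaries are all made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL/Drizzle (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical raw click log: one row per click.
conn.execute(
    "CREATE TABLE clicks (clicked_at TEXT, age INTEGER, gender TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?, ?)",
    [
        ("2011-04-30 00:15:00", 19, "M", "IT"),  # made-up sample rows
        ("2011-04-30 00:20:00", 23, "F", "IT"),
        ("2011-04-30 01:05:00", 27, "M", "US"),
    ],
)

# Clicks per gender.
by_gender = dict(
    conn.execute("SELECT gender, COUNT(*) FROM clicks GROUP BY gender").fetchall()
)

# Clicks per age group; the boundaries here are illustrative only.
by_age_group = dict(
    conn.execute(
        """
        SELECT CASE
                 WHEN age BETWEEN 18 AND 24 THEN '18-24'
                 WHEN age BETWEEN 25 AND 30 THEN '25-30'
                 ELSE 'other'
               END AS age_group,
               COUNT(*)
        FROM clicks
        GROUP BY age_group
        """
    ).fetchall()
)

print(by_gender)
print(by_age_group)
```

Every such query scans the raw click rows (or an index over them), which is the cost that has to be weighed against real-time display.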

Please note that we do not need to perform JOINs or collect data from multiple tables/collections. Thus, we don't mind if the statistics live in a different database from the rest of the site.
However, we do care about scalability and reliability. We are building something that will (hopefully) become very big, and we've designed every single line of code with scalability in mind.

What do you think about this?
Is there any reason to prefer MongoDB over MySQL/Drizzle for this? Or does it make no difference?
Which one would you use if you were us?

Thank you,
Alessandro

Comments (2)

尽揽少女心 2024-11-11 05:38:01

So BuddyMedia is using some of this. The Gilt Groupe has done something pretty cool with Hummingbird (node.js + MongoDB).

Having worked for a large online advertiser in the social media space, I can attest that real-time reporting is really a pain. Trying to "roll up" 500M impressions a day is already a challenge. Doing it in real time worked, but it carried some significant limitations (like actually being delayed by 5 minutes :)

Frankly, this type of problem is one of the reasons I started using MongoDB. And I'm not the only one. People are using MongoDB for all kinds of real-time analytics: server monitoring, centralized logging, as well as dashboard reporting.

The real key when doing this type of reporting is to understand that the data structure is completely different with MongoDB. You're going to avoid "aggregation" queries, so the queries and the output charts are going to be different. There's some extra coding work on the client side.

Here's the key that may point you in the right direction for doing this with MongoDB. Take a look at the following data structure:

{
  date: "20110430",
  gender: "M",
  age: 1, // 1 is probably a bucket
  impression_hour: [ 100, 50, ...], // 24 of these
  impression_minute: [ 2, 5, 19, 8, ... ], // 1440 of these
  clicks_hour: [ 10, 2, ... ],
  ...
}

There are obviously some tweaks to make here: appropriate indexes, maybe mushing date+gender+age into an _id. But that's the basic structure of click analytics with MongoDB. It's really easy to update impressions and clicks: { $inc : { "clicks_hour.0" : 1 } }. You get to update the whole document atomically. And it's actually pretty natural to report on: you already have arrays containing your hourly or minute-level data points.
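
A rough sketch of how the write and read sides could look, in plain Python so it runs without a server. The helper names (`bucket_id`, `empty_day_doc`, `click_update`, `apply_inc`, `clicks_per_hour`) and the `date:gender:age` _id format are hypothetical. It also assumes each day's document is created up front with zero-filled arrays, so that a path like `clicks_hour.0` addresses an array element.

```python
from datetime import datetime, timezone


def bucket_id(day: str, gender: str, age_bucket: int) -> str:
    # Hypothetical composite _id: date + gender + age bucket.
    return f"{day}:{gender}:{age_bucket}"


def empty_day_doc(day: str, gender: str, age_bucket: int) -> dict:
    # Pre-allocated document for one (day, gender, age bucket) combination.
    return {
        "_id": bucket_id(day, gender, age_bucket),
        "date": day,
        "gender": gender,
        "age": age_bucket,
        "impression_hour": [0] * 24,
        "impression_minute": [0] * 1440,
        "clicks_hour": [0] * 24,
    }


def click_update(ts: datetime, gender: str, age_bucket: int):
    # Build the filter and the $inc update for a single click event.
    day = ts.strftime("%Y%m%d")
    flt = {"_id": bucket_id(day, gender, age_bucket)}
    update = {"$inc": {f"clicks_hour.{ts.hour}": 1}}
    return flt, update


def apply_inc(doc: dict, update: dict) -> None:
    # In-memory stand-in for $inc on "array.index" paths, for illustration only.
    for path, amount in update["$inc"].items():
        field, index = path.split(".")
        doc[field][int(index)] += amount


def clicks_per_hour(docs) -> list:
    # Client-side roll-up: element-wise sum of clicks_hour across documents.
    totals = [0] * 24
    for doc in docs:
        for hour, count in enumerate(doc["clicks_hour"]):
            totals[hour] += count
    return totals


ts = datetime(2011, 4, 30, 0, 15, tzinfo=timezone.utc)
flt, update = click_update(ts, "M", 1)

male_doc = empty_day_doc("20110430", "M", 1)
female_doc = empty_day_doc("20110430", "F", 1)
apply_inc(male_doc, update)

hourly = clicks_per_hour([male_doc, female_doc])
print(flt, update, hourly[:3])
```

With a driver such as PyMongo, the `flt` and `update` dicts would be passed to something like `collection.update_one(flt, update)`. The report side then just reads a handful of documents and sums arrays instead of scanning raw click rows.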

Hopefully that points you in the right direction.

家住魔仙堡 2024-11-11 05:38:01

MongoDB is great for this kind of thing and will certainly be faster than MySQL, although don't underestimate how powerful MySQL can be: many companies have built analytics tools with it.

Have a look at this presentation by Patrick Stokes of BuddyMedia on how they used MongoDB for their analytics system.

http://www.slideshare.net/pstokes2/social-analytics-with-mongodb
