How to model data for an ad analytics tool in MongoDB
I am building an ad analytics tool which assumes a data structure like this:
- Account
- Campaign
- Keyword
- Conversion
I have a lot of information about individual conversion events, which can be tied back to the cost data of each campaign, keyword, ad group, etc. In SQL you could treat each property as a sort of text-based foreign key to the campaign, keyword or ad in a particular account, but that's inefficient and slow. Creating and populating fields like campaign_id, keyword_id, etc. doesn't sound like a great idea either, because I want the analytics to be available in near real time.
What would be a good way to model this with MongoDB?
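To make the hierarchy concrete, the kind of record being described might look like the sketch below: each conversion event carries the identifying text of its account, campaign, ad group and keyword directly (all field names and values here are illustrative, not from the original question):

```python
from datetime import datetime, timezone

# Hypothetical denormalized conversion event: the account/campaign/keyword
# labels are repeated on every event, so no joins are needed at query time.
conversion_event = {
    "account": "acme-corp",            # illustrative account label
    "campaign": "spring-sale-2012",    # illustrative campaign label
    "ad_group": "running-shoes",       # illustrative ad-group label
    "keyword": "buy running shoes",    # keyword that drove the click
    "conversion_value": 49.95,         # revenue attributed to this conversion
    "cost": 1.23,                      # click cost attributed to this conversion
    "occurred_at": datetime(2012, 4, 1, 12, 30, tzinfo=timezone.utc),
}
```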
Assuming a very high volume of conversion events (millions per day or more), a storage engine alone (MongoDB or anything else) won't help you. What you need is the ability to run map-reduce jobs over the data to calculate the analytics, and you can scale out the cluster as necessary to achieve near-real-time performance.
The free/open-source options that I can suggest are Hadoop (and probably HBase and Hive) or Riak.
There are other options - I'm only suggesting these two because I have personal experience with them in high-scale production environments. We're currently using Hadoop to power an analytics system that processes billions of events per day.
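To illustrate the kind of map-reduce job being suggested, here is a minimal Hadoop-Streaming-style sketch in Python that counts conversions per keyword per day. The tab-separated input format and field layout are assumptions for the example, not something taken from the question:

```python
#!/usr/bin/env python
"""Count conversions per (date, keyword) in Hadoop Streaming style.

Assumed input: one tab-separated conversion event per line, e.g.
    2012-04-01<TAB>acme-corp<TAB>spring-sale<TAB>buy running shoes
Run locally as `... map | sort | ... reduce`, or via Hadoop Streaming.
"""
import sys
from itertools import groupby


def mapper(lines):
    # Emit "date<TAB>keyword" as the key and 1 as the count for each event.
    for line in lines:
        date, account, campaign, keyword = line.rstrip("\n").split("\t")
        print(f"{date}\t{keyword}\t1")


def reducer(lines):
    # Input arrives sorted by key; sum the counts for each (date, keyword) group.
    parsed = (line.rstrip("\n").rsplit("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{key}\t{total}")


if __name__ == "__main__":
    # Choose the phase with a command-line flag: `map` or `reduce`.
    if sys.argv[1:] == ["map"]:
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```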
If you're not into rolling your own and are able and willing to pay (a lot!) then look at GreenPlum and Vertica.
I'll be happy to share more information on potential solution designs - but I'll need more data on what you're trying to achieve - scale, use cases etc.
I'm not sure that MongoDB is really the right choice for something like this, since MongoDB is better suited to storing loosely structured (or more complex) documents than hierarchical records like these. However, if you do go the MongoDB route, you can just use the account, campaign and keyword tags directly. There is no substantive benefit to abstracting these into meaningless surrogate keys; you can index those fields directly in MongoDB.
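As a rough sketch of that approach with pymongo (the database, collection and field names are my own assumptions), you store the labels verbatim on every event, put a compound index over them, and roll up with the aggregation pipeline:

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
conversions = client["ads"]["conversions"]  # assumed database/collection names

# Store the account/campaign/keyword labels directly on each conversion event.
conversions.insert_one({
    "account": "acme-corp",
    "campaign": "spring-sale-2012",
    "keyword": "buy running shoes",
    "conversion_value": 49.95,
})

# Index the text fields themselves; no surrogate integer keys are needed.
conversions.create_index(
    [("account", ASCENDING), ("campaign", ASCENDING), ("keyword", ASCENDING)]
)

# Example rollup: total conversion value per keyword within one campaign.
totals = conversions.aggregate([
    {"$match": {"account": "acme-corp", "campaign": "spring-sale-2012"}},
    {"$group": {"_id": "$keyword", "value": {"$sum": "$conversion_value"}}},
])
for row in totals:
    print(row["_id"], row["value"])
```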
I don't know what your volumes are going to be or what other factors are affecting your technology choices. However, assuming that your accounts, campaigns and keywords don't change that frequently, you could do this with a plain old RDBMS (SQL Server, Oracle, etc.), using lookup tables for these dimensions in which the foreign keys are meaningless integers. If you're doing live analytics, you could adopt a star schema and keep all of the numeric FKs on the base fact table (Conversion), so that instead of joining a chain of four tables to get the whole picture you'd be doing three one-hop joins. This would allow you to summarize at any level with only a single join.
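Here is a minimal sketch of that star schema using Python's built-in sqlite3 (table and column names are illustrative). The fact table carries all three integer FKs, so rolling up by keyword needs only the single one-hop join shown at the end:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension tables: meaningless integer keys, one row per account/campaign/keyword.
conn.executescript("""
    CREATE TABLE dim_account  (account_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_campaign (campaign_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_keyword  (keyword_id  INTEGER PRIMARY KEY, text TEXT);

    -- Fact table: every conversion carries all three FKs plus the measures.
    CREATE TABLE fact_conversion (
        account_id       INTEGER REFERENCES dim_account(account_id),
        campaign_id      INTEGER REFERENCES dim_campaign(campaign_id),
        keyword_id       INTEGER REFERENCES dim_keyword(keyword_id),
        conversion_value REAL
    );
""")

conn.executemany("INSERT INTO dim_account  VALUES (?, ?)", [(1, "acme-corp")])
conn.executemany("INSERT INTO dim_campaign VALUES (?, ?)", [(1, "spring-sale")])
conn.executemany("INSERT INTO dim_keyword  VALUES (?, ?)",
                 [(1, "buy running shoes"), (2, "cheap sneakers")])
conn.executemany("INSERT INTO fact_conversion VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 49.95), (1, 1, 2, 19.99), (1, 1, 1, 55.00)])

# Summarize at the keyword level with a single one-hop join to the dimension.
for row in conn.execute("""
    SELECT k.text, SUM(f.conversion_value)
    FROM fact_conversion AS f
    JOIN dim_keyword AS k ON k.keyword_id = f.keyword_id
    GROUP BY k.text
"""):
    print(row)
```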