MongoDB data structure with a large number of inner documents
I am relatively new to MongoDB, and so far am really impressed. I am struggling with the best way to set up my document stores though. I am trying to do some summary analytics using Twitter data and I am not sure whether to put the tweets into the user document, or to keep those as a separate collection. It seems like putting the tweets inside the user model would quickly hit the limit with regards to size. If that is the case then what is a good way to be able to run MapReduce across a group of users' tweets?
I hope I am not being too vague but I don't want to get too specific and too far down the wrong path as far as setting up my domain model.
As I am sure you are all bored of hearing, I am used to RDB land where I would lay out my schema like
| USER |
--------
|ID
|Name
|Etc.
|TWEET__|
---------
|ID
|UserID
|Etc
It seems like the logical schema in Mongo would be
User
  |-Tweet (0..3000)
    |-Entities
      |-Hashtags (0..10+)
      |-urls (0..5)
      |-user_mentions (0..12)
    |-GeoData (0..20)
  |-somegroupID
But wouldn't that quickly bloat the User document beyond capacity? I would still like to run analysis on tweets belonging to users with similar somegroupID. Conceptually it makes sense to lay out the model as above, but at what point does that become too unwieldy? And what are viable alternatives?
Comments (2)
You're right that you'll probably run into the 16MB MongoDB document limit here. You don't say what sort of analysis you'd like to run, so it is difficult to recommend a schema. MongoDB schemas are designed with the data-query (and insertion) patterns in mind.
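To get a rough feel for how quickly embedded tweets approach that limit, the arithmetic can be sketched as below. The average tweet sizes and the per-user overhead are assumptions chosen for illustration; real tweet payloads with entities and geo data vary widely, so measure your own data before relying on these numbers.

```python
# MongoDB's maximum BSON document size is 16 MiB.
MAX_DOC_BYTES = 16 * 1024 * 1024


def max_embedded_tweets(avg_tweet_bytes, user_overhead_bytes=1024):
    """Return how many tweets of the given average size fit in one document.

    user_overhead_bytes is an assumed allowance for the user's own fields.
    """
    if avg_tweet_bytes <= 0:
        raise ValueError("avg_tweet_bytes must be positive")
    return (MAX_DOC_BYTES - user_overhead_bytes) // avg_tweet_bytes


# Assumed sizes: a trimmed tweet (2 KiB) versus a fuller payload (6 KiB).
for size in (2 * 1024, 6 * 1024):
    print(size, max_embedded_tweets(size))
```

Under the 6 KiB assumption, a user with 3000 embedded tweets would no longer fit in a single document.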
Instead of putting your tweets in a user, you can of course quite easily do the opposite: add a user-id and group-id to the tweet documents themselves. Then, if you need additional fields from the user, you can always pull them in with a second query upon display.
I mean a design for a tweet document as:
And then for a user collection, the documents could look like:
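A minimal sketch of what such a pair of documents might look like, written as plain Python dicts so that it runs without a MongoDB server. All field names (`user_id`, `group_id`, `hashtags`, `name`) and values are hypothetical and not taken from the answer. The helpers filter in memory, which mirrors what a query on `group_id` in a tweets collection, followed by a second lookup in the users collection, would select.

```python
from collections import Counter

# Hypothetical tweet documents: each carries its author's user_id and group_id.
tweets = [
    {"_id": 1, "user_id": 10, "group_id": "a", "text": "hello", "hashtags": ["mongodb"]},
    {"_id": 2, "user_id": 11, "group_id": "a", "text": "schema", "hashtags": ["mongodb", "nosql"]},
    {"_id": 3, "user_id": 12, "group_id": "b", "text": "other", "hashtags": ["sql"]},
]

# Hypothetical user documents: small, with no embedded tweets.
users = [
    {"_id": 10, "name": "alice", "group_id": "a"},
    {"_id": 11, "name": "bob", "group_id": "a"},
    {"_id": 12, "name": "carol", "group_id": "b"},
]


def hashtag_counts(tweet_docs, group_id):
    """Count hashtags over all tweets whose group_id matches."""
    counts = Counter()
    for doc in tweet_docs:
        if doc["group_id"] == group_id:
            counts.update(doc["hashtags"])
    return counts


def authors(tweet_docs, user_docs, group_id):
    """Second lookup: names of the users who wrote the tweets in a group."""
    ids = {doc["user_id"] for doc in tweet_docs if doc["group_id"] == group_id}
    return sorted(u["name"] for u in user_docs if u["_id"] in ids)


print(hashtag_counts(tweets, "a"))
print(authors(tweets, users, "a"))
```

With a real collection, the selection step would be a query on `group_id`, and an index on that field would likely help once the tweet collection grows.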
All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ
I think it makes the most sense to implement the following:
user
tweet