MongoDB data structure with a large number of inner documents

Posted on 2025-01-06 20:45:34

I am relatively new to MongoDB, and so far I am really impressed. I am struggling with the best way to set up my document stores, though. I am trying to do some summary analytics using Twitter data, and I am not sure whether to put the tweets into the user document or to keep them as a separate collection. It seems like putting the tweets inside the user model would quickly hit the document size limit. If that is the case, what is a good way to run MapReduce across the tweets of a group of users?

I hope I am not being too vague but I don't want to get too specific and too far down the wrong path as far as setting up my domain model.

As I am sure you are all bored of hearing, I am used to RDB land, where I would lay out my schema like this:

| USER |
--------
|ID
|Name
|Etc.

|TWEET__|
---------
|ID
|UserID
|Etc

It seems like the logical schema in Mongo would be

User
|-Tweet (0..3000)
  |-Entities
    |-Hashtags (0..10+)
    |-urls (0..5)
    |-user_mentions (0..12)
  |-GeoData (0..20)
|-somegroupID

But wouldn't that quickly bloat the User document beyond capacity? I would also like to run analysis on tweets belonging to users with the same somegroupID. Conceptually it makes sense to lay out the model as above, but at what point does that become too unwieldy? And what are viable alternatives?

Comments (2)

强者自强 2025-01-13 20:45:34

You're right that you'll probably run into the 16MB MongoDB document limit here. You haven't said what sort of analysis you'd like to run, so it is difficult to recommend a schema. MongoDB schemas are designed with the data-query (and insertion) patterns in mind.

Instead of putting your tweets in a user, you can of course quite easily do the opposite: add a user ID and a group ID to the tweet documents themselves. Then, if you need additional fields from the user, you can always pull them in with a second query upon display.

I mean a design for a tweet document as:

{
    'hashtags': [ '#foo', '#bar' ],
    'urls': [ "http://url1.example.com", 'http://url2.example.com' ],
    'user_mentions' : [ 'queen_uk' ],
    'geodata': { ... },
    'userid': 'derickr',
    'somegroupid' : 40
}

And then for a user collection, the documents could look like:

{
    'userid' : 'derickr',
    'realname' : 'Derick Rethans',
    ...
}
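
To make the two-step pattern above concrete, here is a minimal sketch in plain Python. It uses in-memory lists as a stand-in for the two collections, so it is not MongoDB code; the sample users `alice` and `bob`, their names, and the helper functions are hypothetical and only illustrate the shape of the queries. Against a real database, the first step would roughly correspond to a query on `somegroupid` and the second to a lookup on `userid`.

```python
from collections import Counter

# Hypothetical sample data shaped like the tweet and user documents above.
tweets = [
    {'hashtags': ['#foo', '#bar'], 'userid': 'derickr', 'somegroupid': 40},
    {'hashtags': ['#foo'], 'userid': 'alice', 'somegroupid': 40},
    {'hashtags': ['#baz'], 'userid': 'bob', 'somegroupid': 7},
]
users = [
    {'userid': 'derickr', 'realname': 'Derick Rethans'},
    {'userid': 'alice', 'realname': 'Alice Example'},
    {'userid': 'bob', 'realname': 'Bob Example'},
]

def hashtag_counts_for_group(tweets, group_id):
    """First query: analyse only the tweets tagged with the given group id."""
    counts = Counter()
    for tweet in tweets:
        if tweet['somegroupid'] == group_id:
            counts.update(tweet['hashtags'])
    return counts

def realnames_for(users, userids):
    """Second query: fetch display fields for the users that appeared."""
    wanted = set(userids)
    return {u['userid']: u['realname'] for u in users if u['userid'] in wanted}

group_counts = hashtag_counts_for_group(tweets, 40)
authors = {t['userid'] for t in tweets if t['somegroupid'] == 40}
names = realnames_for(users, authors)
```

Because the group id lives on each tweet, the analysis never has to load a user document; the user collection is touched only when display fields are needed.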
她比我温柔 2025-01-13 20:45:34

All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ

Chris Winslett @ MongoHQ


You will find this video interesting:

http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale

Essentially, in one document, store one day of tweets for one
person. The reasoning:

  • Querying typically consists of days and users

Therefore, you can have the following index:

{user_id: 1, date: 1} # Date needs to be last because you will range and sort on the date

Have fun!

Chris MongoHQ
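
As a rough illustration of the "one document per user per day" idea, the following plain-Python sketch groups raw tweets into daily buckets and then selects a date range for one user. It is an in-memory stand-in, not MongoDB code. The integer `YYYYMMDD` dates, the `tweets` array field, the second user id, and both helper names are assumptions made for the example.

```python
from collections import defaultdict

def bucket_by_user_and_day(tweets):
    """Group raw tweets into one document per (user_id, date) pair."""
    buckets = defaultdict(list)
    for tweet in tweets:
        buckets[(tweet['user_id'], tweet['date'])].append(tweet['text'])
    # Sorting by (user_id, date) mimics the ordering of the compound index.
    return [{'user_id': uid, 'date': day, 'tweets': texts}
            for (uid, day), texts in sorted(buckets.items())]

def days_for_user(bucket_docs, user_id, first_day, last_day):
    """Equality match on user_id, then a range on date (inclusive)."""
    return [doc for doc in bucket_docs
            if doc['user_id'] == user_id and first_day <= doc['date'] <= last_day]

# Hypothetical sample tweets.
tweets = [
    {'user_id': 123123, 'date': 20120220, 'text': 'MongoDB is pretty sweet'},
    {'user_id': 123123, 'date': 20120220, 'text': 'second tweet that day'},
    {'user_id': 123123, 'date': 20120221, 'text': 'next day'},
    {'user_id': 456456, 'date': 20120220, 'text': 'someone else'},
]
docs = bucket_by_user_and_day(tweets)
result = days_for_user(docs, 123123, 20120220, 20120220)
```

Four tweets collapse into three bucket documents here, and a query for one user over a date range touches only that user's buckets, which is the access pattern the index above is meant to serve.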


I think it makes the most sense to implement the following:

user

{ user_id: 123123,
  screen_name: 'cledwyn',
  misc_bits: {...},
  groups: ['123123_group_tall_people', '123123_group_techies'],
  groups_in: ['123123_group_tall_people']
}

tweet

{ tweet_id: 98798798798987987987987,
  user_id: 123123,
  tweet_date: 20120220,
  text: 'MongoDB is pretty sweet',
  misc_bits: {...},
  groups_in: ['123123_group_tall_people']
}
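
A small plain-Python sketch of how this layout could be used: the user's `groups_in` list is copied onto each tweet when the tweet is stored, so group-level analysis can filter tweets directly. This is an in-memory illustration only; `make_tweet`, `tweets_in_group`, the short tweet ids, and the second user are hypothetical.

```python
def make_tweet(user, tweet_id, tweet_date, text):
    """Build a tweet document, copying the user's group membership onto it."""
    return {
        'tweet_id': tweet_id,
        'user_id': user['user_id'],
        'tweet_date': tweet_date,
        'text': text,
        'groups_in': list(user['groups_in']),  # denormalized copy
    }

def tweets_in_group(tweets, group):
    """Select tweets whose groups_in array contains the given group."""
    return [t for t in tweets if group in t['groups_in']]

# Hypothetical users following the layout above.
cledwyn = {'user_id': 123123, 'screen_name': 'cledwyn',
           'groups_in': ['123123_group_tall_people']}
other = {'user_id': 456456, 'screen_name': 'someone_else', 'groups_in': []}

tweets = [
    make_tweet(cledwyn, 1, 20120220, 'MongoDB is pretty sweet'),
    make_tweet(other, 2, 20120220, 'unrelated tweet'),
]
tall = tweets_in_group(tweets, '123123_group_tall_people')
```

The trade-off of copying `groups_in` is that a change in a user's group membership is not reflected in tweets that were already stored unless they are updated as well.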