How do you design data storage for a partitioned tagging system?

Published on 2024-08-29 07:13:54

How do you design data storage for a huge tagging system (like digg or delicious)?

There has already been some discussion about this, but it assumes a centralized database. Since the data is expected to grow, sooner or later we'll need to partition it into multiple shards. So the question becomes: how do you design data storage for a partitioned tagging system?

The tagging system basically has 3 tables:

Item (item_id, item_content)

Tag (tag_id, tag_title)

TagMapping(map_id, tag_id, item_id)

That works fine for finding all items for a given tag and all tags for a given item, as long as the tables are stored in one database instance. If we need to partition the data across multiple database instances, it is not that easy.
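For reference, a minimal sketch of that single-instance schema (SQLite is used here purely for illustration; the column types and the uniqueness constraint on (tag_id, item_id) are my assumptions, not part of the original design):

    import sqlite3

    # Minimal single-instance version of the three tables, for illustration only.
    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE Item (
        item_id      INTEGER PRIMARY KEY,
        item_content TEXT
    );
    CREATE TABLE Tag (
        tag_id    INTEGER PRIMARY KEY,
        tag_title TEXT UNIQUE
    );
    CREATE TABLE TagMapping (
        map_id  INTEGER PRIMARY KEY,
        tag_id  INTEGER NOT NULL REFERENCES Tag(tag_id),
        item_id INTEGER NOT NULL REFERENCES Item(item_id),
        UNIQUE (tag_id, item_id)  -- assumed: a tag is applied to an item at most once
    );
    """)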

For the Item table, we can partition its content by its key, item_id. For the Tag table, we can partition by tag_id. For example, if we want to split the Tag table across K databases, we can simply store a given tag in database number (tag_id % K).
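That modulo routing might look like the following sketch (the shard counts and function names are made up for illustration):

    # Hypothetical shard counts; nothing here comes from the original post.
    K_TAG_SHARDS = 8
    K_ITEM_SHARDS = 16

    def tag_shard(tag_id: int) -> int:
        """Index of the database that stores the given tag."""
        return tag_id % K_TAG_SHARDS

    def item_shard(item_id: int) -> int:
        """Index of the database that stores the given item."""
        return item_id % K_ITEM_SHARDS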

But how do we partition the TagMapping table?

The TagMapping table represents the many-to-many relationship. The only approach I can imagine is duplication: the same TagMapping content is kept in two copies, one partitioned by tag_id and the other partitioned by item_id. To find the tags for a given item, we use the copy partitioned by item_id; to find the items for a given tag, we use the copy partitioned by tag_id.

As a result, there is data redundancy, and the application layer has to keep all the copies consistent. That looks hard.
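A rough sketch of what that duplication means for writes, assuming two hypothetical pools of DB connections; note that the two inserts are not wrapped in a distributed transaction, which is exactly the consistency burden described above:

    def add_mapping(map_id, tag_id, item_id, by_tag_shards, by_item_shards):
        """Write the same mapping row into both partitioned copies.

        by_tag_shards / by_item_shards are assumed to be lists of DB
        connections.  If the second insert fails, the copies diverge and the
        application has to detect and repair that.
        """
        sql = "INSERT INTO TagMapping (map_id, tag_id, item_id) VALUES (?, ?, ?)"
        row = (map_id, tag_id, item_id)
        by_tag_shards[tag_id % len(by_tag_shards)].execute(sql, row)
        by_item_shards[item_id % len(by_item_shards)].execute(sql, row)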

Is there a better solution to this many-to-many partitioning problem?


Comments (3)

秋风の叶未落 2024-09-05 07:13:54

I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!

Finding Tags for a Given Item

A1. You're going to display all of the tags for a given item at once

A2. You're going to ensure that all of an item's tags are unique

Finding Items for a Given Tag

B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)

B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags

B3. You're going to sort the items for a given tag (or tags) by some measure of popularity

Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).
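As a sketch of what one item-partitioned shard could look like (SQLite and the exact constraints are my assumptions): the UNIQUE constraint gives you A2 locally, and the tag_id index is the per-shard index mentioned in the next paragraph.

    import sqlite3

    # One item-partitioned shard.  Every row for a given item lives on the same
    # shard, so the UNIQUE constraint enforces per-item tag uniqueness locally.
    shard = sqlite3.connect(":memory:")
    shard.executescript("""
    CREATE TABLE TagMapping (
        map_id  INTEGER PRIMARY KEY,
        tag_id  INTEGER NOT NULL,
        item_id INTEGER NOT NULL,
        UNIQUE (item_id, tag_id)
    );
    CREATE INDEX idx_tagmapping_tag ON TagMapping (tag_id);
    """)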

Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have enough records to fill a page of results. How many partitions you will have to query depends on how many partitions you have, how many results you want to display and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.
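A sketch of that probing loop, assuming a list of shard connections shaped like the one above and a made-up page size:

    def items_for_tag(tag_id, shards, page_size=25):
        """Query shards one at a time until a page of item_ids is filled."""
        results = []
        for shard in shards:                     # the caller decides the order
            needed = page_size - len(results)
            rows = shard.execute(
                "SELECT item_id FROM TagMapping WHERE tag_id = ? LIMIT ?",
                (tag_id, needed),
            ).fetchall()
            results.extend(r[0] for r in rows)
            if len(results) >= page_size:
                break
        return results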

The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google Page Rank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyways.
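A toy illustration of folding a (constant) popularity bucket into the item id so that the bucket, and hence the partition order, can be derived from the id itself; the bit width and bucket granularity are invented:

    SEQ_BITS = 40        # low bits: a sequential counter (assumed width)

    def make_item_id(score_bucket, seq):
        """High bits carry the popularity bucket, low bits a sequence number."""
        return (score_bucket << SEQ_BITS) | seq

    def score_bucket_of(item_id):
        """Recover the bucket, e.g. to pick partitions in score order."""
        return item_id >> SEQ_BITS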

一紙繁鸢 2024-09-05 07:13:54

The rule is that you partition by the field you are going to query by; otherwise you'll have to look through all partitions. Are you sure you'll need to query the Tag table by tag_id only? I believe not: you'll also need to query by tag title. It's not so obvious for the Item table, but you'll probably also want to query by something like the URL, to find an item's item_id when another user assigns tags to it.

But note that the Tag and Item tables have an immutable title and URL. That means you can use the following technique:

  1. Choose the partition from the title (for Tag) or the URL (for Item).
  2. Use that partition's own sequence to generate the id.

You can either use the (partition, local id) pair as the global identifier or use non-overlapping number sets. Either way, you can now compute the partition from both the id and the title/URL field. Don't know the number of partitions in advance, or worried it might change in the future? Create more of them than you need and join them into groups, so that you can regroup them later.
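A small sketch of that technique, with an invented partition count and hash choice; the id folds the partition number in, so the partition can be computed from either the id or the title/URL:

    import hashlib

    N_PARTITIONS = 64   # assumed: more than needed, grouped onto fewer servers

    def partition_for_title(title):
        """Derive the partition from the immutable title (or URL)."""
        digest = hashlib.md5(title.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % N_PARTITIONS

    def make_id(partition, local_seq):
        """Non-overlapping number sets: each partition's ids share a residue."""
        return local_seq * N_PARTITIONS + partition

    def partition_for_id(global_id):
        return global_id % N_PARTITIONS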

Sure, you can't do the same for the TagMapping table, so you have to duplicate. You need to query it by map_id, by tag_id, and by item_id, right? So even without partitioning you already duplicate data by creating three indexes. The only difference is that you use a different partitioning (by a different field) for each index. I see no reason to worry.

素染倾城色 2024-09-05 07:13:54

Most likely your queries are going to be related to a user or a topic, meaning you should keep all info related to those in one place.

You're talking about distributing the database; usually this is mostly a synchronization issue. Reads, which typically make up about 90% of the work, can be served from a replicated database. The issue is how to update one DB and remain consistent with all the others without killing performance. This depends on the details of your scenario.

The other possibility is to partition, as you asked, all the data without overlapping. You would probably partition by user ID or topic ID. If you partition by topic ID, one database could reference all topics and just tell you which dedicated DB holds the data; you can then query the correct one. Since you partition by ID, all info related to that topic can live on that specialized database. You could also partition by language or country for an international website.
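A minimal sketch of that "one database references all topics" idea, with invented names and an in-memory dict standing in for the reference database:

    # Stand-in for the reference database that only records which dedicated DB
    # holds each topic's data.
    TOPIC_DBS = ["db-topics-0", "db-topics-1", "db-topics-2"]   # assumed names
    topic_to_db = {}

    def register_topic(topic_id):
        """Assign a new topic to a dedicated DB and remember the mapping."""
        db = TOPIC_DBS[topic_id % len(TOPIC_DBS)]
        topic_to_db[topic_id] = db
        return db

    def db_for_topic(topic_id):
        """Find which dedicated DB to query for a topic."""
        return topic_to_db[topic_id]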

Last but not least, you'll probably end up mixing the two: some non-overlapping data and some overlapping (replicated) data. First identify the common operations, then work out how to perform them against a single DB in as few queries as possible.

PS: Don't forget about caching; it'll save you more than a distributed DB will.
