如何对分布式数据实现排序和分页?

发布于 2024-09-27 04:59:59 字数 656 浏览 0 评论 0原文

这是我试图解决的问题:

我需要能够显示存储在多个数据库分片中的分页、排序的数据表。

分页和排序是众所周知的问题,当数据来自单一来源时,我们大多数人都可以通过多种方式解决它们。但是,如果您跨分片分割数据或使用 DHT 或分布式文档数据库或您喜欢的任何 NoSQL 风格,事情就会变得更加复杂。

这是一个非常小的数据集的简单图片

:数据
1 |一个
1 | D
1 | G
2 | B
2 | E
2 | H
3 | C
3 | F
3 |我

按页面排序(页面大小 = 3):

页面 |数据
1 |一个
1 | B
1 | C
2 | D
2 | E
2 | F
3 | G
3 | H
3 | I

如果我们想显示用户页面 2,我们会返回:

D
E
F

如果相关表的大小约为 1000 万行或 1 亿行,则您不能将所有数据拉到 Web/应用程序服务器上对其进行排序并返回正确的页面。显然,您不能让每个单独的分片对其自己的数据片段进行排序和分页,因为分片彼此不了解。

让事情变得复杂的是,我需要提供的数据不能太过时,因此提前预先计算一组有用的排序并存储结果以供以后检索是不切实际的。

Here's the problem I'm trying to solve:

I need to be able to display a paged, sorted table of data that is stored across several database shards.

Paging and sorting are well known problems that most of us can solve in any number of ways when the data comes from a single source. But if you're splitting your data across shards or using a DHT or distributed document database or whatever flavor of NoSQL you prefer, things get more complicated.

Here's a simple picture of a really small data set:

Shard | Data
1 | A
1 | D
1 | G
2 | B
2 | E
2 | H
3 | C
3 | F
3 | I

Sorted into pages (Page Size = 3):

Page | Data
1 | A
1 | B
1 | C
2 | D
2 | E
2 | F
3 | G
3 | H
3 | I

And if we wanted to show the user page 2, we'd return:

D
E
F

If the size of the table in question is something like 10 million rows, or 100 million, you can't just pull down all the data onto a web/application server to sort it and return the correct page. And you obviously can't let each individual shard sort and page its own slice of the data because the shards don't know about each other.

To complicate matters, the data I need to present can't be too far out of date, so pre-calculating a set of useful sorts ahead of time and storing the results for later retrieval isn't practical.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

娇女薄笑 2024-10-04 04:59:59

有多种解决方案,其中一些可能对您不可行,但也许其中一种会坚持下去:

  1. 按该值的输入范围进行分片(例如,分片 1 包含 AC,分片 2 DF 等)。或者,使用另一个带有该表外键的表作为索引,并使用该系统对索引表进行分片。这样您就可以轻松找到并获取指定范围。如果你能做到的话,这个解决方案在性能方面可能是最好的(它假设分片数量是静态的并且分片是可靠的)。
  2. 通过二分查找来识别页面项。例如,假设您想要项目 100 到 110。对于每个分片,按字典顺序计算“M”下方的值的数量。如果数字之和大于 100,则减少枢轴点,否则增加它(使用二分查找)。识别出第 100 个项目(页面上的第一个项目)后,从每个分片中取出比该项目大的前 9 (10 - 1) 个项目,获取它们,对整个列表进行排序,从列表中取出前 9 个项目,在前面添加第一项就是您的页面!这种方法更难实现,并且需要 O(log(n)) 查询,因此它比 (1) 慢,但如果负载不是很重,仍然可能相当快。
  3. 存储每个值的页码。这会给你带来极快的读取速度,但写入速度却非常慢,因此它只适用于写入很少的情况(或仅根据有序变量进行附加)。

There are several solutions, some of which may not be feasible for you, but maybe one of them will stick:

  1. Do the sharding by input ranges for this value (e.g., shard 1 contains A-C, shard 2 D-F, etc.). Alternately, use another table with foreign keys to this table as an index, and shard the index table using this system. That way you can easily locate and fetch specified ranges. This solution is probably the best in terms of performance, if you can do it (it assumes that the number of shards is static and the shards are reliable).
  2. Identify the page items by binary search. For example, say you want items 100 to 110. For each shard, count the number of values lexicographically below "M". If the sum of the numbers is above 100, reduce the pivot point, otherwise increase it (using binary search). After you identify the 100th item (the first item on your page), take top 9 (10 - 1) items larger than that item from every shard, fetch them, sort the entire list, take the top 9 from the list, prepend the first item and there's your page! This approach is more difficult to implement and will require O(log(n)) queries so it is slower than (1), but still may be reasonably fast if the load is not very heavy.
  3. Store the page number with each value. This would give you blazingly fast reads, but horribly slow writes, so it only works in the scenario where there are very few writes (or only appends in terms of the ordered variable).
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文