如何对分布式数据实现排序和分页?
这是我试图解决的问题:
我需要能够显示存储在多个数据库分片中的分页、排序的数据表。
分页和排序是众所周知的问题,当数据来自单一来源时,我们大多数人都可以通过多种方式解决它们。但是,如果您跨分片分割数据或使用 DHT 或分布式文档数据库或您喜欢的任何 NoSQL 风格,事情就会变得更加复杂。
这是一个非常小的数据集的简单图片
:数据
1 |一个
1 | D
1 | G
2 | B
2 | E
2 | H
3 | C
3 | F
3 |我
按页面排序(页面大小 = 3):
页面 |数据
1 |一个
1 | B
1 | C
2 | D
2 | E
2 | F
3 | G
3 | H
3 | I
如果我们想显示用户页面 2,我们会返回:
D
E
F
如果相关表的大小约为 1000 万行或 1 亿行,则您不能将所有数据拉到 Web/应用程序服务器上对其进行排序并返回正确的页面。显然,您不能让每个单独的分片对其自己的数据片段进行排序和分页,因为分片彼此不了解。
让事情变得复杂的是,我需要提供的数据不能太过时,因此提前预先计算一组有用的排序并存储结果以供以后检索是不切实际的。
Here's the problem I'm trying to solve:
I need to be able to display a paged, sorted table of data that is stored across several database shards.
Paging and sorting are well known problems that most of us can solve in any number of ways when the data comes from a single source. But if you're splitting your data across shards or using a DHT or distributed document database or whatever flavor of NoSQL you prefer, things get more complicated.
Here's a simple picture of a really small data set:
Shard | Data
1 | A
1 | D
1 | G
2 | B
2 | E
2 | H
3 | C
3 | F
3 | I
Sorted into pages (Page Size = 3):
Page | Data
1 | A
1 | B
1 | C
2 | D
2 | E
2 | F
3 | G
3 | H
3 | I
And if we wanted to show the user page 2, we'd return:
D
E
F
If the size of the table in question is something like 10 million rows, or 100 million, you can't just pull down all the data onto a web/application server to sort it and return the correct page. And you obviously can't let each individual shard sort and page its own slice of the data because the shards don't know about each other.
To complicate matters, the data I need to present can't be too far out of date, so pre-calculating a set of useful sorts ahead of time and storing the results for later retrieval isn't practical.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
有多种解决方案,其中一些可能对您不可行,但也许其中一种会坚持下去:
O(log(n))
查询,因此它比 (1) 慢,但如果负载不是很重,仍然可能相当快。There are several solutions, some of which may not be feasible for you, but maybe one of them will stick:
O(log(n))
queries so it is slower than (1), but still may be reasonably fast if the load is not very heavy.