I have 100 trillion elements, each with a size from 1 byte to 1 trillion bytes (0.909 TiB). How can I store and access them efficiently?
This is an interview question:

Suppose: I have 100 trillion elements, each of which has a size from 1 byte to 1 trillion bytes (0.909 TiB). How can they be stored and accessed very efficiently?

My ideas: They want to test knowledge of handling large volumes of data efficiently. It is not a question with only one correct answer. Save them into some special data structure?

Actually I have no ideas about this kind of open-ended question. Any help is really appreciated.
Comments (4)
It really depends on the data-set in question. I think the point is for you to discuss the alternatives and describe the various pros/cons.
Perhaps you should answer their question with more questions!
The data structure you choose will depend on what kinds of trade-offs you are willing to make.
For example, if you only ever need to iterate over the set sequentially, perhaps you should use a linked list, as it has a relatively small storage overhead.
If instead you need random access, you might want to look into other structures.
TL;DR: It's all problem dependent. There are many alternatives.
This is essentially the same problem faced by file systems / databases.
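The sequential-vs-random-access trade-off described above can be sketched with a hypothetical length-prefixed record log (all names here are illustrative, not part of the original answer): a plain log supports cheap sequential scans, and adding an offset index trades extra storage for direct random access.

```python
import io

def write_records(buf, records):
    """Append length-prefixed records; return an offset index for random access."""
    index = []
    for rec in records:
        index.append(buf.tell())
        buf.write(len(rec).to_bytes(8, "big"))  # 8-byte length prefix
        buf.write(rec)
    return index

def read_at(buf, offset):
    """Random access: jump straight to a record via its offset."""
    buf.seek(offset)
    length = int.from_bytes(buf.read(8), "big")
    return buf.read(length)

def scan(buf):
    """Sequential access: walk the log without needing any index."""
    buf.seek(0)
    while True:
        header = buf.read(8)
        if len(header) < 8:
            return
        yield buf.read(int.from_bytes(header, "big"))

buf = io.BytesIO()
index = write_records(buf, [b"a", b"bb" * 4, b"ccc"])
assert read_at(buf, index[2]) == b"ccc"          # random access via the index
assert list(scan(buf)) == [b"a", b"bb" * 4, b"ccc"]  # full sequential scan
```

If only scans are needed, the `index` list can be dropped entirely, which is the storage-overhead argument made above.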
I would use some distributed form of B-tree. A B-tree is able to store huge amounts of data with very good access times (the tree is usually not very deep, but very broad). Thanks to this property, it is used for indexing in relational databases. It also won't be very difficult to distribute it among many nodes (computers).

I think this answer should be sufficient for an interview...
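A toy sketch of the "broad, shallow tree across many nodes" idea (the node layout and shard names are assumptions, not the answerer's actual design): each level routes a key with a binary search over separator keys, and the leaf entries can be identifiers of remote machines.

```python
import bisect

class Node:
    def __init__(self, separators, children):
        self.separators = separators  # sorted keys splitting the key ranges
        self.children = children      # len(separators) + 1 subtrees or shard ids

def route(node, key):
    """Descend from the root to the shard responsible for `key`."""
    while isinstance(node, Node):
        node = node.children[bisect.bisect_right(node.separators, key)]
    return node  # a hypothetical shard identifier, e.g. "shard-3"

# Two-level tree of fanout 3: 3 x 3 = 9 shards, reached in 2 hops.
leaves = [
    Node([10 * i + 3, 10 * i + 6],
         [f"shard-{3 * i}", f"shard-{3 * i + 1}", f"shard-{3 * i + 2}"])
    for i in range(3)
]
root = Node([10, 20], leaves)

assert route(root, 5) == "shard-1"   # 5 falls between separators 3 and 6
assert route(root, 25) == "shard-7"  # 25 falls between separators 23 and 26
```

With a fanout in the hundreds (typical for real B-trees), a few levels are enough to cover an enormous key space, which is the "not very deep, but very broad" property mentioned above.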
The easiest and lowest-cost (at least until you massively scale up) option would be to use an existing service like Amazon S3.
Well, I would use a DHT and split the data into chunks of 8 MB. Then have a table with the file hash (SHA-1 256), filename, and chunks.

The chunks would be stored on 3 different NAS servers. Have 1200 TB NAS servers and load balancer(s) to fetch whichever of the 3 copies is more convenient to get at the time.