I have 100 trillion elements, each sized from 1 byte to 1 trillion bytes (0.909 TiB). How do I store and access them efficiently?

Published 2024-12-20 23:46:30


This is an interview question:

Suppose I have 100 trillion elements, each of them sized from 1 byte to 1 trillion bytes (0.909 TiB).
How do I store them and access them very efficiently?

My ideas:
I think they want to test knowledge about handling large volumes of data efficiently.
It is not a question with only one correct answer.

Save them into some special data structure?

Actually I have no ideas about this kind of open-ended question.

Any help is really appreciated.


Comments (4)

捎一片雪花 2024-12-27 23:46:30


It really depends on the data set in question. I think the point is for you to discuss the alternatives and describe the various pros/cons.

Perhaps you should answer their question with more questions!

  • How will it need to be accessed? (sequentially, randomly, some predictable distribution?)
  • Is the order of elements important?
  • Will the size of elements change?
  • How important is insert/remove performance?

The data structure you choose will depend on what kinds of trade-offs you are willing to make.

For example, if you only ever need to iterate over the set sequentially, perhaps you should use a linked list, as it has a relatively small storage overhead.

If instead you need random access, you might want to look into:

  • Hash tables (constant-time lookup, but they need a good hash function for the data)
  • Some kind of index / tree structure?
  • Caching! You probably won't be able to keep it all in memory; even if you can, you still want to take advantage of data locality where possible.

TL;DR: It's all problem dependent. There are many alternatives.

This is essentially the same problem faced by file systems / databases.
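To make the index/tree idea concrete for variable-size elements like these, here is a minimal sketch (the `LogStore` name and layout are invented for illustration, not from the answer): raw bytes go into an append-only log on disk, and a small index maps each element id to its (offset, length), so random reads cost one seek regardless of element size.

```python
import os
import tempfile

class LogStore:
    """Append-only byte log with an in-memory (offset, length) index."""

    def __init__(self, path):
        self.index = {}              # element id -> (offset, length)
        self._f = open(path, "ab+")  # created if missing

    def put(self, key, data: bytes):
        self._f.seek(0, os.SEEK_END)         # append at the end
        offset = self._f.tell()
        self._f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key) -> bytes:
        offset, length = self.index[key]     # one lookup, one seek
        self._f.seek(offset)
        return self._f.read(length)

path = os.path.join(tempfile.mkdtemp(), "elements.log")
store = LogStore(path)
store.put("a", b"x" * 10)
store.put("b", b"hello")
print(store.get("b"))  # b'hello'
```

At real scale the index itself no longer fits in memory, which is exactly where the B-tree / distributed-index answers below come in.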

顾忌 2024-12-27 23:46:30

我会使用某种分布式形式的 B-tree。 B 树能够以非常好的访问时间存储大量数据(树通常不是很深,但很宽)。由于此属性,它可用于关系数据库中的索引。而且将其分布在许多节点(计算机)之间也不会很困难。

我想,这个答案对于面试来说已经足够了......

I would use some distributed form of B-tree. A B-tree is able to store huge amounts of data with very good access times (the tree is usually not very deep, but very broad). Thanks to this property, it is used for indexing in relational databases. And it also won't be very difficult to distribute it among many nodes (computers).

I think that this answer should be sufficient for an interview...
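The "not very deep, but very broad" property is easy to check with quick arithmetic. The fanout of 1000 below is an assumed, typical value for a disk-based B-tree node, not something given in the answer:

```python
import math

fanout = 1000           # assumed keys per B-tree node (disk page sized)
n = 100 * 10**12        # 100 trillion elements, as in the question

# Tree depth grows logarithmically in n with base = fanout.
depth = math.ceil(math.log(n, fanout))
print(depth)  # 5 -> at most ~5 node reads per lookup
```

Since the upper levels are tiny relative to the leaves, they are typically cached in memory, so most lookups cost only one or two actual disk/network I/Os; distributing the tree then amounts to assigning key ranges (subtrees) to different machines.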

忘东忘西忘不掉你 2024-12-27 23:46:30

最简单、成本最低(至少在大规模扩展之前)的选择是使用现有的服务,例如 Amazon S3。

The easiest and lowest-cost (at least until you massively scale up) option would be to use an existing service like Amazon S3.

叫嚣ゝ 2024-12-27 23:46:30

好吧,我会使用 DHT 并将其分成 8MB 的块。然后有一个包含文件哈希 (SHA-1 256)、文件名和块的表。

这些块将存储在 3 个不同的 NAS 中。拥有 1200 TB NAS 服务器和负载均衡器,以获取当时更方便获取的 3 个副本中的任何一个。

Well, I would use a DHT and split the data into chunks of 8 MB. Then have a table with the file hash (SHA-256), file name, and chunk list.

The chunks would be stored on 3 different NAS devices. Have 1200 TB NAS servers and load balancer(s) to fetch whichever of the 3 copies is more convenient to get at the time.
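A minimal sketch of the chunking and replica placement described above. The node names, pool size, and placement rule are invented for illustration; a real DHT would place replicas on a consistent-hashing ring rather than with the simple modulo scheme used here:

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024                # 8 MB chunks, as in the answer
NODES = ["nas-0", "nas-1", "nas-2", "nas-3", "nas-4"]  # hypothetical NAS pool
REPLICAS = 3                                # 3 copies, as in the answer

def split_chunks(data: bytes):
    """Cut a blob into fixed-size chunks (last one may be shorter)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def replica_nodes(chunk: bytes):
    """Pick 3 distinct nodes for a chunk from its content hash."""
    digest = hashlib.sha256(chunk).digest()
    start = int.from_bytes(digest[:8], "big") % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

blob = b"A" * (20 * 1024 * 1024)            # one 20 MB element
chunks = split_chunks(blob)
print(len(chunks))                          # 3 chunks: 8 + 8 + 4 MB
table = {hashlib.sha256(c).hexdigest(): replica_nodes(c) for c in chunks}
```

Content-addressing the chunks has a side benefit: identical chunks hash to the same key, so duplicate data is stored only once.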
