How can I handle queries over massive data and keep the time within 1 second?

Posted 2024-09-16 08:34:53


I am thinking through a problem: suppose I have a table and the data in it keeps growing, from thousands of rows to millions, then billions.
One day, even a simple query will take several seconds to run.
So is there any technique we can use to keep the time within 1 second, or some other reasonable bound?


Comments (4)

飘过的浮云 2024-09-23 08:34:53

  1. Partitioning. The fastest I/Os you can do are the ones you don't need to do (see the sketch after this list).

  2. Indexing. As appropriate, not for every column. You can't make every query run at memory speed, so you have to pick and choose.

  3. Realism. You're not processing a billion I/Os through a relational engine in a single second.
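A minimal sketch of points 1 and 2 combined, assuming PostgreSQL syntax and a hypothetical `events` table (neither is specified in the answer above): range partitioning lets the planner skip whole partitions, and one selective composite index serves the hot query instead of indexing every column.

```sql
-- Hypothetical "events" table, range-partitioned by timestamp so a
-- date-bounded query never reads the other partitions (I/O avoided).
CREATE TABLE events (
    id         bigint      NOT NULL,
    user_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_09 PARTITION OF events
    FOR VALUES FROM ('2024-09-01') TO ('2024-10-01');

-- Index only what the hot queries filter on, not every column.
CREATE INDEX events_user_time_idx ON events (user_id, created_at);

-- Prunes to one partition, then walks one index:
SELECT count(*)
FROM events
WHERE user_id = 42
  AND created_at >= '2024-09-15'
  AND created_at <  '2024-09-16';
```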

当爱已成负担 2024-09-23 08:34:53


Sure, spread it out.

You can use something like Hive ( http://wiki.apache.org/hadoop/Hive ) for SQL queries.

It will take a few minutes per query, whether you have 100 thousand rows or 100 billion rows. Your data will live on many different computers, and through the magic of Hadoop, your query will go out to where the data lives, run against that part, and come back with the results.

Or, for faster queries with more limitations, look at HBase ( http://hbase.apache.org/#Overview ). It also sits on top of Hadoop, and is a little faster, at the cost of being less SQL-like.
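As an illustration only (none of this appears in the answer above), a HiveQL sketch over a hypothetical `page_views` table; Hive compiles the query into distributed jobs that run on the nodes holding the data:

```sql
-- Hypothetical external table over files already stored in HDFS.
CREATE EXTERNAL TABLE page_views (
    view_time TIMESTAMP,
    user_id   BIGINT,
    url       STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION '/warehouse/page_views';

-- An ordinary-looking aggregate; Hive fans it out across the cluster.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2024-09-15'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```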

甜点 2024-09-23 08:34:53


I think you should community-wiki this, as there won't be a single correct answer (or else get a lot more specific in your question).

Firstly, expanding on Tim's point about indexing. B-tree indexes are like an upside-down pyramid. Your root/'level 0' block may point to a hundred 'level 1' blocks. Each of those points to a hundred 'level 2' blocks, and each of those to a hundred 'level 3' blocks. That's a million 'level 3' blocks, which can point to a hundred million data rows. That's five reads to get to any row in that dataset (and probably all but the last two are cached in memory). One more level lifts your dataset by two orders of magnitude. Indexes scale REALLY well, so if your application use case is playing with small data volumes in a very large dataset, you're fine.
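The arithmetic behind that, sketched out (assuming the uniform fan-out of 100 pointers per block used in the paragraph above):

```latex
\[
\text{reachable rows} = f^{h}, \qquad f = 100:
\]
\[
100^{4} = 10^{8} \text{ rows in 5 reads (root + 3 index levels + 1 data block)},
\]
\[
100^{5} = 10^{10} \text{ rows for only one extra read per lookup}.
\]
```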

Partitioning can be seen as an alternative form of indexing, where you want to quickly exclude a significant part of the work.

Data warehouse appliances are a second solution, for when you expect to be working with large datasets inside even larger ones. Generally the approach is to throw disks at the problem, with or without CPUs/memory dedicated to those disks to split up the work.

Distributed databases mostly solve a different form of scalability: large numbers of concurrent users. There's only so much memory a CPU can address, and therefore only so many users a CPU can cope with before they start fighting over memory. Replication worked to a degree, especially with older-style, read-heavy applications. The problem the newer NoSQL databases are addressing is doing that while still getting consistent results, including managing backups and recoveries to restore consistency. They've generally done that by going for 'eventual consistency', accepting transient inconsistencies as the tradeoff for scalability.

I'd venture to say that there are few NoSQL databases where the data volume alone has precluded an RDBMS solution. Rather, it's the user/transaction/write volume that has pushed distributed databases.

Solid-state storage will also play a part. The problem with brown spinning disks recently has had less to do with capacity than with rotation: they can't spin fast enough to quickly access all the data you can store on them. Flash drives/cards/memory/cache basically take out the 'seek' time that is holding everything up.

自由如风 2024-09-23 08:34:53


Indexing will solve 90% of your problems. Finding one unique element out of a million in a binary tree requires traversing only about 20 nodes (log₂ 1,000,000 ≈ 20, or 0.002% of the total number of records).

Depending on the data, you could make aggregation tables. So if you were recording a statistic and sampled every 5 minutes, you could simply aggregate the data into a table with each row representing an average reading over a period of an hour, a day, and so on.
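A sketch of that idea (hypothetical `readings` and `readings_hourly` tables, PostgreSQL syntax; neither is specified in the answer above): raw 5-minute samples are periodically folded into one row per sensor per hour, so reporting queries scan twelve times fewer rows.

```sql
-- Hourly rollup table: one row per sensor per hour.
CREATE TABLE readings_hourly (
    sensor_id BIGINT,
    hour      TIMESTAMPTZ,
    avg_value DOUBLE PRECISION,
    PRIMARY KEY (sensor_id, hour)
);

-- Fold the previous hour's raw samples into the rollup;
-- re-running it for the same hour just overwrites the average.
INSERT INTO readings_hourly (sensor_id, hour, avg_value)
SELECT sensor_id,
       date_trunc('hour', sampled_at) AS hour,
       avg(value)
FROM readings
WHERE sampled_at >= date_trunc('hour', now()) - interval '1 hour'
  AND sampled_at <  date_trunc('hour', now())
GROUP BY sensor_id, date_trunc('hour', sampled_at)
ON CONFLICT (sensor_id, hour) DO UPDATE
SET avg_value = EXCLUDED.avg_value;
```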
