How can I handle queries over massive data and keep the time within 1 second?

Posted 2024-09-16 08:34:53


I am thinking through a problem: suppose I have a table and the data in it keeps growing, from thousands of rows to millions, then billions.
One day, even a simple query will take several seconds to run.
So is there any technique we can use to keep the time within 1 second, or some other reasonable bound?


Comments (4)

飘过的浮云 2024-09-23 08:34:53

  1. Partitioning. The fastest I/Os you can do are the ones you don't need to do (see the sketch after this list).

  2. Indexing. As appropriate, not for every column. You can't make every query run at memory speed, so you have to pick and choose.

  3. Realism. You're not processing a billion I/Os through a relational engine in a single second.
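A minimal sketch of points 1 and 2 combined, assuming PostgreSQL syntax and a hypothetical `events` table (neither is specified in the answer above): range partitioning lets the planner skip whole partitions, and one selective composite index serves the hot query instead of indexing every column.

```sql
-- Hypothetical "events" table, range-partitioned by timestamp so a
-- date-bounded query never reads the other partitions (I/O avoided).
CREATE TABLE events (
    id         bigint      NOT NULL,
    user_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_09 PARTITION OF events
    FOR VALUES FROM ('2024-09-01') TO ('2024-10-01');

-- Index only what the hot queries filter on, not every column.
CREATE INDEX events_user_time_idx ON events (user_id, created_at);

-- Prunes to one partition, then walks one index:
SELECT count(*)
FROM events
WHERE user_id = 42
  AND created_at >= '2024-09-15'
  AND created_at <  '2024-09-16';
```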

当爱已成负担 2024-09-23 08:34:53


Sure, spread it out.

You can use something like Hive ( http://wiki.apache.org/hadoop/Hive ) for SQL queries.

It will take a few minutes per query, whether you have 100 thousand rows or 100 billion rows. Your data will live on many different computers, and through the magic of Hadoop, your query will go out to where the data lives, run against that part, and come back with the results.

Or, for faster queries with more limitations, look at HBase ( http://hbase.apache.org/#Overview ). It also sits on top of Hadoop, and is a little faster, at the cost of being less SQL-like.
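As an illustration only (none of this appears in the answer above), a HiveQL sketch over a hypothetical `page_views` table; Hive compiles the query into distributed jobs that run on the nodes holding the data:

```sql
-- Hypothetical external table over files already stored in HDFS.
CREATE EXTERNAL TABLE page_views (
    view_time TIMESTAMP,
    user_id   BIGINT,
    url       STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION '/warehouse/page_views';

-- An ordinary-looking aggregate; Hive fans it out across the cluster.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2024-09-15'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```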

甜点 2024-09-23 08:34:53


I think you should community-wiki this, as there won't be a single correct answer (or else get a lot more specific in your question).

Firstly, expanding on Tim's point about indexing. B-tree indexes are like an upside-down pyramid. Your root/'level 0' block may point to a hundred 'level 1' blocks. Each of those points to a hundred 'level 2' blocks, and each of those to a hundred 'level 3' blocks. That's a million 'level 3' blocks, which can point to a hundred million data rows. That's five reads to get to any row in that dataset (and probably all but the last two are cached in memory). One more level lifts your dataset by two orders of magnitude. Indexes scale REALLY well, so if your application use case is playing with small data volumes in a very large dataset, you're fine.
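The arithmetic behind that, sketched out (assuming the uniform fan-out of 100 pointers per block used in the paragraph above):

```latex
\[
\text{reachable rows} = f^{h}, \qquad f = 100:
\]
\[
100^{4} = 10^{8} \text{ rows in 5 reads (root + 3 index levels + 1 data block)},
\]
\[
100^{5} = 10^{10} \text{ rows for only one extra read per lookup}.
\]
```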

Partitioning can be seen as an alternative form of indexing, where you want to quickly exclude a significant part of the work.

Data warehouse appliances are a second solution, for when you expect to be working with large datasets inside even larger ones. Generally the approach is to throw disks at the problem, with or without CPUs/memory dedicated to those disks to split up the work.

Distributed databases mostly solve a different form of scalability: large numbers of concurrent users. There's only so much memory a CPU can address, and therefore only so many users a CPU can cope with before they start fighting over memory. Replication worked to a degree, especially with older-style, read-heavy applications. The problem the newer NoSQL databases are addressing is doing that while still getting consistent results, including managing backups and recoveries to restore consistency. They've generally done that by going for 'eventual consistency', accepting transient inconsistencies as the tradeoff for scalability.

I'd venture to say that there are few NoSQL databases where the data volume alone has precluded an RDBMS solution. Rather, it's the user/transaction/write volume that has pushed distributed databases.

Solid-state storage will also play a part. The problem with brown spinning disks recently has had less to do with capacity than with rotation: they can't spin fast enough to quickly access all the data you can store on them. Flash drives/cards/memory/cache basically take out the 'seek' time that is holding everything up.

自由如风 2024-09-23 08:34:53


Indexing will solve 90% of your problems. Finding one unique element out of a million in a binary tree requires traversing only about 20 nodes (log₂ 1,000,000 ≈ 20, or 0.002% of the total number of records).

Depending on the data, you could make aggregation tables. So if you were recording a statistic and sampled every 5 minutes, you could simply aggregate the data into a table with each row representing an average reading over a period of an hour, a day, and so on.
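A sketch of that idea (hypothetical `readings` and `readings_hourly` tables, PostgreSQL syntax; neither is specified in the answer above): raw 5-minute samples are periodically folded into one row per sensor per hour, so reporting queries scan twelve times fewer rows.

```sql
-- Hourly rollup table: one row per sensor per hour.
CREATE TABLE readings_hourly (
    sensor_id BIGINT,
    hour      TIMESTAMPTZ,
    avg_value DOUBLE PRECISION,
    PRIMARY KEY (sensor_id, hour)
);

-- Fold the previous hour's raw samples into the rollup;
-- re-running it for the same hour just overwrites the average.
INSERT INTO readings_hourly (sensor_id, hour, avg_value)
SELECT sensor_id,
       date_trunc('hour', sampled_at) AS hour,
       avg(value)
FROM readings
WHERE sampled_at >= date_trunc('hour', now()) - interval '1 hour'
  AND sampled_at <  date_trunc('hour', now())
GROUP BY sensor_id, date_trunc('hour', sampled_at)
ON CONFLICT (sensor_id, hour) DO UPDATE
SET avg_value = EXCLUDED.avg_value;
```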
