为什么这个基于范围的查询如此快

发布于 2024-10-03 20:55:06 字数 1236 浏览 0 评论 0原文

在工作中,我们对具有以下结构的表进行了查询:

ip_from(number), ip_to(number), country, city, state, isp, latitude, longitude.

该表大约有 610 万行。

为了找出给定 IP 地址的详细信息,我们使用了如下查询:

SELECT * 
  FROM Ip2location
WHERE
  :ip_num BETWEEN ip_from AND ip_to;

在我们的开发数据库中的 Oracle 10 上,返回一行大约需要 17 秒,具体取决于传入的 ip_num。我们更强大的实时系统可能需要 5-6 秒,这对于实时来说仍然太慢,我们需要通过后台作业来选择它。

理想,尤其是当我们的实时系统确实需要 IP 详细信息时。

使用的索引类型是标准 BTREE 索引,涵盖 ip_fromip_to。我们研究了很多东西来尝试加快速度,例如范围分区。我们最终没有应用它,因为它需要 Oracle Enterprise。我们还考虑了增加表的并发性,但这没有明显的效果。

无论如何,当我早上喝咖啡时,我意识到我认为通过运行以下查询可以提高性能:(这是从记忆中得出的,可能有一些错误。此外,我们选择了单个字段,而不是所有字段)

SELECT * 
  FROM ip2location
WHERE 
  ip_from = (
    SELECT max(ip_from)
      FROM ip2location
      WHERE ip_from <= :ip_num
  )
AND
  ip_to >= ip_num;

这适用于我们的数据设置是因为 ip_fromip_to 之间没有重叠范围。

然而,我没有准备好的是第二个查询的速度有多快。我们的开发数据库的时间从 17 秒减少到 0.007 秒。

这对我来说毫无意义。我预计性能会有所提高,但不会那么多。难道数据库统计数据不应该发现没有重叠并进行相应的优化吗?还必须有一种公认的更快的方法来选择使用范围?

我的问题是:为什么即使使用子选择第二个查询也会快得多?

At work we had a query on a table that had the following structure:

ip_from(number), ip_to(number), country, city, state, isp, latitude, longitude.

This table had approx 6.1 million rows.

To find out the details for a given IP address we used a query like the following:

SELECT * 
  FROM Ip2location
WHERE
  :ip_num BETWEEN ip_from AND ip_to;

On Oracle 10 in our dev database this took approximately 17 seconds to return a row, depending on the ip_num passed in. On our beefier live system it took maybe 5-6 seconds, which was still too slow to do in real time and we needed to select this via a background job.

Not ideal, especially as our real time systems really needed the ip details.

The type of index used was a standard BTREE index spanning both ip_from and ip_to. We looked into a lot of things to try to speed this up such as range partitioning. We didn't apply that in the end as it requires Oracle Enterprise. We also looked at increasing the concurrency of the table but that had no noticeable effect.

Anyway when having my morning coffee I realised that I thought there could be a performance enhancement by running the following query: (This is from memory, there may be a couple of mistakes. Also we selected individual fields not everything)

SELECT * 
  FROM ip2location
WHERE 
  ip_from = (
    SELECT max(ip_from)
      FROM ip2location
      WHERE ip_from <= :ip_num
  )
AND
  ip_to >= ip_num;

This works for our data set because there's no overlapping ranges between ip_from and ip_to.

However what I wasn't prepared for is how much faster the second query is. The time on our dev database was reduced from 17 seconds to 0.007 seconds.

This makes little sense to me. I'd expect some performance increase, but not that much. Shouldn't the database statistics have figured out there is no overlap and optimised accordingly? Also there has to be a recognised quicker way to select using ranges?

My question is: why is the second query so much faster even using a sub-select?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

黎歌 2024-10-10 20:55:06

性能提升是显而易见的。这是因为 ip_from 上有一个索引,所以可以在恒定时间内获得 max(ip_from) ,因为正如你所知,索引对值进行排序。由于 b 树上的二分搜索,范围也很容易计算。

而在上一个查询中必须对所有数据进行表扫描以计算范围边界

the performance increase is obvious. Its because there is an index on ip_from , so max(ip_from) can be obtained in constant time because as you know indexing sorts out the values. the range is also easily computed because of binary search over the btree.

while in the previous query has to do a table scan all over the data to compute the range bounds

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文