为什么这个基于范围的查询如此快

发布于 2024-10-03 20:55:06 字数 1236 浏览 0 评论 0原文

在工作中，我们对具有以下结构的表进行了查询：

ip_from(number), ip_to(number), country, city, state, isp, latitude, longitude.

该表大约有 610 万行。

为了找出给定 IP 地址的详细信息，我们使用了如下查询：

SELECT * 
  FROM Ip2location
WHERE
  :ip_num BETWEEN ip_from AND ip_to;

在我们的开发数据库中的 Oracle 10 上，返回一行大约需要 17 秒，具体取决于传入的 ip_num。我们更强大的实时系统可能需要 5-6 秒，这对于实时来说仍然太慢，我们需要通过后台作业来选择它。

不理想，尤其是当我们的实时系统确实需要 IP 详细信息时。

使用的索引类型是标准 BTREE 索引，涵盖 ip_from 和 ip_to。我们研究了很多东西来尝试加快速度，例如范围分区。我们最终没有应用它，因为它需要 Oracle Enterprise。我们还考虑了增加表的并发性，但这没有明显的效果。

无论如何，当我早上喝咖啡时，我意识到我认为通过运行以下查询可以提高性能：（这是从记忆中得出的，可能有一些错误。此外，我们选择了单个字段，而不是所有字段）

SELECT * 
  FROM ip2location
WHERE 
  ip_from = (
    SELECT max(ip_from)
      FROM ip2location
      WHERE ip_from <= :ip_num
  )
AND
  ip_to >= ip_num;

这适用于我们的数据设置是因为 ip_from 和 ip_to 之间没有重叠范围。

然而，我没有准备好的是第二个查询的速度有多快。我们的开发数据库的时间从 17 秒减少到 0.007 秒。

这对我来说毫无意义。我预计性能会有所提高，但不会那么多。难道数据库统计数据不应该发现没有重叠并进行相应的优化吗？还必须有一种公认的更快的方法来选择使用范围？

我的问题是：为什么即使使用子选择第二个查询也会快得多？

原文

At work we had a query on a table that had the following structure:

ip_from(number), ip_to(number), country, city, state, isp, latitude, longitude.

This table had approx 6.1 million rows.

To find out the details for a given IP address we used a query like the following:

SELECT * 
  FROM Ip2location
WHERE
  :ip_num BETWEEN ip_from AND ip_to;

On Oracle 10 in our dev database this took approximately 17 seconds to return a row, depending on the ip_num passed in. On our beefier live system it took maybe 5-6 seconds, which was still too slow to do in real time and we needed to select this via a background job.

Not ideal, especially as our real time systems really needed the ip details.

The type of index used was a standard BTREE index spanning both ip_from and ip_to. We looked into a lot of things to try to speed this up such as range partitioning. We didn't apply that in the end as it requires Oracle Enterprise. We also looked at increasing the concurrency of the table but that had no noticeable effect.

Anyway when having my morning coffee I realised that I thought there could be a performance enhancement by running the following query: (This is from memory, there may be a couple of mistakes. Also we selected individual fields not everything)

SELECT * 
  FROM ip2location
WHERE 
  ip_from = (
    SELECT max(ip_from)
      FROM ip2location
      WHERE ip_from <= :ip_num
  )
AND
  ip_to >= ip_num;

This works for our data set because there's no overlapping ranges between ip_from and ip_to.

However what I wasn't prepared for is how much faster the second query is. The time on our dev database was reduced from 17 seconds to 0.007 seconds.

This makes little sense to me. I'd expect some performance increase, but not that much. Shouldn't the database statistics have figured out there is no overlap and optimised accordingly? Also there has to be a recognised quicker way to select using ranges?

My question is: why is the second query so much faster even using a sub-select?

分享到QQ

分享到微博