MySQL“IN”运算符在(大量?)值上的性能
我最近一直在尝试 Redis 和 MongoDB,似乎经常会在 MongoDB 或 Redis 中存储 id 数组。对于这个问题,我将继续使用 Redis,因为我正在询问 MySQL IN 运算符。
我想知道在 IN 运算符内列出大量(300-3000)id 的性能如何,看起来像这样:
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
想象一下像 products 这样简单的东西您通常可以将它们连接在一起以获取特定类别中的产品。在上面的示例中,您可以看到,在 Redis 中的给定类别 ( category:4:product_ids
) 下,我返回了 id 为 4 的类别中的所有产品 id,并将它们放在上面的 中
查询。IN
运算符内的 SELECT
这有多高效?
这是“视情况而定”的情况吗?或者是否有一个具体的“这是(不)可接受的”或“快”或“慢”,或者我应该添加一个LIMIT 25
,还是没有帮助?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
LIMIT 25
或者我应该修剪 Redis 返回的产品 id 数组,将其限制为 25,并且仅向查询添加 25 个 id,而不是 3000,然后从查询内部将其限制为 25?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 25)
非常感谢任何建议/反馈!
I have been experimenting with Redis and MongoDB lately and it would seem that there are often cases where you would store an array of id's in either MongoDB or Redis. I'll stick with Redis for this question since I am asking about the MySQL IN operator.
I was wondering how performant it is to list a large number (300-3000) of id's inside the IN operator, which would look something like this:
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
Imagine something as simple as a products and categories table which you might normally JOIN together to get the products from a certain category. In the example above you can see that under a given category in Redis ( category:4:product_ids
) I return all the product ids from the category with id 4, and place them in the above SELECT
query inside the IN
operator.
How performant is this?
Is this an "it depends" situation? Or is there a concrete "this is (un)acceptable" or "fast" or "slow" or should I add a LIMIT 25
, or doesn't that help?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
LIMIT 25
Or should I trim the array of product id's returned by Redis to limit it to 25 and only add 25 id's to the query rather than 3000 and LIMIT
-ing it to 25 from inside the query?
SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 25)
Any suggestions/feedback is much appreciated!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
一般来说,如果
IN
列表变得太大(对于某些定义不明确的“太大”值,通常在 100 或更小范围内),使用连接会变得更有效,如果需要的话创建一个临时表来保存数字。如果数字是密集的集合(没有间隙 - 示例数据表明),那么您可以使用
WHERE id BETWEEN 300 AND 3000
做得更好。但是,假设集合中存在间隙,此时最好使用有效值列表(除非间隙数量相对较少,在这种情况下您可以使用:
或者无论间隙是什么。
Generally speaking, if the
IN
list gets too large (for some ill-defined value of 'too large' that is usually in the region of 100 or smaller), it becomes more efficient to use a join, creating a temporary table if need so be to hold the numbers.If the numbers are a dense set (no gaps - which the sample data suggests), then you can do even better with
WHERE id BETWEEN 300 AND 3000
.However, presumably there are gaps in the set, at which point it may be better to go with the list of valid values after all (unless the gaps are relatively few in number, in which case you could use:
Or whatever the gaps are.
我一直在做一些测试,正如 David Fells 在他的回答中所说,它得到了很好的优化。作为参考,我创建了一个包含 1,000,000 个寄存器的 InnoDB 表,并使用“IN”运算符对 500,000 个随机数进行选择,在我的 MAC 上只需要 2.5 秒;仅选择偶数寄存器需要 0.5 秒。
我遇到的唯一问题是我必须增加
my.cnf
文件中的max_allowed_packet
参数。如果没有,就会生成一个神秘的“MYSQL 已消失”错误。这是我用来进行测试的 PHP 代码:
以及结果:
I have been doing some tests, and as David Fells says in his answer, it is quite well optimized. As a reference, I have created an InnoDB table with 1,000,000 registers and doing a select with the "IN" operator with 500,000 random numbers, it takes only 2.5 seconds on my MAC; selecting only the even registers takes 0.5 seconds.
The only problem that I had is that I had to increase the
max_allowed_packet
parameter from themy.cnf
file. If not, a mysterious “MYSQL has gone away” error is generated.Here is the PHP code that I use to make the test:
And the results:
您可以创建一个临时表,在其中可以放置任意数量的 ID 并运行嵌套查询
示例:
并选择:
You can create a temporary table where you can put any number of IDs and run nested query
Example:
and select:
实际上,在大量记录列表上使用带有大量参数集的
IN
会很慢。在我最近解决的案例中,我有两个 where 子句,一个有 2,50 个参数,另一个有 3,500 个参数,查询包含 4000 万条记录的表。
我使用标准
WHERE IN
进行查询花了 5 分钟。通过对 IN 语句使用子查询(将参数放入其自己的索引表中),我将查询时间缩短到了两秒。根据我的经验,曾在 MySQL 和 Oracle 工作过。
Using
IN
with a large parameter set on a large list of records will in fact be slow.In the case that I solved recently I had two where clauses, one with 2,50 parameters and the other with 3,500 parameters, querying a table of 40 Million records.
My query took 5 minutes using the standard
WHERE IN
. By instead using a subquery for the IN statement (putting the parameters in their own indexed table), I got the query down to TWO seconds.Worked for both MySQL and Oracle in my experience.
IN
很好,而且优化得很好。确保在索引字段上使用它就可以了。它在功能上等同于:
就数据库引擎而言。
编辑:请注意这个答案是在 2011 年写的,并查看这个答案的评论,讨论最新的 MySQL 功能。
IN
is fine, and well optimized. Make sure you use it on an indexed field and you're fine.It's functionally equivalent to:
As far as the DB engine is concerned.
EDIT: Please notice this answer was written in 2011, and see the comments of this answer discussing the latest MySQL features.
当您为
IN
运算符提供许多值时,它首先必须对其进行排序以删除重复项。至少我怀疑是这样。因此提供太多值并不好,因为排序需要 N log N 时间。我的经验证明,将值集分割成更小的子集并组合应用程序中所有查询的结果可以提供最佳性能。我承认我在不同的数据库(Pervasive)上积累了经验,但这同样适用于所有引擎。我每组的值计数是 500-1000。或多或少明显变慢了。
When you provide many values for the
IN
operator it first must sort it to remove duplicates. At least I suspect that. So it would be not good to provide too many values, as sorting takes N log N time.My experience proved that slicing the set of values into smaller subsets and combining the results of all the queries in the application gives best performance. I admit that I gathered experience on a different database (Pervasive), but the same may apply to all the engines. My count of values per set was 500-1000. More or less was significantly slower.