MySQL ranking with frequent updates and a large data set: how to get the best performance?
I want grouped ranking on a very large table. I've found a couple of solutions for this problem, e.g. in this post and other places on the web. I am, however, unable to figure out the worst-case complexity of these solutions. The specific problem consists of a table where each row has a number of points and an associated name. I want to be able to request rank intervals such as 1-4. Here is some example data:
name | points
Ab | 14
Ac | 14
B | 16
C | 16
Da | 15
De | 13
With these values the following "ranking" is created:
Query id | Rank | Name
1 | 1 | B
2 | 1 | C
3 | 3 | Da
4 | 4 | Ab
5 | 4 | Ac
6 | 6 | De
It should then be possible to request an interval of query ids, e.g. 2-5, which gives the ranks 1, 3, 4 and 4.
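To pin down the semantics of the example, here is a minimal Python sketch of the ranking described above (rows sharing the same points share a rank, and the next rank skips ahead). The function names `rank_rows` and `rank_interval` are made up for illustration, and the sketch sorts everything in memory, so it only shows the expected output, not a log(n) solution.

```python
from typing import List, Tuple

Row = Tuple[str, int]            # (name, points)
Ranked = Tuple[int, int, str]    # (query id, rank, name)

def rank_rows(rows: List[Row]) -> List[Ranked]:
    """Order by points descending, then name ascending, and assign ranks.

    Rows with equal points get the same rank; the rank of a new point
    value equals its 1-based position (query id) in the ordering.
    """
    ordered = sorted(rows, key=lambda r: (-r[1], r[0]))
    ranked: List[Ranked] = []
    prev_points = None
    rank = 0
    for query_id, (name, points) in enumerate(ordered, start=1):
        if points != prev_points:
            rank = query_id
            prev_points = points
        ranked.append((query_id, rank, name))
    return ranked

def rank_interval(rows: List[Row], from_id: int, till_id: int) -> List[Ranked]:
    """Return the ranked rows whose query id lies in [from_id, till_id]."""
    return [r for r in rank_rows(rows) if from_id <= r[0] <= till_id]

ROWS = [("Ab", 14), ("Ac", 14), ("B", 16), ("C", 16), ("Da", 15), ("De", 13)]

if __name__ == "__main__":
    for query_id, rank, name in rank_interval(ROWS, 2, 5):
        print(query_id, rank, name)
```

For the sample data, the interval 2-5 yields the ranks 1, 3, 4 and 4, matching the table above.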
The database holds about 3 million records, so if possible I want to avoid a solution with complexity greater than log(n). There are constant updates and inserts on the database, so these operations should preferably run in log(n) complexity as well. I am not sure it's possible, though, and I've been trying to wrap my head around it for some time. I've come to the conclusion that a binary search should be possible, but I haven't been able to create a query that does this. I am using a MySQL server.
I will elaborate on how the pseudocode for the filtering could work. First, an index on (points, name) is needed. As input you give a fromrank and a tillrank. The total number of records in the database is n. The pseudocode should look something like this:
Find the median point value and count the rows with fewer points than this value (the count gives a rough estimate of rank, not considering rows with the same number of points). If the number returned is greater than the fromrank delimiter, we subdivide the first half and find its median. We keep doing this until we have pinpointed the point value where fromrank should start. Then we do the same within that point value using the name index, taking medians until we reach the correct row. We do exactly the same thing for tillrank.
The result should be log(n) subdivisions. So, given that the median and the count can each be computed in log(n) time, it should be possible to solve the problem in worst-case complexity log(n). Correct me if I am wrong.
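The subdivision idea can be sketched in Python on an in-memory sorted list, under the question's own premise that counting rows on one side of a point value is cheap. Here `bisect_right` stands in for that count. The function names are hypothetical, and the sketch counts rows with more points, since rank 1 goes to the highest score. On a plain list the row at a position could simply be indexed directly; the loop is written out only to mirror the count-and-halve procedure, and it says nothing about whether MySQL can perform each count in log(n) time.

```python
from bisect import bisect_right
from typing import List

def count_greater(sorted_points: List[int], value: int) -> int:
    """Number of rows with strictly more points than `value` (list is ascending)."""
    return len(sorted_points) - bisect_right(sorted_points, value)

def points_at_position(sorted_points: List[int], position: int) -> int:
    """Find the point value held by the row at a 1-based query id.

    Binary search over the distinct point values: look for the smallest
    value v such that fewer than `position` rows have more points than v.
    """
    distinct = sorted(set(sorted_points))  # built eagerly here, for illustration only
    lo, hi = 0, len(distinct) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if count_greater(sorted_points, distinct[mid]) >= position:
            lo = mid + 1   # target row has more points than distinct[mid]
        else:
            hi = mid       # target row has distinct[mid] points or fewer
    return distinct[lo]

def rank_at_position(sorted_points: List[int], position: int) -> int:
    """Rank of the row at `position`: one more than the rows with more points."""
    return count_greater(sorted_points, points_at_position(sorted_points, position)) + 1

POINTS = sorted([14, 14, 16, 16, 15, 13])

if __name__ == "__main__":
    print([rank_at_position(POINTS, k) for k in range(1, len(POINTS) + 1)])
```

Applied to the sample data, the ranks for query ids 1 through 6 come out as 1, 1, 3, 4, 4, 6.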
Comments (1)
You need a stored procedure to be able to call this with parameters.
If you create the index and force MySQL to use it (as in my query), then the complexity of the query will not depend on the number of rows at all; it will depend only on tillrank. It will actually take the last tillrank values from the index, perform some simple calculations on them, and filter out the first fromrank values. The time of this operation, as you can see, depends only on tillrank, not on how many records there are. I just checked on 400,000 rows: it selects ranks from 5 to 100 in 0.004 seconds (that is, instantly).
Important: this only works if you sort on names in DESCENDING order. MySQL does not support the DESC clause in indexes (descending indexes were only added in MySQL 8.0), which means that points and name must be sorted in the same order for INDEX SORT to be usable (either both ASCENDING or both DESCENDING). If you want fast ASC sorting by name, you will need to keep negative points in the database and change the sign in the SELECT clause.
You may also remove name from the index altogether and perform the final ORDER'ing without using an index. That will impact performance on big ranges, but you will hardly notice it on small ranges.
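The original query is not reproduced here, so the following is only a rough Python sketch of the approach as described: read the first tillrank rows in index order, compute ranks in a single pass, and drop the rows before fromrank. Python's built-in sqlite3 stands in for MySQL, and the table name `scores`, the index name, and the function `top_ranks` are all assumptions made for illustration. The ranking is done in Python rather than in a stored procedure.

```python
import sqlite3
from typing import List, Tuple

def top_ranks(conn: sqlite3.Connection, fromrank: int, tillrank: int) -> List[Tuple[int, int, str]]:
    """Return (position, rank, name) for positions fromrank..tillrank inclusive.

    Only the first `tillrank` rows are read, both columns sorted descending
    so that an index on (points, name) can supply the order.
    """
    cur = conn.execute(
        "SELECT name, points FROM scores "
        "ORDER BY points DESC, name DESC LIMIT ?",
        (tillrank,),
    )
    result = []
    prev_points, rank = None, 0
    for pos, (name, points) in enumerate(cur, start=1):
        if points != prev_points:      # a new point value starts a new rank
            rank, prev_points = pos, points
        if pos >= fromrank:            # skip the rows before fromrank
            result.append((pos, rank, name))
    return result

# Hypothetical schema and sample data taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, points INTEGER)")
conn.execute("CREATE INDEX ix_scores_points_name ON scores (points, name)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("Ab", 14), ("Ac", 14), ("B", 16), ("C", 16), ("Da", 15), ("De", 13)],
)

if __name__ == "__main__":
    for row in top_ranks(conn, 2, 5):
        print(row)
```

Because names are sorted descending here, tied rows appear in reverse alphabetical order (C before B), but the ranks for positions 2-5 are still 1, 3, 4 and 4. The amount of work grows with tillrank only, which is the property the answer relies on.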