Natural sorting in MySQL
Is there an elegant way to have performant, natural sorting in a MySQL database?
For example if I have this data set:
- Final Fantasy
- Final Fantasy 4
- Final Fantasy 10
- Final Fantasy 12
- Final Fantasy 12: Chains of Promathia
- Final Fantasy Adventure
- Final Fantasy Origins
- Final Fantasy Tactics
Is there any elegant solution other than splitting up the games' names into their components
- Title: "Final Fantasy"
- Number: "12"
- Subtitle: "Chains of Promathia"
to make sure that they come out in the right order? (10 after 4, not before 2).
Doing so is a pain in the a** because every now and then there's another game that breaks that mechanism of parsing the game title (e.g. "Warhammer 40,000", "James Bond 007")
Comments (23)
Just found this:
Does a natural sort when the numbers are at the front, might work for middle as well.
I've written this function for MSSQL 2000 a while ago:
I think this is why a lot of things are sorted by release date.
A solution could be to create another column in your table for the "SortKey". This could be a sanitized version of the title which conforms to a pattern you create for easy sorting or a counter.
Same function as posted by @plalx, but rewritten to MySQL:
Usage:
Regarding the best response from Richard Toth https://stackoverflow.com/a/12257917/4052357
Watch out for UTF8 encoded strings that contain 2-byte (or more) characters and numbers e.g.
Using MySQL's LENGTH() in the udf_NaturalSortFormat function will return the byte length of the string and be incorrect; instead use CHAR_LENGTH(), which will return the correct character length.
In my case using LENGTH() caused queries to never complete and resulted in 100% CPU usage for MySQL.
p.s. I would have added this as a comment to the original but I don't have enough reputation (yet)
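The byte/character difference is easy to reproduce outside MySQL. The following is only a Python analogue, not the MySQL function itself: the UTF-8 encoded byte count plays the role of LENGTH() and len() the role of CHAR_LENGTH(). The sample title is made up for illustration.

```python
# Illustrative analogue only: MySQL's LENGTH() counts bytes,
# CHAR_LENGTH() counts characters.
title = "Pok\u00e9mon 10"  # hypothetical sample; the accented e takes 2 bytes in UTF-8

byte_length = len(title.encode("utf-8"))  # what LENGTH() would report
char_length = len(title)                  # what CHAR_LENGTH() would report

print(byte_length, char_length)
```

Any loop that walks a string by character position but uses the byte length as its bound will be off whenever the two values differ.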
Add a Sort Key (Rank) in your table.
ORDER BY rank
Utilise the "Release Date" column.
ORDER BY release_date
When extracting the data from SQL, make your object do the sorting. For example, if extracting into a Set, make it a TreeSet, make your data model implement Comparable, and implement the natural sort algorithm there (insertion sort will suffice if you are using a language without collections), since you'll be reading the rows from SQL one by one as you create your model and insert it into the collection.
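A minimal sketch of that third option, in Python rather than Java: bisect.insort stands in for the TreeSet and the comparison methods stand in for Comparable. The Game class and the hard-coded titles are hypothetical stand-ins for a real data model and for rows read from SQL.

```python
import bisect
import re
from functools import total_ordering


@total_ordering
class Game:
    """Hypothetical row model ordered by the natural order of its title."""

    def __init__(self, title):
        self.title = title
        chunks = re.split(r"([0-9]+)", title)
        # Odd positions hold the digit runs captured by the pattern.
        self._key = [int(c) if i % 2 else c.lower() for i, c in enumerate(chunks)]

    def __eq__(self, other):
        return self._key == other._key

    def __lt__(self, other):
        return self._key < other._key


games = []
# Stand-in for rows arriving one by one from a query.
for title in ["Final Fantasy 10", "Final Fantasy 4", "Final Fantasy 12"]:
    bisect.insort(games, Game(title))  # keeps the list sorted on every insert

print([g.title for g in games])
```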
To order:
0
1
2
10
23
101
205
1000
a
aac
b
casdsadsa
css
Use this query:
Add a field for "sort key" that has all strings of digits zero-padded to a fixed length and then sort on that field instead.
If you might have long strings of digits, another method is to prepend the number of digits (fixed-width, zero-padded) to each string of digits. For example, if you won't have more than 99 digits in a row, then for "Super Blast 10 Ultra" the sort key would be "Super Blast 0210 Ultra".
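The length-prefix variant can be sketched as follows. This is Python rather than SQL and only illustrates how the key is built; length_prefixed_key and its width parameter are names assumed here, and the titles other than "Super Blast 10 Ultra" are invented.

```python
import re


def length_prefixed_key(text, width=2):
    """Prefix every run of ASCII digits with its zero-padded digit count."""

    def prefix(match):
        digits = match.group(0)
        return f"{len(digits):0{width}d}{digits}"

    return re.sub(r"[0-9]+", prefix, text)


titles = ["Super Blast 10 Ultra", "Super Blast 9", "Super Blast 100"]
print(length_prefixed_key("Super Blast 10 Ultra"))
print(sorted(titles, key=length_prefixed_key))
```

With width=2 the sketch covers digit runs of up to 99 digits, matching the limit stated above. It does not strip leading zeros, so "007" and "7" get different prefixes.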
If you do not want to reinvent the wheel or have a headache with a lot of code that does not work, just use Drupal Natural Sort. Just run the SQL that comes zipped (MySQL or PostgreSQL), and that's it. When making a query, simply order using:
Another option is to do the sorting in memory after pulling the data from MySQL. While it won't be the best option from a performance standpoint, if you are not sorting huge lists you should be fine.
If you take a look at Jeff's post, you can find plenty of algorithms for whatever language you might be working with.
Sorting for Humans : Natural Sort Order
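For instance, a common in-memory approach splits each title into digit and non-digit chunks and compares the digit chunks as integers. The sketch below is in Python and is not taken from the linked post; natural_key is a name assumed here, and the list reuses the titles from the question.

```python
import re


def natural_key(text):
    """Split into text and digit chunks; digit chunks compare as integers."""
    chunks = re.split(r"([0-9]+)", text)
    # re.split with a capture group alternates text, digits, text, ...
    # so odd positions are always the digit runs.
    return [int(c) if i % 2 else c.lower() for i, c in enumerate(chunks)]


titles = [
    "Final Fantasy 10",
    "Final Fantasy",
    "Final Fantasy 12: Chains of Promathia",
    "Final Fantasy 4",
    "Final Fantasy Tactics",
    "Final Fantasy 12",
    "Final Fantasy Adventure",
    "Final Fantasy Origins",
]

for title in sorted(titles, key=natural_key):
    print(title)
```

Because every key alternates text and integer chunks in the same positions, two keys never compare a string against an integer.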
You can also create the "sort column" dynamically:
That way, you can create groups to sort.
In my query, I wanted the '-' in front of everything, then the numbers, then the text, which could result in something like:
That way you don't have to maintain the sort column in the correct order as you add data. You can also change your sort order depending on what you need.
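The grouping idea can be illustrated with a Python sort key that computes a group rank on the fly. This is not the query from this answer: group_key and the sample values are assumptions, and "the '-' in front" is interpreted here as values that start with "-".

```python
def group_key(value):
    """Rank '-' entries first, then numeric strings by value, then other text."""
    if value.startswith("-"):
        return (0, 0, value)
    if value.isascii() and value.isdigit():
        return (1, int(value), "")
    return (2, 0, value.lower())


rows = ["b", "10", "-", "2", "a", "100"]  # hypothetical column values
print(sorted(rows, key=group_key))
```

Changing the order of the groups only means changing the first element of each tuple, which mirrors the flexibility described above.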
A lot of other answers I see here (and in the duplicate questions) basically only work for very specifically formatted data, e.g. a string that's entirely a number, or for which there's a fixed-length alphabetic prefix. This isn't going to work in the general case.
It's true that there's not really any way to implement a 100% general nat-sort in MySQL, because to do it what you really need is a modified comparison function, that switches between lexicographic sorting of the strings and numeric sort if/when it encounters a number. Such code could implement any algorithm you could desire for recognising and comparing the numeric portions within two strings. Unfortunately, though, the comparison function in MySQL is internal to its code, and cannot be changed by the user.
This leaves a hack of some kind, where you try to create a sort key for your string in which the numeric parts are re-formatted so that the standard lexicographic sort actually sorts them the way you want.
For plain integers up to some maximum number of digits, the obvious solution is to simply left-pad them with zeros so that they're all fixed width. This is the approach taken by the Drupal plugin, and the solutions of @plalx / @RichardToth. (@Christian has a different and much more complex solution, but it offers no advantages that I can see).
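As a rough illustration of the left-padding idea, here is a sketch in Python rather than the MySQL code those answers use. zero_padded_key and its width parameter are names assumed here.

```python
import re


def zero_padded_key(text, width=10):
    """Left-pad every run of ASCII digits with zeros to a fixed width."""
    return re.sub(r"[0-9]+", lambda m: m.group(0).zfill(width), text)


titles = ["Final Fantasy 12", "Final Fantasy 4", "Final Fantasy 10"]
print(sorted(titles, key=zero_padded_key))
```

A number with more digits than width is left unpadded and sorts in the wrong place, which is why this approach only works up to a maximum number of digits.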
As @tye points out, you can improve on this by prepending a fixed-digit length to each number, rather than simply left-padding it. There's much, much more you can improve on, though, even given the limitations of what is essentially an awkward hack. Yet there don't seem to be any pre-built solutions out there!
For example, what about:
Extending on @tye's method, I've created a fairly compact NatSortKey() stored function that will convert an arbitrary string into a nat-sort key, and that handles all of the above cases, is reasonably efficient, and preserves a total sort-order (no two different strings have sort keys that compare equal). A second parameter can be used to limit the number of numbers processed in each string (e.g. to the first 10 numbers, say), which can be used to ensure the output fits within a given length.
NOTE: Sort-key string generated with a given value of this 2nd parameter should only be sorted against other strings generated with the same value for the parameter, or else they might not sort correctly!
You can use it directly in ordering, e.g.
But for efficient sorting of large tables, it's better to pre-store the sort key in another column (possibly with an index on it):
[Ideally, you'd make this happen automatically by creating the key column as a computed stored column, using something like:
But for now neither MySQL nor MariaDB allow stored functions in computed columns, so unfortunately you can't yet do this.]
My function affects sorting of numbers only. If you want to do other sort-normalization things, such as removing all punctuation, or trimming whitespace off each end, or replacing multi-whitespace sequences with single spaces, you could either extend the function, or it could be done before or after NatSortKey() is applied to your data. (I'd recommend using REGEXP_REPLACE() for this purpose.)
It's also somewhat Anglo-centric in that I assume '.' for a decimal point and ',' for the thousands-separator, but it should be easy enough to modify if you want the reverse, or if you want that to be switchable as a parameter.
It might be amenable to further improvement in other ways; for example it currently sorts negative numbers by absolute value, so -1 comes before -2, rather than the other way around. There's also no way to specify a DESC sort order for numbers while retaining ASC lexicographical sort for text. Both of these issues can be fixed with a little more work; I will update the code if/when I get the time.
There are lots of other details to be aware of - including some critical dependencies on the charset and collation that you're using - but I've put them all into a comment block within the SQL code. Please read this carefully before using the function for yourself!
So, here's the code. If you find a bug, or have an improvement I haven't mentioned, please let me know in the comments!
I know this topic is ancient but I think I've found a way to do this:
Scrap that, it sorted the following set incorrectly (It's useless lol):
Final Fantasy 1
Final Fantasy 2
Final Fantasy 5
Final Fantasy 7
Final Fantasy 7: Advent Children
Final Fantasy 12
Final Fantasy 112
FF1
FF2
This function will generate a natural sort key for a given input string.
The function takes an input string, trims leading and trailing whitespace, converts it to lowercase for consistency, and then creates a sort key by concatenating the following components:
SOUNDEX Value: The SOUNDEX value is a phonetic representation of the input string, which helps in capturing similar-sounding strings.
Length of the String: The length of the original string is appended to the sort key. This ensures that strings are sorted by length when their SOUNDEX values are identical.
Original String: Finally, the original string is appended to the sort key. This ensures that within strings of the same length and SOUNDEX value, the lexicographical order is maintained.
This function is particularly useful for scenarios where a natural sorting order is desired, considering both phonetic similarity and string length.
Usage:
The above query will return a sort key based on the described algorithm for the input string 'Example String'.
I have tried several solutions, but actually it is very simple:
Also, there is natsort. It is intended to be part of a Drupal plugin, but it works fine stand-alone.
Here is a simple one if titles only have the version as a number:
Otherwise you can use simple SQL if you use a pattern (this pattern uses a # before the version):
You can use other patterns if needed.
For example if you have a movie "I'm #1" and "I'm #1 part 2" then maybe wrap the version e.g. "Final Fantasy {11}"
A simplified non-UDF version of the best response of @plalx/Richard Toth/Luke Hoggett, which works only for the first integer in the field, is
Other answers are correct, but you may want to know that MariaDB 10.11 LTS has a natural_sort_key() function. The function is documented here.
If you're using PHP you can do the natural sort in PHP.
I hope MySQL will implement natural sorting in a future version, but the feature request (#1588) has been open since 2003, so I wouldn't hold my breath.
So, while I know that you have found a satisfactory answer, I was struggling with this problem for a while, and we'd previously determined that it could not be done reasonably well in SQL and that we were going to have to use JavaScript on a JSON array.
Here's how I solved it just using SQL. Hopefully this is helpful for others:
I had data such as:
I actually didn't "cast" things though I suppose that may also have worked.
I first replaced the parts that were unchanging in the data, in this case "Scene ", and then did a LPAD to line things up. This seems to allow pretty well for the alpha strings to sort properly as well as the numbered ones.
My ORDER BY clause looks like:
Obviously this doesn't help with the original problem which was not so uniform - but I imagine this would probably work for many other related problems, so putting it out there.
MySQL doesn't allow this sort of "natural sorting", so it looks like the best way to get what you're after is to split your data set up as you've described above (separate id field, etc), or failing that, perform a sort based on a non-title element, indexed element in your db (date, inserted id in the db, etc).
Having the db do the sorting for you is almost always going to be quicker than reading large data sets into your programming language of choice and sorting them there. So if you have any control at all over the db schema here, look at adding easily-sorted fields as described above; it'll save you a lot of hassle and maintenance in the long run.
Requests to add a "natural sort" come up from time to time on the MySQL bugs and discussion forums, and many solutions revolve around stripping out specific parts of your data and casting them for the ORDER BY part of the query, e.g.
This sort of solution could just about be made to work on your Final Fantasy example above, but isn't particularly flexible and is unlikely to extend cleanly to a dataset including, say, "Warhammer 40,000" and "James Bond 007", I'm afraid.
Here is a quick solution: