Microsoft SQL 2005 中的自然(人类字母数字)排序
我们有一个大型数据库,在该数据库上有数据库端分页。 速度很快,只需不到一秒的时间即可从数百万条记录中返回 50 行的页面。
用户可以定义自己的排序,基本上是选择排序依据的列。 列是动态的 - 有些具有数值,有些具有日期,有些具有文本。
虽然大多数按预期排序,但文本以一种愚蠢的方式排序。 好吧,我说愚蠢,这对计算机来说有意义,但让用户感到沮丧。
例如,按字符串记录 id 排序会得到类似:
rec1
rec10
rec14
rec2
rec20
rec3
rec4
...等等。
我希望它考虑到数字,所以:
rec1
rec2
rec3
rec4
rec10
rec14
rec20
我无法控制输入(否则我只会以前导 000 进行格式化)并且我不能依赖单一格式 - 有些是“{alpha code}” -{部门代码}-{记录 ID}”。
我知道在 C# 中执行此操作的几种方法,但无法拉下所有记录来对它们进行排序,因为那样会很慢。
有谁知道在 Sql server 中快速应用自然排序的方法吗?
我们正在使用:
ROW_NUMBER() over (order by {field name} asc)
然后我们通过它进行分页。
我们可以添加触发器,尽管我们不会。 他们的所有输入都是参数化的等等,但我无法更改格式 - 如果他们输入“rec2”和“rec10”,他们希望它们像那样以自然顺序返回。
我们拥有针对不同客户采用不同格式的有效用户输入。
一个可能会去rec1,rec2,rec3,...rec100,rec101
而另一个可能会去:grp1rec1,grp1rec2,...grp20rec300,grp20rec301
当我说我们无法控制输入时,我的意思是我们不能强迫用户更改这些标准 - 它们具有类似于 grp1rec1 的值,我无法将其重新格式化为 grp01rec001,因为这将更改用于查找和链接到外部系统的内容。
这些格式差异很大,但通常是字母和数字的混合。
在 C# 中对它们进行排序很容易 - 只需将其分解为 { "grp", 20, "rec", 301 }
,然后依次比较序列值即可。
然而可能有数百万条记录并且数据是分页的,我需要在SQL服务器上完成排序。
SQL Server 按值排序,而不是比较 - 在 C# 中,我可以将值拆分出来进行比较,但在 SQL 中,我需要一些逻辑来(非常快地)获得一致排序的单个值。
@moebius - 你的答案可能有效,但为所有这些文本值添加排序键确实感觉像是一个丑陋的妥协。
We have a large database on which we have DB side pagination. This is quick, returning a page of 50 rows from millions of records in a small fraction of a second.
Users can define their own sort, basically choosing what column to sort by. Columns are dynamic - some have numeric values, some dates and some text.
While most sort as expected text sorts in a dumb way. Well, I say dumb, it makes sense to computers, but frustrates users.
For instance, sorting by a string record id gives something like:
rec1
rec10
rec14
rec2
rec20
rec3
rec4
...and so on.
I want this to take account of the number, so:
rec1
rec2
rec3
rec4
rec10
rec14
rec20
I can't control the input (otherwise I'd just format in leading 000s) and I can't rely on a single format - some are things like "{alpha code}-{dept code}-{rec id}".
I know a few ways to do this in C#, but can't pull down all the records to sort them, as that would be to slow.
Does anyone know a way to quickly apply a natural sort in Sql server?
We're using:
ROW_NUMBER() over (order by {field name} asc)
And then we're paging by that.
We can add triggers, although we wouldn't. All their input is parametrised and the like, but I can't change the format - if they put in "rec2" and "rec10" they expect them to be returned just like that, and in natural order.
We have valid user input that follows different formats for different clients.
One might go rec1, rec2, rec3, ... rec100, rec101
While another might go: grp1rec1, grp1rec2, ... grp20rec300, grp20rec301
When I say we can't control the input I mean that we can't force users to change these standards - they have a value like grp1rec1 and I can't reformat it as grp01rec001, as that would be changing something used for lookups and linking to external systems.
These formats vary a lot, but are often mixtures of letters and numbers.
Sorting these in C# is easy - just break it up into { "grp", 20, "rec", 301 }
and then compare sequence values in turn.
However there may be millions of records and the data is paged, I need the sort to be done on the SQL server.
SQL server sorts by value, not comparison - in C# I can split the values out to compare, but in SQL I need some logic that (very quickly) gets a single value that consistently sorts.
@moebius - your answer might work, but it does feel like an ugly compromise to add a sort-key for all these text values.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(14)
我知道这是一个老问题,但我刚刚遇到它,因为它没有得到公认的答案。
我一直使用与此类似的方法:
唯一常见的问题是您的列不会转换为 VARCHAR(MAX),或者 LEN([Column]) > > 1000(但如果您愿意,您可以将 1000 更改为其他值),但您可以根据您的需要使用这个粗略的想法。
此外,这比正常的 ORDER BY [Column] 性能要差得多,但它确实为您提供了 OP 中要求的结果。
编辑:为了进一步澄清,如果您有十进制值,例如
1
、1.15
和1.5
(它们将排序为{1, 1.5, 1.15}
),因为这不是 OP 中所要求的,但可以通过以下方式轻松完成:结果:
{1, 1.15, 1.5}
而且仍然完全在 SQL 内。 这不会对 IP 地址进行排序,因为您现在进入的是非常具体的数字组合,而不是简单的文本 + 数字。
I know this is an old question but I just came across it and since it's not got an accepted answer.
I have always used ways similar to this:
The only common times that this has issues is if your column won't cast to a VARCHAR(MAX), or if LEN([Column]) > 1000 (but you can change that 1000 to something else if you want), but you can use this rough idea for what you need.
Also this is much worse performance than normal ORDER BY [Column], but it does give you the result asked for in the OP.
Edit: Just to further clarify, this the above will not work if you have decimal values such as having
1
,1.15
and1.5
, (they will sort as{1, 1.5, 1.15}
) as that is not what is asked for in the OP, but that can easily be done by:Result:
{1, 1.15, 1.5}
And still all entirely within SQL. This will not sort IP addresses because you're now getting into very specific number combinations as opposed to simple text + number.
RedFilter 的答案对于索引并不重要的合理大小的数据集非常有用,但是如果您想要索引,则需要进行一些调整。
首先,将函数标记为不进行任何数据访问,并且具有确定性和精确性:
接下来,MSSQL 对索引键大小有 900 字节的限制,因此如果归化值是索引中的唯一值,则它最多不得超过 450长字符。 如果索引包含多列,则返回值必须更小。 两个变化:
在 C# 代码中:
最后,您需要向表中添加计算列,并且必须将其持久化(因为 MSSQL 无法证明
Naturalize
是确定性和精确的),这意味着归化值实际上存储在表中,但仍然自动维护:您现在可以创建索引!
我还对 RedFilter 的代码进行了一些更改:为了清晰起见,使用字符、将重复空格删除合并到主循环中、一旦结果长于限制就退出、设置不带子字符串的最大长度等。结果如下:
RedFilter's answer is great for reasonably sized datasets where indexing is not critical, however if you want an index, several tweaks are required.
First, mark the function as not doing any data access and being deterministic and precise:
Next, MSSQL has a 900 byte limit on the index key size, so if the naturalized value is the only value in the index, it must be at most 450 characters long. If the index includes multiple columns, the return value must be even smaller. Two changes:
and in the C# code:
Finally, you will need to add a computed column to your table, and it must be persisted (because MSSQL cannot prove that
Naturalize
is deterministic and precise), which means the naturalized value is actually stored in the table but is still maintained automatically:You can now create the index!
I've also made a couple of changes to RedFilter's code: using chars for clarity, incorporating duplicate space removal into the main loop, exiting once the result is longer than the limit, setting maximum length without substring etc. Here's the result:
这是为 SQL 2000 编写的解决方案。它可能可以针对较新的 SQL 版本进行改进。
Here's a solution written for SQL 2000. It can probably be improved for newer SQL versions.
我知道这在这一点上有点老了,但在我寻找更好的解决方案时,我遇到了这个问题。 我目前正在使用一个函数来排序。 它可以很好地满足我对以混合字母数字命名的记录(“项目 1”、“项目 10”、“项目 2”等)进行排序的目的,
然后调用
它可以轻松地将简单数据读取的处理时间增加两倍,因此它可能不是完美的解决方案。
I know this is a bit old at this point, but in my search for a better solution, I came across this question. I'm currently using a function to order by. It works fine for my purpose of sorting records which are named with mixed alpha numeric ('item 1', 'item 10', 'item 2', etc)
Then call
It easily triples the processing time for a simple data read, so it may not be the perfect solution.
这是我喜欢的另一个解决方案:
http://www.dreamchain.com/sql-and-alpha -numeric-sort-order/
它不是 Microsoft SQL,但由于我在寻找 Postgres 解决方案时最终来到这里,我认为在这里添加它会对其他人有所帮助。
编辑:这是代码,以防链接消失。
“C”是 postgresql 中的默认排序规则; 您可以指定所需的任何排序规则,或者如果您可以确定表列永远不会分配不确定的排序规则,则可以删除排序规则语句。
用法:
Here is an other solution that I like:
http://www.dreamchain.com/sql-and-alpha-numeric-sort-order/
It's not Microsoft SQL, but since I ended up here when I was searching for a solution for Postgres, I thought adding this here would help others.
EDIT: Here is the code, in case the link goes away.
"C" is the default collation in postgresql; you may specify any collation you desire, or remove the collation statement if you can be certain your table columns will never have a nondeterministic collation assigned.
usage:
只需您排序
Simply you sort by
如果您在从数据库加载数据以在 C# 中进行排序时遇到问题,那么我确信您会对在数据库中以编程方式执行此操作的任何方法感到失望。 当服务器要排序时,它必须像您每次那样计算“感知”顺序。
我建议您在首次插入数据时使用某种 C# 方法添加一个附加列来存储预处理的可排序字符串。 例如,您可能尝试将数字转换为固定宽度范围,因此“xyz1”将变成“xyz00000001”。 然后你就可以使用普通的 SQL Server 排序了。
冒着自吹自擂的风险,我写了一篇 CodeProject 文章,实现了 CodingHorror 文章中提出的问题。 请随意窃取我的代码。
If you're having trouble loading the data from the DB to sort in C#, then I'm sure you'll be disappointed with any approach at doing it programmatically in the DB. When the server is going to sort, it's got to calculate the "perceived" order just as you would have -- every time.
I'd suggest that you add an additional column to store the preprocessed sortable string, using some C# method, when the data is first inserted. You might try to convert the numerics into fixed-width ranges, for example, so "xyz1" would turn into "xyz00000001". Then you could use normal SQL Server sorting.
At the risk of tooting my own horn, I wrote a CodeProject article implementing the problem as posed in the CodingHorror article. Feel free to steal from my code.
对于以下
varchar
数据:这最适合我:
For the following
varchar
data:This worked best for me:
我像往常一样时髦地迟到了。 尽管如此,这是我对一个似乎效果很好的答案的尝试(我会这么说)。 它假定文本末尾带有数字,就像原始示例数据中一样。
首先,这个函数不会很快赢得“漂亮的 SQL”竞赛。
然后像下面这样调用它:
结果:
狐狸
狐狸00
珍0
珍1
珍02
珍3
珍0004
珍15
记录1
记录2
记录3
记录4
记录10
记录14
记录20
袜队
TT01
TT0001
TT002
邮编1
zip1.3
邮编1.32
邮编1.33
I'm fashionably late to the party as usual. Nevertheless, here is my attempt at an answer that seems to work well (I would say that). It assumes text with digits at the end, like in the original example data.
First a function that won't end up winning a "pretty SQL" competition anytime soon.
Then call it like below:
With results:
fox
fox00
Jen0
jen1
jen02
jen3
jen0004
jen15
rec1
rec2
rec3
rec4
rec10
rec14
rec20
sox
TT01
TT0001
TT002
zip1
zip1.3
zip1.32
zip1.33
您可以使用以下代码来解决该问题:
问候,
[电子邮件受保护]
You can use the following code to resolve the problem:
regards,
[email protected]
我刚刚在某处读到一篇关于此类主题的文章。 关键点是:您只需要整数值即可对数据进行排序,而“rec”字符串属于 UI。 您可以将信息拆分为两个字段,例如 alpha 和 num,按 alpha 和 num(分别)排序,然后显示由 alpha + num 组成的字符串。 您可以使用计算列来组成字符串或视图。
希望能帮助到你
I've just read a article somewhere about such a topic. The key point is: you only need the integer value to sort data, while the 'rec' string belongs to the UI. You could split the information in two fields, say alpha and num, sort by alpha and num (separately) and then showing a string composed by alpha + num. You could use a computed column to compose the string, or a view.
Hope it helps
我还是不明白(可能是因为我的英语不好)。
您可以尝试:
但它不适用于数百万条记录。
这就是为什么我建议使用触发器,用填充单独列来人类价值。
而且:
速度慢,微软建议使用
.NET 函数取而代之。
当查询运行时。
I still don't understand (probably because of my poor English).
You could try:
But it won't work for millions of records.
That why I suggested to use trigger which fills separate column with human value.
Moreover:
slow and Microsoft suggest to use
.NET functions instead.
when query runs.
并不完美,但在很多情况下效果很好。
Not perfect, but works well in a lot of cases.
我见过的大多数基于 SQL 的解决方案在数据变得足够复杂时(例如其中超过一两个数字)就会崩溃。 最初,我尝试在 T-SQL 中实现 NaturalSort 函数来满足我的要求(除其他外,处理字符串中的任意数量的数字),但性能方式太慢。
最终,我用 C# 编写了一个标量 CLR 函数,以实现自然排序,即使使用未优化的代码,从 SQL Server 调用它的性能也快得令人眼花缭乱。 它具有以下特征:
代码在这里:
要注册此代码要从 SQL Server 调用它,请在查询分析器中运行以下命令:
然后,您可以像这样使用它:
注意:如果您在 SQL Server 中遇到类似于 的错误.NET Framework 中的用户代码执行被禁用。 启用“clr启用”配置选项。,按照说明此处启用它。 执行此操作之前,请确保考虑安全隐患。 如果您不是数据库管理员,请确保在对服务器配置进行任何更改之前与管理员讨论此问题。
注2:此代码无法正确支持国际化(例如,假设小数点标记为“.”,未针对速度进行优化等。欢迎提出改进建议!
编辑: 将函数重命名为 Naturalize,而不是 NaturalSort,因为它不执行任何实际排序。
Most of the SQL-based solutions I have seen break when the data gets complex enough (e.g. more than one or two numbers in it). Initially I tried implementing a NaturalSort function in T-SQL that met my requirements (among other things, handles an arbitrary number of numbers within the string), but the performance was way too slow.
Ultimately, I wrote a scalar CLR function in C# to allow for a natural sort, and even with unoptimized code the performance calling it from SQL Server is blindingly fast. It has the following characteristics:
The code is here:
To register this so that you can call it from SQL Server, run the following commands in Query Analyzer:
Then, you can use it like so:
Note: If you get an error in SQL Server along the lines of Execution of user code in the .NET Framework is disabled. Enable "clr enabled" configuration option., follow the instructions here to enable it. Make sure you consider the security implications before doing so. If you are not the db admin, make sure you discuss this with your admin before making any changes to the server configuration.
Note2: This code does not properly support internationalization (e.g., assumes the decimal marker is ".", is not optimized for speed, etc. Suggestions on improving it are welcome!
Edit: Renamed the function to Naturalize instead of NaturalSort, since it does not do any actual sorting.