Getting a numeric/normalised representation of a string to help 'natural sort' titles in a DB
I'd like to store an additional column in a table as a 'sort value': a numeric representation of the title column, such that the order of these values matches the strings' natural alphabetical sort order. I.e., I could retrieve rows ordered by the sort value and they would come back in natural sort order, and when I insert a new row, I could generate its numeric value and know that, relative to the others, it represents the string's position in an alphabetical ordering, accurate to the first X letters or so.
A couple of reasons for this: firstly, I would like a more natural ordering than the plain ordering offered by a DB server, one where things like "The" and "A" and punctuation are ignored at the start, and numbers are treated 'naturally'.
Secondly, this is for an index with a lot of permutations - it will save space, and perhaps time when traversing an index with many rows.
What I am after is the algorithm to translate the string into that numeric value, or, I suppose, just into a normalised string value.
I am using PHP and MySQL.
I'm afraid that "pull everything from the DB and sort in PHP using natcasesort()" is not a solution for this particular situation, as I'd like to retrieve rows (using order by and group by) in sorted order before they get to a join or limit clause. Thanks.
Edit:
Thanks for the answers so far. It has just occurred to me that the fact that my application uses UTF-8 is quite relevant. With that said, I think representing the initial part of a string in a packed/numeric form is a stretch in practice; maybe just some sort of normalised form (everything case-folded, numbers zero-padded, and as many characters as possible normalised to their root, i.e. ã to a) would be appropriate.
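The zero-padding idea mentioned in the edit can be sketched as follows. This is a minimal illustration in Python rather than PHP, purely for brevity; the function name zero_padded_key and the padding width of 10 are assumptions made for the example, not something stated in the question. Digit runs longer than the chosen width are left as they are and would sort incorrectly.

```python
import re

def zero_padded_key(title: str, width: int = 10) -> str:
    """Case-fold the title and left-pad every ASCII digit run with zeros.

    `width` is an assumed maximum number length; longer runs are not padded.
    """
    folded = title.casefold()
    # Pad each digit run so plain string comparison orders numbers numerically.
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), folded)

titles = ["101 Dalmatians", "8 Mile", "2001: A Space Odyssey"]
print(sorted(titles, key=zero_padded_key))
# prints ['8 Mile', '101 Dalmatians', '2001: A Space Odyssey']
```

In a PHP/MySQL setting, the equivalent key would presumably be computed at insert time and stored in its own indexed column, so that ORDER BY can use it directly.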
Comments (2)
The part "accurate to the first X letters or so" is crucial, since a completely accurate assignment of numbers is impossible. To see this, suppose for concreteness that your title column is varchar(50) and you want to use a 32-bit integer sort_order column. Then, assuming 255 possible values per character, you could store (255^51 - 1)/254 different titles (every string of 0 to 50 characters), each of which would require a different sort_order value, but there are only 2^32 different sort_order values to go around. Even if you said you would never add more than 2^32 rows, you would need to know in advance which titles they would have in order to come up with a scheme that avoided having to reassign all sort_order values every time a row was inserted.
Although a "theoretically perfect" solution is impossible, it's still possible to get a practical "approximate" system that should work with perfect accuracy for up to many millions of rows. The simplest way would be to use a floating-point type. Initially, list out the rows in sorted order and assign the first row a sort_order value of 1.0, the second a value of 2.0 and so on. Then, whenever a row is inserted, set its sort_order to the midpoint (that is, the average) of the rows on either side in sorted order. If the newly added row comes before (or after) all existing rows, just set it to 1 less than (or more than) the previous minimum (or maximum) sort_order value.
Each midpoint insert halves the gap between its neighbours, so repeated inserts at the same spot eventually use up the precision of the floating-point type. It's therefore a good idea to reassign numbers from scratch (as in the initial build step) to "smooth out" the values periodically, or after a large number of updates. Particularly if the table starts small and then gets big, you may find some "bunching" of numbers at the ends.
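The midpoint scheme can be sketched roughly as below, in Python for brevity. The in-memory list rows stands in for the table, the function name insert_with_midpoint is hypothetical, and plain string comparison stands in for whatever natural ordering is actually wanted; in a real database the two neighbouring rows would have to be found with a query instead.

```python
import bisect

def insert_with_midpoint(rows, title):
    """Insert `title` into `rows`, a list of (title, sort_order) sorted by title.

    Returns the sort_order assigned to the new row.
    """
    titles = [t for t, _ in rows]
    i = bisect.bisect_left(titles, title)   # position in sorted order
    if not rows:
        order = 1.0                          # first row ever
    elif i == 0:
        order = rows[0][1] - 1.0             # before every existing row
    elif i == len(rows):
        order = rows[-1][1] + 1.0            # after every existing row
    else:
        # Midpoint (average) of the two neighbours in sorted order.
        order = (rows[i - 1][1] + rows[i][1]) / 2.0
    rows.insert(i, (title, order))
    return order

rows = [("alpha", 1.0), ("delta", 2.0)]
print(insert_with_midpoint(rows, "charlie"))  # midpoint of 1.0 and 2.0
print(rows)
```

Because each insert between the same two neighbours halves the remaining gap, a periodic renumbering pass (assigning 1.0, 2.0, ... again in sorted order) restores the spacing.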
Thanks for the answers so far. I just wanted to update people with the solution I'm going with. I've taken an approach that is different from that which I envisaged in my question.
To recap, I wanted to store a representation of each string such that, when retrieved in binary order, whatever I stored for "8 Mile" would sort before whatever I stored for "101 Dalmatians".
For each number in the string (a number being essentially a run of digits), I insert a digit in front of it that says how many digits the number has.
So "8" becomes "18" and "101" becomes "3101". This adds some redundancy, in that you use more digits than you need and some values can never occur, but the encoded numbers now have the property that a binary sort puts them in numerical order. Previously "101" would have sorted before "8", which was undesired; with the extra digit, "18" sorts before "3101".
Note: if the number is 9 or more digits long, I put two digits in front of it: a 9, followed by the number of digits minus 9, followed by the number itself. (The 9 has to come first so that these numbers still sort after every shorter one.) This allows for numbers of up to 18 digits: good enough for me.
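A minimal sketch of this length-prefix encoding, written in Python for brevity (the logic should port directly to PHP's preg_replace_callback). The function names are hypothetical. For numbers of 9 to 18 digits the sketch writes the 9 first and then the length minus 9; that ordering is an assumption on my reading, chosen because it is what keeps a 9-digit number after an 8-digit one under byte-wise comparison.

```python
import re

def length_prefix(match):
    """Prefix one digit run with its length so byte order equals numeric order."""
    digits = match.group()
    n = len(digits)
    if n > 18:
        raise ValueError("numbers longer than 18 digits are not supported")
    if n < 9:
        return f"{n}{digits}"        # e.g. "8" -> "18", "101" -> "3101"
    return f"9{n - 9}{digits}"       # assumed order: 9, then (length - 9)

def encode_numbers(text: str) -> str:
    """Apply the length prefix to every ASCII digit run in `text`."""
    return re.sub(r"[0-9]+", length_prefix, text)

print(encode_numbers("8 mile"))          # prints 18 mile
print(encode_numbers("101 dalmatians"))  # prints 3101 dalmatians
```

Leading zeros are kept as written here, so "007" would be encoded as a three-digit number; whether to strip them first is a separate choice the text does not address.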
I'm also normalising the string in other ways: everything is converted to lower case, Unicode characters are translated to their closest ASCII equivalent, and 'a', 'an' and 'the' are stripped if they are the first word.
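These normalisation steps might look something like the following Python sketch. The function name normalise_title is hypothetical, and the ASCII step is only an approximation of "closest ASCII equivalent": it decomposes accented letters and drops the accents, but characters with no decomposition (such as ø or ß) are simply discarded rather than transliterated.

```python
import re
import unicodedata

def normalise_title(title: str) -> str:
    """Lower-case, reduce to ASCII, and strip a leading 'a', 'an' or 'the'."""
    # Split accented characters into base letter + combining mark (ã -> a + ~).
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop everything that is not ASCII, which removes the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_only.lower()
    # Remove one leading article, only when it is a whole first word.
    return re.sub(r"^\s*(?:a|an|the)\s+", "", lowered)

print(normalise_title("The Matrix"))  # prints matrix
print(normalise_title("São Paulo"))   # prints sao paulo
```

The article pattern requires whitespace after the article, so a title such as "Another Day" is left intact.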
I gave up on making the string into one big numeric value; it is still a string, it's just that it's not designed for humans to read.