How can I calculate how far a string lies within a defined interval of strings?
Given an interval defined by two strings, [x, y], and a third string s between them, is there a way to calculate what percentage of the whole interval lies between x and s? Preferably in a way that honors the collation (whether case matters, for instance). An approximate answer is fine.
For example, given the strings 'a' and 'c', 'b' is halfway across, in the normal Latin-1 collation, so we'd expect an answer of 50%.
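To make the expected number concrete, here is a minimal sketch of the naive, encoding-only calculation. It assumes each string's UTF-8 bytes can be read as a base-256 fraction, truncated or zero-padded to a fixed width; the names `key_to_fraction`, `interval_position` and the `width` parameter are made up for illustration.

```python
from fractions import Fraction


def key_to_fraction(key: bytes, width: int) -> Fraction:
    """Read the first `width` bytes of `key` as a base-256 fraction in [0, 1)."""
    padded = key[:width].ljust(width, b"\x00")
    return Fraction(int.from_bytes(padded, "big"), 256 ** width)


def interval_position(x: str, s: str, y: str, width: int = 8) -> float:
    """Approximate how far s lies between x and y, using raw UTF-8 bytes only."""
    fx, fs, fy = (key_to_fraction(t.encode("utf-8"), width) for t in (x, s, y))
    if fx == fy:
        return 0.0  # degenerate interval: x and y agree on the first `width` bytes
    return float((fs - fx) / (fy - fx))


print(interval_position("a", "b", "c"))  # 0.5
```

Only the first `width` bytes contribute, so the result is approximate for long strings that share a prefix, and it knows nothing about collation.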
The obvious, and wrong, way is just to trust the encoding to carry the day. Unfortunately that ignores the fact that in a case-insensitive collation, 'B' is in the interval ['a', 'c'] and is equivalent to 'b', even though 'B' (0x42) is encoded as a lower number than 'a' (0x61) and so falls outside the interval by code value. So the encoding doesn't have this information unless we go through some normalization, which might be expensive.
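For illustration, the normalization route might look roughly like the sketch below. It assumes `str.casefold()` is an acceptable stand-in for a case-insensitive collation, which it is not in general; a real collation would supply its own binary sort key. The names `casefold_key`, `fraction_of` and `position` are hypothetical.

```python
from fractions import Fraction
from typing import Callable


def casefold_key(text: str) -> bytes:
    """Stand-in sort key: case-folded UTF-8 bytes (an assumption, not a real collation)."""
    return text.casefold().encode("utf-8")


def fraction_of(key: bytes, width: int) -> Fraction:
    """Read the first `width` bytes of a sort key as a base-256 fraction in [0, 1)."""
    padded = key[:width].ljust(width, b"\x00")
    return Fraction(int.from_bytes(padded, "big"), 256 ** width)


def position(x: str, s: str, y: str,
             sort_key: Callable[[str], bytes] = casefold_key,
             width: int = 8) -> float:
    """Approximate how far s lies between x and y under the given sort key."""
    fx, fs, fy = (fraction_of(sort_key(t), width) for t in (x, s, y))
    if fx == fy:
        return 0.0  # the keys of x and y agree on the first `width` bytes
    return float((fs - fx) / (fy - fx))


print(position("a", "B", "c"))  # 0.5
```

The cost is one key computation per string. Any function that returns bytes whose binary order matches the collation could be passed as `sort_key` instead.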
I'm hoping someone has thought of a better way. It seems like something that should come up quite a bit in database implementation, but I haven't seen anything in the literature, or online, alluding to it. To be fair, it's entirely possible I'm looking in the wrong places and under the wrong names. String distance questions seem to be dominated by edit distance, not this sort of collation-related distance.
It's also possible that the question depends on the encoding, in addition to the collation. In that case, I'm most interested in the various UTF encodings.
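As a small check on the encoding dependence, the sketch below compares two characters by code point and then by their encoded bytes. Python compares `str` values by code point; the two sample characters are picked only because they straddle the boundary between the BMP and the supplementary planes.

```python
# U+FFFF is the last BMP code point; U+10000 is the first supplementary one.
low, high = "\uffff", "\U00010000"

# Python orders str values by code point, so this is the reference order.
assert low < high

# Does the binary order of each encoding agree with code point order?
utf8_agrees = low.encode("utf-8") < high.encode("utf-8")
utf16_agrees = low.encode("utf-16-be") < high.encode("utf-16-be")
utf32_agrees = low.encode("utf-32-be") < high.encode("utf-32-be")

print(utf8_agrees, utf16_agrees, utf32_agrees)  # True False True
```

UTF-16 disagrees here because supplementary characters are stored as surrogate pairs starting at 0xD800, which sort below the code units for U+E000 through U+FFFF.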