我不可能理解所描述的字符串搜索方法。 uFFFF是什么?
我正在阅读有关在排序的字符串数组中搜索字符串(范围)的内容。
它说:
如果你想查找所有以“h”开头的字符串,你可以运行 对字符串“h”和“h\uFFFF”进行二分搜索。这给出了所有 所有以“h”开头的键的 band 索引。请注意,一个 二分查找可以返回字符串所在的索引,即使 它实际上并不在数组中。
我不明白这一段的任何内容。
h\uFFFF
是什么,它如何帮助/在二分搜索中使用,最后一句话是否也意味着即使这个搜索也是错误的?
请问对理解这里所说的内容有帮助吗?
I am reading something about searching for a (range of) string(s) in a sorted array of strings.
It says:
If you want to find all strings starting with "h", you can run a
binary search for the strings "h" and "h\uFFFF". This gives all the
indexes of the band for all the keys that start with "h". Note that a
binary search can return the index where the string would be even if
it is not actually in the array.
I don't understand anything from this paragraph.
What is h\uFFFF
how does it help/is used in the binary search and does the last sentense also mean that even this search is faulty?
Any help to understand what is being said here please?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
\uFFFF 是 16 位“字母表”中排在最后的“字符”,即在任何有效字母、字符或特殊符号之后。
当您在排序数组中对字符串进行二分搜索时,您会找到可以插入该字符串的位置。当您有多个相同的字符串时,您将获得第一个字符串之前的位置。当您在字符串后面附加“字母表的最后一个字母”时,插入点将位于最后一个相同字符串之后,从而在排序数组中为您提供一系列相同的字符串。
想象一下:假设您不允许在单词中使用字母
Z
。现在您有了一个已排序的字符串数组:如果您搜索
abc
,二分搜索会告诉您可以插入它的第一个位置,即 2。如果您搜索abcZ
>,但是,二分查找将返回 4,因为abcZ
按字母顺序紧接在abc
之后。这让您知道 2(含)和 4(不包括)之间的范围被字符串abc
占据。如果两次搜索返回相同的数字,则您知道该字符串不存在于数组中。在您引用的段落中,
\uFFFF
在我的示例中扮演“禁止字母 Z”的角色。\uFFFF is the "character" that sorts last in the 16-bit "alphabet", i.e. after any valid letter, character, or special symbol.
When you do binary search for a string in a sorted array, you find a place where that string could be inserted. When you have multiple identical strings, you get a location ahead of the first one. When you append "the last letter of the alphabet" after your string, the insertion point will be after the last of the identical strings, hence giving you a range of identical strings in a sorted array.
Imagine this: suppose you are not allowed to use letter
Z
in your words. Now you have a sorted array of strings:If you search for
abc
, binary search tells you the first place where you can insert it, which is 2. If you search forabcZ
, thoug, binary search would return 4, becauseabcZ
comes alphabetically right afterabc
. This lets you know that the range between 2, inclusive, and 4, exclusive, is occupied by the stringabc
. If both searches return the same number, you know that the string is not present in the array.In the paragraph that you quoted,
\uFFFF
plays the role of the "prohibited letter Z" from my example.\uFFFF
是 Java 中最大的字符。由于字符串已排序,因此搜索h
将找到范围的开头,而h\uFFFF
将找到结尾(假设此处为 unicode 字符串),因为无法找到第二个字符大于\uFFFF
。即使它不能完全匹配字符串,搜索也会返回目标所在位置的索引,即使它并不真正存在。更新:
\uFFFF
是 16 位块中最大可能的可排序 unicode 字符,如果您正在使用 32 位块,请使用U+10FFFF
(无论其中的内容是什么)爪哇)。我个人从未在 Java 中使用过 32 位 unicode 块。请参阅5.2.0 规范的第 16.7 节。\uFFFF
is the largest possible character in Java. Since the strings are sorted, searching forh
will find the start of the range whileh\uFFFF
will find the end (assuming unicode strings here) since no second character can be larger than\uFFFF
. Even if it can't match the string exactly, the search will return the index of where the target would be even if it's not really there.update:
\uFFFF
is the largest possible sortable unicode character in 16 bit block, if you are working with 32 bit blocks useU+10FFFF
(whatever that is in Java). I've personally never worked 32 bit unicode blocks in Java. See the section 16.7 of the 5.2.0 spec.Java 中的
\uFFFF
序列表示 Unicode 代码点为 U+FFFF 的字符。但是,代码点根本不编码字符:请参阅以下参考资料:Unicode 技术报告 #16、此 Unicode 字符图表 和 此字符定义。
The
\uFFFF
sequence in Java denotes the character with Unicode codepoint U+FFFF. However, the codepoint does not encode a character at all:See these references: Unicode Technical Report #16, this Unicode character chart and this character definition.
正如其他答案所指定的,搜索
h
将找到以h
开头的字符串范围的开头,而h\uFFFF
将找到数据集中以h
开头的字符串范围的结束(不包括)。最后一句意味着搜索
h\uFFFF
将显示您将在何处插入这样的字符串(如果您的数据中不存在该字符串),这就是为什么它会为您提供范围的唯一结尾。As other answers have specified, searching for
h
will find the beginning of the range of strings starting withh
, whileh\uFFFF
will find the end (exclusive) of the range of strings starting withh
in your set of data.The last sentence means that searching for
h\uFFFF
will show you where you would insert such a string, if it does not exist in your data, which is why it gives you the exclusive end of your range.