我不可能理解所描述的字符串搜索方法。 uFFFF是什么?

发布于 2024-12-29 12:04:46 字数 330 浏览 0 评论 0原文

我正在阅读有关在排序的字符串数组中搜索字符串(范围)的内容。

它说:

如果你想查找所有以“h”开头的字符串,你可以运行 对字符串“h”和“h\uFFFF”进行二分搜索。这给出了所有 所有以“h”开头的键的 band 索引。请注意,一个 二分查找可以返回字符串所在的索引,即使 它实际上并不在数组中。

我不明白这一段的任何内容。

h\uFFFF 是什么,它如何帮助/在二分搜索中使用,最后一句话是否也意味着即使这个搜索也是错误的?

请问对理解这里所说的内容有帮助吗?

I am reading something about searching for a (range of) string(s) in a sorted array of strings.

It says:

If you want to find all strings starting with "h", you can run a
binary search for the strings "h" and "h\uFFFF". This gives all the
indexes of the band for all the keys that start with "h". Note that a
binary search can return the index where the string would be even if
it is not actually in the array.

I don't understand anything from this paragraph.

What is h\uFFFF how does it help/is used in the binary search and does the last sentense also mean that even this search is faulty?

Any help to understand what is being said here please?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

羁客 2025-01-05 12:04:46

\uFFFF 是 16 位“字母表”中排在最后的“字符”,即在任何有效字母、字符或特殊符号之后。

当您在排序数组中对字符串进行二分搜索时,您会找到可以插入该字符串的位置。当您有多个相同的字符串时,您将获得第一个字符串之前的位置。当您在字符串后面附加“字母表的最后一个字母”时,插入点将位于最后一个相同字符串之后,从而在排序数组中为您提供一系列相同的字符串。

想象一下:假设您不允许在单词中使用字母 Z。现在您有了一个已排序的字符串数组:

0   1   2   3   4   5   6
aab abb abc abc abd bcx bdy

如果您搜索 abc,二分搜索会告诉您可以插入它的第一个位置,即 2。如果您搜索 abcZ >,但是,二分查找将返回 4,因为 abcZ 按字母顺序紧接在 abc 之后。这让您知道 2(含)和 4(不包括)之间的范围被字符串 abc 占据。如果两次搜索返回相同的数字,则您知道该字符串不存在于数组中。

在您引用的段落中, \uFFFF 在我的示例中扮演“禁止字母 Z”的角色。

\uFFFF is the "character" that sorts last in the 16-bit "alphabet", i.e. after any valid letter, character, or special symbol.

When you do binary search for a string in a sorted array, you find a place where that string could be inserted. When you have multiple identical strings, you get a location ahead of the first one. When you append "the last letter of the alphabet" after your string, the insertion point will be after the last of the identical strings, hence giving you a range of identical strings in a sorted array.

Imagine this: suppose you are not allowed to use letter Z in your words. Now you have a sorted array of strings:

0   1   2   3   4   5   6
aab abb abc abc abd bcx bdy

If you search for abc, binary search tells you the first place where you can insert it, which is 2. If you search for abcZ, thoug, binary search would return 4, because abcZ comes alphabetically right after abc. This lets you know that the range between 2, inclusive, and 4, exclusive, is occupied by the string abc. If both searches return the same number, you know that the string is not present in the array.

In the paragraph that you quoted, \uFFFF plays the role of the "prohibited letter Z" from my example.

江南月 2025-01-05 12:04:46

\uFFFF 是 Java 中最大的字符。由于字符串已排序,因此搜索 h 将找到范围的开头,而 h\uFFFF 将找到结尾(假设此处为 unicode 字符串),因为无法找到第二个字符大于\uFFFF。即使它不能完全匹配字符串,搜索也会返回目标所在位置的索引,即使它并不真正存在。

更新:\uFFFF 是 16 位块中最大可能的可排序 unicode 字符,如果您正在使用 32 位块,请使用 U+10FFFF(无论其中的内容是什么)爪哇)。我个人从未在 Java 中使用过 32 位 unicode 块。请参阅5.2.0 规范的第 16.7 节。

U+FFFF 和 U+10FFFF。这两个非字符代码点具有
与最大代码单元值相关联的属性
特定的 Unicode 编码形式。 在UTF-16中,U+FFFF是关联的
具有最大的16位代码单元值FFFF。 U+10FFFF 是
与最大合法 UTF-32 32 位代码单元值关联,
10FFFF。该属性呈现这两个非字符代码点
作为哨兵对于内部目的很有用。 例如,它们可能是
用于指示列表的结尾,表示索引中的值
保证高于任何有效字符值,依此类推


\uFFFF is the largest possible character in Java. Since the strings are sorted, searching for h will find the start of the range while h\uFFFF will find the end (assuming unicode strings here) since no second character can be larger than \uFFFF. Even if it can't match the string exactly, the search will return the index of where the target would be even if it's not really there.

update: \uFFFF is the largest possible sortable unicode character in 16 bit block, if you are working with 32 bit blocks use U+10FFFF (whatever that is in Java). I've personally never worked 32 bit unicode blocks in Java. See the section 16.7 of the 5.2.0 spec.

U+FFFF and U+10FFFF. These two noncharacter code points have the
attribute of being associated with the largest code unit values for
particular Unicode encoding forms. In UTF-16, U+FFFF is associated
with the largest 16-bit code unit value, FFFF
. U+10FFFF is
associated with the largest legal UTF-32 32-bit code unit value,
10FFFF . This attribute renders these two noncharacter code points
useful for internal purposes as sentinels. For example, they might be
used to indicate the end of a list, to represent a value in an index
guaranteed to be higher than any valid character value, and so on

霞映澄塘 2025-01-05 12:04:46

Java 中的 \uFFFF 序列表示 Unicode 代码点为 U+FFFF 的字符。但是,代码点根本不编码字符:

U+FFFF 用于表示保证不是字符的数值,用于索引末尾的最终值等用途。

请参阅以下参考资料:Unicode 技术报告 #16此 Unicode 字符图表此字符定义

The \uFFFF sequence in Java denotes the character with Unicode codepoint U+FFFF. However, the codepoint does not encode a character at all:

U+FFFF is used to represent a numeric value that is guaranteed not to be a character, for uses such as the final value at the end of an index.

See these references: Unicode Technical Report #16, this Unicode character chart and this character definition.

掩饰不了的爱 2025-01-05 12:04:46

正如其他答案所指定的,搜索 h 将找到以 h 开头的字符串范围的开头,而 h\uFFFF 将找到数据集中以 h 开头的字符串范围的结束(不包括)。

最后一句意味着搜索 h\uFFFF 将显示您将在何处插入这样的字符串(如果您的数据中不存在该字符串),这就是为什么它会为您提供范围的唯一结尾。

As other answers have specified, searching for h will find the beginning of the range of strings starting with h, while h\uFFFF will find the end (exclusive) of the range of strings starting with h in your set of data.

The last sentence means that searching for h\uFFFF will show you where you would insert such a string, if it does not exist in your data, which is why it gives you the exclusive end of your range.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文