Interpolation search on strings

Posted on 2024-09-18 10:53:59

For those of you not familiar with interpolation search, it is a method for finding a value in a sorted array that is potentially faster than binary search. You look at the first and last elements and (assuming that the contents of the array are uniformly distributed) linearly interpolate to predict the location.

For example: we have an array of length 100 with array[0]=0 and array[99]=99. If we are looking for 80, it is intuitive to try array[80] rather than array[50], and if the array is close to uniformly distributed, the expected runtime drops to O(log(log(N))).

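For numbers, the location to check is defined by the equation:
low + ((toFind - sortedArray[low]) * (high - low)) / (sortedArray[high] - sortedArray[low])
Here (high - low) is the index span, so toFind = sortedArray[low] maps to low and toFind = sortedArray[high] maps to high.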
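For reference, here is a minimal Python sketch of the numeric version. It uses (high - low) as the index span so that the probe always stays inside the current range, and integer division to get an index. The function name and the -1 "not found" convention are my own choices, not part of the question.

```python
def interpolation_search(sorted_array, to_find):
    """Return an index of to_find in sorted_array, or -1 if it is absent."""
    low, high = 0, len(sorted_array) - 1
    # Only keep going while to_find can still lie inside [low, high].
    while low <= high and sorted_array[low] <= to_find <= sorted_array[high]:
        if sorted_array[high] == sorted_array[low]:
            # All remaining values are equal; avoid dividing by zero.
            return low if sorted_array[low] == to_find else -1
        # Linear interpolation between the two endpoints.
        pos = low + ((to_find - sorted_array[low]) * (high - low)) // (
            sorted_array[high] - sorted_array[low]
        )
        if sorted_array[pos] == to_find:
            return pos
        if sorted_array[pos] < to_find:
            low = pos + 1
        else:
            high = pos - 1
    return -1
```

With the array from the example above, interpolation_search(list(range(100)), 80) probes index 80 first and returns immediately.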

A common example used to show off the intuitive nature of interpolation search is: imagine trying to find the word 'yellow' in a dictionary. You wouldn't use binary search and go to the halfway point. Rather, you would go to the expected location.

Humans can naturally linearly interpolate strings, but I can't figure out how to code it up.
How do we linearly interpolate strings?

Comments (1)

攒一口袋星星 2024-09-25 10:53:59

To find the "distance" between two strings, a simple method would be to look at the first letter that is different between them and assign a numeric value to each, then take the difference.

For example, the distance from "a" to "y" would be 24 and the distance from "y" to "z" would be 1, if each letter were assigned a value equal to its position in the alphabet.
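A rough Python sketch of that first-differing-letter distance might look like the following. It assumes lowercase ASCII words (a-z only), and it treats a missing character (one word being a prefix of the other) as value 0, which is my own convention for illustration.

```python
def letter_value(s, i):
    """Position of s[i] in the alphabet (a=1 ... z=26), or 0 past the end of s."""
    return ord(s[i]) - ord("a") + 1 if i < len(s) else 0


def first_diff_distance(a, b):
    """Signed distance from a to b, taken at the first position where they differ."""
    for i in range(max(len(a), len(b))):
        va, vb = letter_value(a, i), letter_value(b, i)
        if va != vb:
            return vb - va
    return 0  # the strings are identical


print(first_diff_distance("a", "y"))  # 24
print(first_diff_distance("y", "z"))  # 1
```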

A better-performing method would go through a dictionary and weight the letters by how common they are in actual words.
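One possible reading of that weighting, sketched in Python: give each letter a value equal to the fraction of sample words that start with an earlier letter, so that common letters take up more of the 0..1 range. The function name and the tiny sample list are hypothetical; a real word list would be needed for useful weights.

```python
from collections import Counter


def cumulative_first_letter_weights(sample_words):
    """Map each letter to the fraction of sample words starting with an earlier letter."""
    counts = Counter(word[0] for word in sample_words)
    total = sum(counts.values())
    weights, running = {}, 0
    for letter in "abcdefghijklmnopqrstuvwxyz":
        weights[letter] = running / total
        running += counts[letter]  # Counter returns 0 for unseen letters
    return weights


# Hypothetical sample; assumes non-empty lowercase ASCII words.
weights = cumulative_first_letter_weights(["apple", "avocado", "banana", "cherry"])
print(weights["a"], weights["b"], weights["c"])  # 0.0 0.5 0.75
```

The difference between two weights can then stand in for the letter distance when interpolating.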

Another refinement would be to look at two characters - "aa" is farther from "bz" than "az" is from "ba", for example. Going beyond two characters wouldn't buy you much.
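To make the two-character idea concrete, here is a sketch that reads the first few letters of a word as base-27 digits (0 for "no letter", 1 to 26 for a to z) and plugs the resulting number into the usual interpolation formula. The names string_key and interpolation_search_strings, the prefix_len parameter, and the midpoint fallback when both endpoints share a key are all assumptions of mine. It expects lowercase ASCII words.

```python
def string_key(s, prefix_len=2):
    """Read the first prefix_len letters of s as base-27 digits (a=1 ... z=26, missing=0)."""
    key = 0
    for i in range(prefix_len):
        digit = ord(s[i]) - ord("a") + 1 if i < len(s) else 0
        key = key * 27 + digit
    return key


def interpolation_search_strings(words, target, prefix_len=2):
    """Return the index of target in the sorted list words, or -1 if absent."""
    low, high = 0, len(words) - 1
    while low <= high and words[low] <= target <= words[high]:
        lo_key = string_key(words[low], prefix_len)
        hi_key = string_key(words[high], prefix_len)
        if hi_key == lo_key:
            # The keys cannot tell the endpoints apart: fall back to the midpoint.
            pos = (low + high) // 2
        else:
            t_key = string_key(target, prefix_len)
            pos = low + ((t_key - lo_key) * (high - low)) // (hi_key - lo_key)
        if words[pos] == target:
            return pos
        if words[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return -1


words = sorted(["apple", "banana", "cherry", "grape", "lemon",
                "mango", "yak", "yellow", "zebra"])
print(interpolation_search_strings(words, "yellow"))  # 7
```

Because the key never decreases as words increase in dictionary order, the probe stays inside the current range. With this key, "aa" to "bz" is 52 apart while "az" to "ba" is only 2 apart, matching the comparison above.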

The reason this method isn't more popular is that it complicates the binary search algorithm for not a lot of gain. If you were to time it you might even find that standard binary search is faster; what you gain in fewer comparisons you lose in the complexity of determining distances.

Also note that the worst-case performance of this algorithm is worse than a binary search. Consider for example searching for "ae" in the list of "aa","ab","ac","ad","ae","zz" - the outlier "zz" is going to bias the search so that it's always trying the beginning of the search range. It degrades to O(n) under these conditions.
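The degradation is easy to reproduce with a small probe counter. This sketch uses only the first letter as the key (a=1 ... z=26) and assumes non-empty lowercase ASCII words; the function names are mine.

```python
def first_letter_key(s):
    """Alphabet position of the first letter (a=1 ... z=26)."""
    return ord(s[0]) - ord("a") + 1


def count_probes(words, target):
    """Run interpolation search keyed on the first letter; return how many elements it probes."""
    low, high = 0, len(words) - 1
    probes = 0
    while low <= high and words[low] <= target <= words[high]:
        lo_key = first_letter_key(words[low])
        hi_key = first_letter_key(words[high])
        if hi_key == lo_key:
            pos = (low + high) // 2  # keys are equal: fall back to the midpoint
        else:
            pos = low + ((first_letter_key(target) - lo_key) * (high - low)) // (hi_key - lo_key)
        probes += 1
        if words[pos] == target:
            return probes
        if words[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return probes


words = ["aa", "ab", "ac", "ad", "ae", "zz"]
print(count_probes(words, "ae"))  # 5: every probe lands on the first element of the range
```

Each probe only trims one element off the front, so the number of probes grows linearly with the number of "a" words before the outlier.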
