Super-fast autocomplete using binary search in a sorted file (300,000 lines)
In my Android app I want to have an input field with autocomplete. There will be about 300,000 items. The best solution seems to be to put the items into a file (on the SD card), one item per line, with each line padded to the same number of characters so that I can seek to a specific line number. When the user types into the text field, I would binary search the file (via RandomAccessFile) and show suggestions.
I want the autocomplete to be super fast (ideally under 100 ms, though I suspect that's impossible). What optimizations can I make?
Update 1:
I will convert the user's input to lowercase English characters (a-z) with spaces, so 'A/b' would be converted to 'a b' and then searched.
Update 2:
I now realize I need one more thing: searching for substrings that start a word.
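The fixed-width-line scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not production Android code: the record width, file layout, and all names (`FixedWidthSearch`, `RECORD_LEN`, `lowerBound`) are invented for the example. Each line is padded to a fixed byte width, so record i starts exactly at byte `i * RECORD_LEN` and a `RandomAccessFile` can seek straight to it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FixedWidthSearch {
    static final int RECORD_LEN = 16; // 15 payload bytes + '\n' (assumed width)

    // Read record i and strip the space padding and newline.
    static String record(RandomAccessFile raf, long i) throws IOException {
        byte[] buf = new byte[RECORD_LEN];
        raf.seek(i * RECORD_LEN); // constant-time jump thanks to fixed width
        raf.readFully(buf);
        return new String(buf, StandardCharsets.US_ASCII).trim();
    }

    // Index of the first record >= prefix (classic lower-bound binary search).
    static long lowerBound(RandomAccessFile raf, String prefix) throws IOException {
        long lo = 0, hi = raf.length() / RECORD_LEN;
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            if (record(raf, mid).compareTo(prefix) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo; // caller scans forward while records start with prefix
    }
}
```

From the index returned by `lowerBound` you read forward, collecting records while `record(raf, i).startsWith(prefix)`; that run is the suggestion list. With 300,000 records this is about 18 record reads per lookup, well within the 100 ms budget.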
Comments (11)
What you're looking for is called a TRIE
http://forums.sun.com/thread.jspa?threadID=5295936
In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
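As a rough illustration of the structure described above (class and method names are made up for this sketch, not from any library), a minimal trie supporting prefix completion might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal trie sketch: each node maps a character to a child,
// and a flag marks where a complete word ends.
public class Trie {
    private final TreeMap<Character, Trie> children = new TreeMap<>();
    private boolean isWord;

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new Trie());
        node.isWord = true;
    }

    // Return all stored words starting with prefix, in sorted order.
    public List<String> complete(String prefix) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>(); // no word has this prefix
        }
        List<String> out = new ArrayList<>();
        node.collect(new StringBuilder(prefix), out);
        return out;
    }

    private void collect(StringBuilder path, List<String> out) {
        if (isWord) out.add(path.toString());
        for (Map.Entry<Character, Trie> e : children.entrySet()) {
            path.append(e.getKey());
            e.getValue().collect(path, out);
            path.setLength(path.length() - 1); // backtrack
        }
    }
}
```

Lookup cost is proportional to the prefix length plus the number of matches, independent of the 300,000-item total; the trade-off is that a naive node-per-character trie over 300,000 items can be memory-hungry on a phone, which is why compact encodings (like the tr13 library mentioned below) exist.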
Why don't you just use a SQLite DB rather than a text file?
I don't think you can do anything better speed-wise than a portable database in your situation.
Trie is the obvious answer, and already mentioned, but additionally the tr13 library might be what you are looking for. It is garbage-collector friendly (a single raw byte array or byte buffer), compact, and definitely fast enough for your case. Keys are typically UTF-8 strings, although they can be any byte sequence; likewise for values, and there is also an alternative using variable-length ints (vints) to get very compact String-to-int lookups (especially for smallish sets of ints).
One strategy could be to narrow the results using the RandomAccessFile and binary search. Then, once the set of possible entries is small enough, load that portion into memory and do an in-memory search.
This will improve performance because, as people type, you can quickly search the same portion of the file that you have already loaded into memory.
Check this out: http://en.wikipedia.org/wiki/Binary_search_algorithm
In a sorted file, binary search gives you a worst case of O(log n).
The next best thing would be some sort of hash mapping, which is O(1), although that is complicated for partial words and would produce a huge mapping table.
Preprocess your possibilities into a search tree ahead of time, instead of doing it at runtime.
A major problem with one-word-per-line storage is that there is no constant-time random access to lines (accessing line X means counting X newline characters from the beginning of the file), so your binary search would suffer.
What you need in this specific (auto-complete) situation is a prefix tree, or a variation of it (combining several nodes into one, or turning sub-trees smaller than a certain size into a plain old sorted list of words).
100ms is plenty of time. The biggest worry would be the display updates, I'd think.
If you're wanting to avoid an actual database, this is easy enough to do with a simple index file in addition to your main file.
You could store the first N bytes (4, maybe?) of the string, plus a file offset into the main file, in an index entry every 32 records or so, and binary search across that. After the binary search gets you close, you can then linearly search through up to 32 records.
You can tune the index frequency from 32 records to whatever makes sense given your average string length and the size of a single read on your media. If you had 512 byte filesystem reads, and 8 byte average strings, then you'd do an index every 64 records, etc. There's not much point in having more than one index record per minimum disk read size.
The index file could be generated easily, and you could then manage the main file with a simple text editor.
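A rough sketch of this sparse-index scheme follows. All names are invented for the example, and a `List<String>` stands in for the sorted main file; real code would store byte offsets in the index and seek a `RandomAccessFile` at the offset instead of indexing into a list.

```java
import java.util.ArrayList;
import java.util.List;

// Sparse index: keep every K-th key (and, on disk, its byte offset),
// binary search that small index, then scan at most ~K records linearly.
public class SparseIndex {
    final int stride;                 // K: records per index entry
    final List<String> records;       // sorted "main file" stand-in
    final List<String> indexKeys = new ArrayList<>();
    final List<Integer> indexPos = new ArrayList<>();

    SparseIndex(List<String> sortedRecords, int stride) {
        this.records = sortedRecords;
        this.stride = stride;
        for (int i = 0; i < sortedRecords.size(); i += stride) {
            indexKeys.add(sortedRecords.get(i));
            indexPos.add(i);          // on disk this would be a byte offset
        }
    }

    // First record starting with prefix, or null if none.
    String firstMatch(String prefix) {
        // Binary search the index for the last entry <= prefix.
        int lo = 0, hi = indexKeys.size() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(prefix) <= 0) {
                start = indexPos.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Linear scan from there; stops within roughly one stride.
        for (int i = start; i < records.size(); i++) {
            String r = records.get(i);
            if (r.startsWith(prefix)) return r;
            if (r.compareTo(prefix) > 0) return null; // passed the spot
        }
        return null;
    }
}
```

The appeal of this design is that the index is tiny (300,000 / 32 ≈ 9,400 entries of 12 bytes each, about 110 KB) and can live entirely in RAM, so the binary search touches no disk at all; only the final short linear scan reads the main file.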
I would suggest seeing whether you can use a standard library for this purpose. Maybe Apache Lucene can be used on Android phones. If so, you can build an index (word prefix -> an id of a word in the Android SQLite database). Here is a discussion about a kind of algorithm Lucene uses.
Old thread, but THIS IS WHAT YOU NEED:
Stringsearch library
I used it for my app 'Wordlist Pro' for Android and it is really speedy.
I could also do something like this (below is a preprocessed file):
If the user inputs something starting with "aa", I would read lines 1-17 and search them sequentially.
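This preprocessing idea can be sketched as follows (names are hypothetical): the map is built once from the sorted word list, recording for each two-letter prefix the first and last line that carries it, so at query time a single lookup tells you exactly which line range to read and scan.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Build a prefix -> {firstLine, lastLine} table (0-based, inclusive)
// in one pass over the sorted word list.
public class PrefixRanges {
    static Map<String, int[]> build(List<String> sortedLines) {
        Map<String, int[]> ranges = new HashMap<>();
        for (int i = 0; i < sortedLines.size(); i++) {
            String line = sortedLines.get(i);
            if (line.length() < 2) continue; // too short for a 2-char prefix
            String key = line.substring(0, 2);
            int[] r = ranges.computeIfAbsent(key, k -> new int[]{i, i});
            r[1] = i; // extend the range to the latest line seen
        }
        return ranges;
    }
}
```

With 26 letters plus space this table has at most 27 x 27 = 729 entries, so it fits trivially in memory; the per-range sequential scan then only touches the handful of lines sharing the typed prefix.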