实现 T9 文本预测
我内存中有一个 T9 字典(trie/hash_map)。字典包含单词评级对,因此当从字典中选取单词时,其评级会增加,并且单词评级对在单词列表中上升。
假设有一种方法可以从字典中选择一个单词。该方法还执行一些单词评级例程。
在输入中,我有一串数字(1-9,“*”用于更改单词和“”),这些数字是在电话上按下的。
问题:
- 有没有快速解析字符串的算法?
- 哪种数据结构比较好?
UPD:
完整问题文本(问题D)
< a href="http://pastebin.com/9babEi21" rel="nofollow">Hash_map 实现
I have a T9 dictionary in memory (trie/hash_map). The dictionary contains word-rating pairs, so when a word is picked from dictionary, its rating increments, and the word-rating pair goes up in the word list.
Let's say that there is a method to pick a word from the dictionary. That method also does some word rating routine.
In input I have string of numbers (1-9, '*' to change word and ' ') which were pressed on telephone.
Questions:
- Is there any algorithm to parse the string fast?
- Which data structure would be good there?
UPD:
Full problem text (Problem D)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我认为特别有效的一种选择是将 trie 预处理为经过修改的结构,专门针对基于击键的单词预测进行了优化。
直观上,新结构是由可以在任何点按下的可能数字创建的特里树。然后,每个 trie 节点存储可能使用这些数字拼出的单词的优先级队列。然后,您可以通过以下算法预测要使用哪些单词:
该算法需要时间 O(n + Tmax),其中 n 是数字字符串的长度,Tmax 是找到该数字串中最流行的单词所需的时间。给定的前缀。对于二元堆之类的东西,这可能是 O(1),但对于更复杂的堆来说,速度可能会更慢。
要构建此数据结构,您需要按如下方式预处理原始单词/频率列表。对于每个单词,确定与其字母对应的一系列数字。例如,单词“monsoon”将为 6667666,而单词“apple”将为 27753。然后,将该序列插入到数字特里树中,根据需要创建新节点。当到达与该数字串对应的最后一个节点时,将该单词插入到该节点对应的优先级队列中。这个操作的总时间实际上相当不错;给定一个总共有 n 个字母的单词列表,这可以在 O(n Tinsert) 时间内完成,其中 Tinsert 是插入所需的时间一个词进入优先队列。这是因为我们最多需要访问每个字母三次 - 一次将其转换为数字,一次跟踪数字树中的相应边缘,一次将其插入优先级队列。
这种方法还可以轻松地将新单词插入预测器中。每当用户离开数字特里树或选择一个不是第一个单词的单词时,您可以通过在特里树中创建适当的一系列节点来将该单词插入到数字特里树中,然后将该单词插入到正确的优先级中队列。
如果您使用存储指向其最小元素的指针的平衡二叉搜索树制作优先级队列,则可以在 O(n) 时间内找到一系列 n 数字的最快单词,然后列出所有换句话说,每个带有该前缀的单词均摊为 O(1) 时间。您还可以通过这种方式轻松插入或删除新单词。
由于单词存储在 trie 中,因此只需从当前 trie 节点向下走到当前编号给出的子节点,即可在 O(1) 内更新对当前单词的猜测。当用户按下空格键时,您只需重置回数字特里树的根。最后,当用户点击星号时,您可以使用上述技巧来查看给定 trie 节点的下一个最流行的单词。
希望这有帮助!
One option that I think would be particularly efficient would be to preprocess the trie into a modified structure specifically optimized for predicting words based on keystrokes.
Intuitively, the new structure is a trie created out of the possible digits that could be pressed at any point. Each trie node then stores a priority queue of the words that could potentially be spelled out using those digits. You could then predict which words to use via the following algorithm:
This algorithm takes time O(n + Tmax), where n is the length of the digit string and Tmax is the time required to find the most popular word for the given prefix. This could be O(1) for something like a binary heap, but could be slower for more complex heaps.
To build this data structure, you would preprocess the original list of words/frequencies as follows. For each word, determine the series of digits corresponding to its letters. For example, the word "monsoon" would be 6667666, while the word "apple" would be 27753. Then, insert that sequence into the digit trie, creating new nodes as necessary. When you arrive at the final node corresponding to this digit string, insert the word into the corresponding priority queue for that node. The total time for this operation is actually quite good; given a list of words with a total of n letters in it, this can be done in O(n Tinsert) time, where Tinsert is the time required to insert a word into the priority queue. This is because we need to visit each letter at most three times - once to convert it to a digit, once to follow the appropriate edge in the digit trie, and once to insert it into a priority queue.
This approach also makes it easy to insert new words into the predictor quite easily. Whenever the user walks off the digit trie or chooses a word that isn't the number one word, you could just insert that word into the digit trie by creating the appropriate series of nodes in the trie, then inserting the word into the correct priority queue.
If you make your priority queues out of a balanced binary search tree that stores a pointer to its smallest element, you can implement finding the fastest word for a series of n digits in O(n) time, and can then list off all of the other words with that prefix in amortized O(1) time apiece. You could also insert or delete new words very easily this way.
Because the words are stored in a trie, you can in O(1) update your guess of the current word by just walking from the current trie node down into the child node given by the current number. When the user hits the space key, you would just reset back up to the root of the digit trie. Finally, when the user hits star, you could use the above trick to just look at the next most-popular word for the given trie node.
Hope this helps!