Data Structure for Pattern Matching on Big Data
Problem Background
I have a finite vocabulary of, say, 10 symbols [A-J]. What these symbols mean is not relevant to the question. They could be DNA bases, phonemes, words, etc.
An item is a sequence of symbols. In this problem, all items are of the same length (say 6). E.g.
A C B A D J
I have a large (5M) table that contains counts of all items of length 6 sampled from some known data. E.g.
A C B A D J 55
B C B I C F 923
A B C D E G 478
Given a new sequence with one unknown symbol, my task is to guess the symbol. In the following example, the missing symbol is ?.
B C B ? C F
A simple solution to guess ? is to look into my table and find the item with the largest count that fits the pattern B C B ? C F.
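For concreteness, this naive lookup is a full scan of the table. A minimal sketch in Java (the Map-based table and unspaced item strings like "BCB?CF" are assumptions for illustration):

```java
import java.util.Map;

class NaiveGuess {
    /** pattern contains exactly one '?', e.g. "BCB?CF"; table maps items to counts. */
    static Character guess(Map<String, Long> table, String pattern) {
        int pos = pattern.indexOf('?');
        char best = 0;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : table.entrySet()) {
            String item = e.getKey();
            // the item matches if it agrees with the pattern everywhere except at pos
            boolean matches = item.length() == pattern.length()
                    && item.regionMatches(0, pattern, 0, pos)
                    && item.regionMatches(pos + 1, pattern, pos + 1,
                                          pattern.length() - pos - 1);
            if (matches && e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = item.charAt(pos);
            }
        }
        return bestCount < 0 ? null : best; // null if nothing in the table fits
    }
}
```

This is exactly the O(n) scan that the caveats below rule out, which motivates the questions that follow.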
Questions
What is a good data structure to store my item-frequency table so that I handle the space-time trade-off reasonably efficiently? I prefer to use less memory if the computation at query time is reasonable. (I will have many such tables, so the 5M number is just an approximation.)
What are some implementation details that can make a big difference in processing speed?
Things I have thought of:
Make a string out of every sequence and use regexes to match. Caveats: (1) an O(n) scan is unacceptable; (2) regexes are slow; (3) strings (in Java at least) are bloated.
Let Lucene handle the indexing. Turn off tf-idf scoring. Use phrase search. Potentially use the count values for scoring so that Lucene takes care of the sorting too.
Use prefix and suffix tries to index each item.
Use a DB (likely in-memory) with the entire data in one column or in separate columns to handle the search.
Updates
- In my actual application, I will be working with sequences of lengths 5, 6, 7, 8, 9, and 10, stored separately. I simplified the problem by restricting it to a fixed length. Hence the constraint/preference for a solution that uses less memory.
- My vocabulary size can be assumed to be under 20.
Answers (5)
The trie-based option seems to be the best one: with the number of occurrences stored on the leaves, you can easily design a function that returns all possible strings with one missing character in O(log n) time, and then you just iterate over this small set of strings, searching for the maximum count. If you use chars from A to Z, there will be at most 26 such strings, so the iteration will not take long.
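A sketch of that idea in Java (the node layout and names are my own illustration of the answer, not its code; the single '?' is handled by branching over every child at that depth):

```java
import java.util.HashMap;
import java.util.Map;

class CountTrie {
    private final Map<Character, CountTrie> children = new HashMap<>();
    private long count; // set only on the node that ends a full-length item

    void add(String item, long c) {
        CountTrie node = this;
        for (char ch : item.toCharArray())
            node = node.children.computeIfAbsent(ch, k -> new CountTrie());
        node.count = c;
    }

    /** pattern contains exactly one '?'; returns the best symbol, or null if no item fits. */
    Character guess(String pattern) {
        int pos = pattern.indexOf('?');
        CountTrie node = this;
        for (int i = 0; i < pos && node != null; i++)
            node = node.children.get(pattern.charAt(i));     // walk the known prefix
        if (node == null) return null;
        Character best = null;
        long bestCount = -1;
        for (Map.Entry<Character, CountTrie> e : node.children.entrySet()) {
            CountTrie n = e.getValue();                      // branch over each candidate symbol
            for (int i = pos + 1; i < pattern.length() && n != null; i++)
                n = n.children.get(pattern.charAt(i));       // then walk the known suffix
            if (n != null && n.count > bestCount) {
                bestCount = n.count;
                best = e.getKey();
            }
        }
        return best;
    }
}
```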
AFAIK, Lucene uses such a mechanism internally for its wildcard search, so you can concatenate your chars, index them with KeywordAnalyzer (to omit stemming), and then search for "ACB?DJ". The only restriction here is that Lucene cannot handle searches with "?" as the first character, but you can bypass it by adding one extra char at the beginning (just a trick to get past Lucene's checks) or by having one more index for reversed words (which will greatly improve performance for words with a wildcard as the first char).

And finally, if you have to calculate the number of occurrences first anyway, you can use some machine-learning schemes such as decision trees to handle all the work. There have been cases where decision trees were used to compress a database and speed up search, so you can do the same. Use lines as instances, positions of chars as attributes, and the chars themselves as attribute values. Then run some algorithm like C4.5 (you can use Weka's implementation, called J48) with minimal pruning and run classification - the algorithm will do the rest!
Based on the comment that there will only be 1 unknown, you can do the following:

Put your data in a hashtable. When you need to look up a pattern, generate all the wildcard combinations; since your vocabulary is limited, this means looking up at most 20 patterns. This sounds like a hack, but if you consider the performance implications of other methods, it is hard to beat. The hash table lookup is O(1), and 20 lookups are O(1) too.
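A minimal sketch of this approach in Java (the vocabulary constant and class shape are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

class WildcardLookup {
    private static final char[] VOCAB = "ABCDEFGHIJ".toCharArray(); // assumed alphabet
    private final Map<String, Long> counts = new HashMap<>();       // item -> frequency

    void add(String item, long count) {
        counts.put(item, count);
    }

    /** pattern contains exactly one '?'; tries every symbol - at most |vocab| O(1) lookups. */
    Character guess(String pattern) {
        int pos = pattern.indexOf('?');
        Character best = null;
        long bestCount = -1;
        for (char c : VOCAB) {
            String candidate = pattern.substring(0, pos) + c + pattern.substring(pos + 1);
            Long n = counts.get(candidate);
            if (n != null && n > bestCount) {
                bestCount = n;
                best = c;
            }
        }
        return best;
    }
}
```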
This method is not advisable if the number of wildcards could increase, although it may still perform well for 2 or 3.
A double array trie would also work and may reduce the amount of space to store your strings, but the performance would suffer.
In order to uniquely characterize a new sequence, two pieces of information are necessary: the sequence (string) of five known symbols, and the position of the unknown symbol. If your alphabet has 10 symbols then there can be no more than 10^5 = 100000 unique five-symbol strings.
Depending on your memory resources, this may be small enough to fit into a hashtable whose entries provide a lookup structure to find the best (position, symbol) combination. For example:
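(A sketch in Java of one way to read this; the heap-per-position layout and all names are illustrative assumptions, not the answerer's code.)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class PositionIndex {
    record Entry(char symbol, long count) {}

    private static final int LEN = 6; // assumed item length

    // key: the five known symbols concatenated; value: one max-heap per position of the unknown
    private final Map<String, List<PriorityQueue<Entry>>> index = new HashMap<>();

    void add(String item, long count) {
        for (int pos = 0; pos < LEN; pos++) {
            String key = item.substring(0, pos) + item.substring(pos + 1);
            List<PriorityQueue<Entry>> heaps = index.computeIfAbsent(key, k -> {
                List<PriorityQueue<Entry>> hs = new ArrayList<>();
                for (int i = 0; i < LEN; i++)
                    hs.add(new PriorityQueue<>(
                            Comparator.comparingLong(Entry::count).reversed()));
                return hs;
            });
            heaps.get(pos).add(new Entry(item.charAt(pos), count));
        }
    }

    /** pattern contains exactly one '?'; peek the heap for that position. */
    Character guess(String pattern) {
        int pos = pattern.indexOf('?');
        String key = pattern.substring(0, pos) + pattern.substring(pos + 1);
        List<PriorityQueue<Entry>> heaps = index.get(key);
        if (heaps == null || heaps.get(pos).isEmpty()) return null;
        return heaps.get(pos).peek().symbol();
    }
}
```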
This should allow for a fairly efficient lookup for a new sequence: concatenate the known symbols, look up the sequence in the hashtable, find the position of the unknown character, and then return the symbol that is at the top of the corresponding heap.
If you can guarantee the lookup structure will be stable (no new input) before any lookup is performed, then you can squeeze a bit more efficiency out of it by replacing each of the position-indexed heaps with the single symbol that would have been at the top of the heap. (The heap structure is only necessary during the lookup phase if new information will come in that may change the symbol frequencies.)
A DB would be an easy solution, but another solution is a tree where each node branches on one character and each leaf contains an array of possible results and counts. Then it would take only 5 steps in the tree to match one string, but creating the tree takes N*C time, where N is the number of items and C is the number of characters in each item. A wildcard is just a node in the tree that consumes one character from the input while keeping the possible results intact.
I was among the "everyone missing the obvious" here.
Just use any quick key/value lookup that is available to you. And look up all of your possible values. It is a small set, and won't take long. Anything else short of storing your data 6 times will be slower.
If you had a large possible vocabulary, then my previous answer would be appropriate.
Here is my old (and bad) answer.
I would stick them in a database with multiple concatenated indexes. How many is up to you.
At a minimum I would have 2: an index on (col1, col2, col3, col4, col5, col6) and one on (col4, col5, col6, col1, col2, col3). This would mean that, no matter which column was missing, there would be a way to get your data while looking through at most 1/1000 of the records. If you wish, you could instead index (col1, col2, col3, col4, col5, col6), (col3, col4, col5, col6, col1, col2) and (col5, col6, col1, col2, col3, col4) to limit your search to 1/10000 of the data. This uses half again as much memory but is 10 times faster. (Warning: I will not guarantee that MySQL will successfully figure out which index it should use. I'd hope that other databases would get it right, but I haven't tested them.)

If you don't want to use a database, you can use balanced binary trees exactly as I suggested using the indexes above. For any given search, pick the tree that has the missing element as deep as possible. Do a range search. Filter the returned data for just the rows of interest. This is, in fact, exactly what a good database should do with those indexes.
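An in-memory sketch of that two-index idea in Java, using sorted maps in place of database indexes (the rotation scheme and all names are assumptions; the length is fixed at 6 as in the question):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class TwoIndexTable {
    // "index" 1: items in natural order (col1..col6);
    // "index" 2: rotated so that col4..col6 come first
    private final NavigableMap<String, Long> byPrefix = new TreeMap<>();
    private final NavigableMap<String, Long> byRotated = new TreeMap<>();

    private static String rotate(String s) {
        return s.substring(3) + s.substring(0, 3);
    }

    void add(String item, long count) { // item is a 6-symbol string
        byPrefix.put(item, count);
        byRotated.put(rotate(item), count);
    }

    /** pattern contains exactly one '?'; use the index that buries the wildcard deepest. */
    Character guess(String pattern) {
        int pos = pattern.indexOf('?');
        boolean natural = pos >= 3;                  // wildcard in the back half
        NavigableMap<String, Long> idx = natural ? byPrefix : byRotated;
        String p = natural ? pattern : rotate(pattern);
        int wpos = natural ? pos : pos + 3;          // wildcard position within this index
        String prefix = p.substring(0, wpos);        // at least 3 known leading symbols
        Character best = null;
        long bestCount = -1;
        // range search over all keys sharing the known prefix, then filter the suffix
        for (Map.Entry<String, Long> e
                : idx.subMap(prefix, prefix + Character.MAX_VALUE).entrySet()) {
            String k = e.getKey();
            if (k.regionMatches(wpos + 1, p, wpos + 1, p.length() - wpos - 1)
                    && e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = k.charAt(wpos);
            }
        }
        return best;
    }
}
```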
to limit your search to 1/10000 of the data. This uses half again as much memory, but is 10 times faster. (Warning, I will not guarantee that MySQL will successfully figure out which index it should use. I'd hope that other databases would get it right, but haven't tested it.)If you wished to not use a database you could use balanced binary trees exactly as I was suggesting using indexes above. For any given search pick the tree that has the missing element as deep as possible. Do a range search. Filter the returned data for just the rows of interest. This is, in fact, exactly what a good database should do above with those indexes.