Data Structure for Pattern Matching on Big Data
Problem Background
I have a finite vocabulary of, say, 10 symbols [A-J]. What these symbols mean is not relevant to the question. They could be DNA bases, phonemes, words, etc.
An item is a sequence of symbols. In this problem, all items are of the same length (say 6). E.g.
A C B A D J
I have a large (5M) table that contains counts of all items of length 6 sampled from some known data. E.g.
A C B A D J 55
B C B I C F 923
A B C D E G 478
Given a new sequence with one unknown symbol, my task is to guess the symbol. In the following example, the missing symbol is ?.
B C B ? C F
A simple solution to guess ? is to look into my table and find the item with the largest count that fits the pattern B C B ? C F.
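For concreteness, this naive lookup is a full scan of the table. A minimal sketch in Java (the Map-based table and unspaced item strings like "BCB?CF" are assumptions for illustration):

```java
import java.util.Map;

class NaiveGuess {
    /** pattern contains exactly one '?', e.g. "BCB?CF"; table maps items to counts. */
    static Character guess(Map<String, Long> table, String pattern) {
        int pos = pattern.indexOf('?');
        char best = 0;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : table.entrySet()) {
            String item = e.getKey();
            // the item matches if it agrees with the pattern everywhere except at pos
            boolean matches = item.length() == pattern.length()
                    && item.regionMatches(0, pattern, 0, pos)
                    && item.regionMatches(pos + 1, pattern, pos + 1,
                                          pattern.length() - pos - 1);
            if (matches && e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = item.charAt(pos);
            }
        }
        return bestCount < 0 ? null : best; // null if nothing in the table fits
    }
}
```

This is exactly the O(n) scan that the caveats below rule out, which motivates the questions that follow.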
Questions
What is a good data structure to store my item-frequency table so that I handle the space-time trade-off reasonably efficiently? I prefer to use less memory if the computation at query time is reasonable. (I will have many such tables, so the 5M number is just an approximation.)
What are some implementation details that can make a big difference in processing speed?
Things I have thought of:
Make a string out of every sequence and use regexes to match. Caveats: (1) an O(n) scan is unacceptable; (2) regexes are slow; (3) strings (in Java at least) are bloated.
Let Lucene handle the indexing. Turn off tf-idf scoring. Use phrase search. Potentially use the count values for scoring so that Lucene takes care of the sorting too.
Use prefix and suffix tries to index each item.
Use a DB (likely in-memory) with the entire data in one column or in separate columns to handle the search.
Updates
- In my actual application, I will be working with sequences of lengths 5, 6, 7, 8, 9, and 10, stored separately. I simplified the problem by restricting it to a fixed length. Hence the constraint/preference for a solution that uses less memory.
- My vocabulary size can be assumed to be under 20.
Answers (5)
The trie-based option seems to be the best one: with the number of occurrences stored on the leaves, you can easily design a function that returns all possible strings with one missing character in O(log n) time, and then you just iterate over this small set of strings, searching for the maximum count. If you use chars from A to Z, there will be at most 26 such strings, so the iteration will not take long.
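A sketch of that idea in Java (the node layout and names are my own illustration of the answer, not its code; the single '?' is handled by branching over every child at that depth):

```java
import java.util.HashMap;
import java.util.Map;

class CountTrie {
    private final Map<Character, CountTrie> children = new HashMap<>();
    private long count; // set only on the node that ends a full-length item

    void add(String item, long c) {
        CountTrie node = this;
        for (char ch : item.toCharArray())
            node = node.children.computeIfAbsent(ch, k -> new CountTrie());
        node.count = c;
    }

    /** pattern contains exactly one '?'; returns the best symbol, or null if no item fits. */
    Character guess(String pattern) {
        int pos = pattern.indexOf('?');
        CountTrie node = this;
        for (int i = 0; i < pos && node != null; i++)
            node = node.children.get(pattern.charAt(i));     // walk the known prefix
        if (node == null) return null;
        Character best = null;
        long bestCount = -1;
        for (Map.Entry<Character, CountTrie> e : node.children.entrySet()) {
            CountTrie n = e.getValue();                      // branch over each candidate symbol
            for (int i = pos + 1; i < pattern.length() && n != null; i++)
                n = n.children.get(pattern.charAt(i));       // then walk the known suffix
            if (n != null && n.count > bestCount) {
                bestCount = n.count;
                best = e.getKey();
            }
        }
        return best;
    }
}
```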
AFAIK, Lucene uses such a mechanism internally for its wildcard search, so you can concatenate your chars, index them with KeywordAnalyzer (to omit stemming), and then search for "ACB?DJ". The only restriction here is that Lucene cannot handle searches with "?" as the first character, but you can bypass it by adding one extra char at the beginning (just a trick to get past Lucene's checks) or by having one more index for reversed words (which will greatly improve performance for words with a wildcard as the first char).

And finally, if you have to calculate the number of occurrences first anyway, you can use some machine-learning schemes such as decision trees to handle all the work. There have been cases where decision trees were used to compress a database and speed up search, so you can do the same. Use lines as instances, positions of chars as attributes, and the chars themselves as attribute values. Then run some algorithm like C4.5 (you can use Weka's implementation, called J48) with minimal pruning and run classification - the algorithm will do the rest!
Based on the comment that there will only be 1 unknown, you can do the following:

Put your data in a hashtable. When you need to look up a pattern, generate all the wildcard combinations; since your vocabulary is limited, this means looking up at most 20 patterns. This sounds like a hack, but if you consider the performance implications of other methods, it is hard to beat. The hash table lookup is O(1), and 20 lookups are O(1) too.
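A minimal sketch of this approach in Java (the vocabulary constant and class shape are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

class WildcardLookup {
    private static final char[] VOCAB = "ABCDEFGHIJ".toCharArray(); // assumed alphabet
    private final Map<String, Long> counts = new HashMap<>();       // item -> frequency

    void add(String item, long count) {
        counts.put(item, count);
    }

    /** pattern contains exactly one '?'; tries every symbol - at most |vocab| O(1) lookups. */
    Character guess(String pattern) {
        int pos = pattern.indexOf('?');
        Character best = null;
        long bestCount = -1;
        for (char c : VOCAB) {
            String candidate = pattern.substring(0, pos) + c + pattern.substring(pos + 1);
            Long n = counts.get(candidate);
            if (n != null && n > bestCount) {
                bestCount = n;
                best = c;
            }
        }
        return best;
    }
}
```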
This method is not advisable if the number of wildcards could increase, although it may still perform well for 2 or 3.
A double array trie would also work and may reduce the amount of space to store your strings, but the performance would suffer.
In order to uniquely characterize a new sequence, two pieces of information are necessary: the sequence (string) of five known symbols, and the position of the unknown symbol. If your alphabet has 10 symbols then there can be no more than 10^5 = 100000 unique five-symbol strings.
Depending on your memory resources, this may be small enough to fit into a hashtable whose entries provide a lookup structure to find the best (position, symbol) combination. For example:
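(A sketch in Java of one way to read this; the heap-per-position layout and all names are illustrative assumptions, not the answerer's code.)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class PositionIndex {
    record Entry(char symbol, long count) {}

    private static final int LEN = 6; // assumed item length

    // key: the five known symbols concatenated; value: one max-heap per position of the unknown
    private final Map<String, List<PriorityQueue<Entry>>> index = new HashMap<>();

    void add(String item, long count) {
        for (int pos = 0; pos < LEN; pos++) {
            String key = item.substring(0, pos) + item.substring(pos + 1);
            List<PriorityQueue<Entry>> heaps = index.computeIfAbsent(key, k -> {
                List<PriorityQueue<Entry>> hs = new ArrayList<>();
                for (int i = 0; i < LEN; i++)
                    hs.add(new PriorityQueue<>(
                            Comparator.comparingLong(Entry::count).reversed()));
                return hs;
            });
            heaps.get(pos).add(new Entry(item.charAt(pos), count));
        }
    }

    /** pattern contains exactly one '?'; peek the heap for that position. */
    Character guess(String pattern) {
        int pos = pattern.indexOf('?');
        String key = pattern.substring(0, pos) + pattern.substring(pos + 1);
        List<PriorityQueue<Entry>> heaps = index.get(key);
        if (heaps == null || heaps.get(pos).isEmpty()) return null;
        return heaps.get(pos).peek().symbol();
    }
}
```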
This should allow for a fairly efficient lookup for a new sequence: concatenate the known symbols, look up the sequence in the hashtable, find the position of the unknown character, and then return the symbol that is at the top of the corresponding heap.
If you can guarantee the lookup structure will be stable (no new input) before any lookup is performed, then you can squeeze a bit more efficiency out of it by replacing each of the position-indexed heaps with the single symbol that would have been at the top of the heap. (The heap structure is only necessary during the lookup phase if new information will come in that may change the symbol frequencies.)
A DB would be an easy solution, but another solution is a tree where each node branches on one character and each leaf contains an array of possible results and counts. Then it would take only 5 steps in the tree to match one string, but creating the tree takes N*C time, where N is the number of items and C is the number of characters in each item. A wildcard is just a node in the tree that consumes one character from the input while keeping the possible results intact.
I was among the "everyone missing the obvious" here.
Just use any quick key/value lookup that is available to you. And look up all of your possible values. It is a small set, and won't take long. Anything else short of storing your data 6 times will be slower.
If you had a large possible vocabulary, then my previous answer would be appropriate.
Here is my old (and bad) answer.
I would stick them in a database with multiple concatenated indexes. How many is up to you.
At a minimum I would have 2: an index on (col1, col2, col3, col4, col5, col6) and one on (col4, col5, col6, col1, col2, col3). This would mean that, no matter which column was missing, there would be a way to get your data while looking through at most 1/1000 of the records. If you wish, you could instead index (col1, col2, col3, col4, col5, col6), (col3, col4, col5, col6, col1, col2) and (col5, col6, col1, col2, col3, col4) to limit your search to 1/10000 of the data. This uses half again as much memory but is 10 times faster. (Warning: I will not guarantee that MySQL will successfully figure out which index it should use. I'd hope that other databases would get it right, but I haven't tested them.)

If you don't want to use a database, you can use balanced binary trees exactly as I suggested using the indexes above. For any given search, pick the tree that has the missing element as deep as possible. Do a range search. Filter the returned data for just the rows of interest. This is, in fact, exactly what a good database should do with those indexes.
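An in-memory sketch of that two-index idea in Java, using sorted maps in place of database indexes (the rotation scheme and all names are assumptions; the length is fixed at 6 as in the question):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class TwoIndexTable {
    // "index" 1: items in natural order (col1..col6);
    // "index" 2: rotated so that col4..col6 come first
    private final NavigableMap<String, Long> byPrefix = new TreeMap<>();
    private final NavigableMap<String, Long> byRotated = new TreeMap<>();

    private static String rotate(String s) {
        return s.substring(3) + s.substring(0, 3);
    }

    void add(String item, long count) { // item is a 6-symbol string
        byPrefix.put(item, count);
        byRotated.put(rotate(item), count);
    }

    /** pattern contains exactly one '?'; use the index that buries the wildcard deepest. */
    Character guess(String pattern) {
        int pos = pattern.indexOf('?');
        boolean natural = pos >= 3;                  // wildcard in the back half
        NavigableMap<String, Long> idx = natural ? byPrefix : byRotated;
        String p = natural ? pattern : rotate(pattern);
        int wpos = natural ? pos : pos + 3;          // wildcard position within this index
        String prefix = p.substring(0, wpos);        // at least 3 known leading symbols
        Character best = null;
        long bestCount = -1;
        // range search over all keys sharing the known prefix, then filter the suffix
        for (Map.Entry<String, Long> e
                : idx.subMap(prefix, prefix + Character.MAX_VALUE).entrySet()) {
            String k = e.getKey();
            if (k.regionMatches(wpos + 1, p, wpos + 1, p.length() - wpos - 1)
                    && e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = k.charAt(wpos);
            }
        }
        return best;
    }
}
```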
to limit your search to 1/10000 of the data. This uses half again as much memory, but is 10 times faster. (Warning, I will not guarantee that MySQL will successfully figure out which index it should use. I'd hope that other databases would get it right, but haven't tested it.)If you wished to not use a database you could use balanced binary trees exactly as I was suggesting using indexes above. For any given search pick the tree that has the missing element as deep as possible. Do a range search. Filter the returned data for just the rows of interest. This is, in fact, exactly what a good database should do above with those indexes.