Fastest way to search a 1 GB+ data string for the first occurrence of a pattern
There's a 1 gigabyte string of arbitrary data which you can assume to be equivalent to something like:
1_gb_string=os.urandom(1*gigabyte)
We will be searching this string, 1_gb_string, for an infinite number of fixed-width, 1 kilobyte patterns, 1_kb_pattern. Every time we search, the pattern will be different, so caching opportunities are not apparent. The same 1 gigabyte string will be searched over and over. Here is a simple generator to describe what's happening:
def findit(gb_string):                      # gb_string stands in for 1_gb_string above
    while True:                             # an endless stream of patterns
        kb_pattern = get_next_pattern()     # the next 1 KB needle, different each time
        yield gb_string.find(kb_pattern)
Note that only the first occurrence of the pattern needs to be found. After that, no other major processing should be done.
What can I use that's faster than Python's built-in find() for matching 1 KB patterns against 1 GB or greater data strings, with memory requirements limited to 16 GB?
(I am already aware of how to split up the string and search it in parallel, so you can disregard that basic optimization.)
Answers (11)
As you clarify that long-ish preprocessing is acceptable, I'd suggest a variant of Rabin-Karp: "an algorithm of choice for multiple pattern search", as Wikipedia puts it.
Define a "rolling hash" function, i.e., one such that, when you know the hash for
haystack[x:x+N]
, computing the hash forhaystack[x+1:x+N+1]
is O(1).(Normal hashing functions such as Python's built-in
hash
do not have this property, which is why you have to write your own, otherwise the preprocessing becomes exhaustingly long rather than merely long-ish;-).A polynomial approach is fruitful, and you could use, say, 30-bit hash results (by masking if needed, i.e., you can do the computation w/more precision and just store the masked 30 bits of choice).
Let's call this rolling hash function RH for brevity.
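For concreteness, a minimal sketch of one possible RH, assuming a polynomial hash over bytes masked to 30 bits; the base, window width, and function names are illustrative choices, not part of the answer.

BASE = 257                                  # illustrative base
WINDOW = 1024                               # 1 KB window
MASK = (1 << 30) - 1                        # keep 30 bits, i.e. arithmetic mod 2**30
BASE_TOP = pow(BASE, WINDOW - 1, 1 << 30)   # weight of the byte leaving the window

def rh_full(window: bytes) -> int:
    # Hash a full window from scratch (used once per needle, and to seed the roll).
    h = 0
    for b in window:
        h = (h * BASE + b) & MASK
    return h

def rh_roll(h: int, outgoing: int, incoming: int) -> int:
    # O(1) update: drop the byte leaving the window, append the byte entering it.
    h = (h - outgoing * BASE_TOP) & MASK
    return (h * BASE + incoming) & MASK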
So, compute 1 G of RH results as you roll along the haystack 1 GB string; if you just stored these it would give you an array H of 1 G 30-bit values (4 GB) mapping index-in-haystack -> RH value. But you want the reverse mapping, so use instead an array A of 2**30 entries (1 G entries) that for each RH value gives you all the indices of interest in the haystack (indices at which that RH value occurs); for each entry you store the index of the first possibly-interesting haystack index into another array B of 1 G indices into the haystack, which is ordered to keep all indices into the haystack with identical RH values ("collisions" in hashing terms) adjacent.
H, A and B all have 1 G entries of 4 bytes each, so 12 GB total.
Now for each incoming 1 K needle, compute its RH, call it k, and use it as an index into A; A[k] gives you the first index b into B at which it's worth comparing. So, do:
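A sketch of that lookup, under two extra assumptions of mine: A is laid out CSR-style with 2**30 + 1 entries so that A[k + 1] marks the end of k's group in B, and offsets within a group are stored in increasing order (rh_full is the rolling-hash sketch above).

def find_needle(haystack: bytes, A, B, needle: bytes) -> int:
    k = rh_full(needle)                   # RH of the 1 KB needle
    ib = A[k]                             # first B entry whose window hashes to k
    while ib < A[k + 1]:                  # walk every candidate offset in k's group
        b = B[ib]
        if haystack[b:b + len(needle)] == needle:   # weed out hash collisions
            return b                      # offsets ascend within a group: first occurrence
        ib += 1
    return -1                             # needle not present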
With a good RH you should have few collisions, so the while should execute very few times until returning one way or another.
So each needle-search should be really really fast.
There are a number of string matching algorithms used in the field of genetics to find substrings. You might try this paper or this paper.
As far as I know, the standard find algorithm is a naive algorithm with a complexity of about n*m comparisons, because it checks the pattern against every possible offset. There are some more effective algorithms, requiring about n+m comparisons.
If your string is not a natural language string, you can try the Knuth–Morris–Pratt algorithm. The Boyer–Moore search algorithm is fast and simple enough too.
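For reference, a minimal pure-Python sketch of Knuth–Morris–Pratt over bytes; this is purely illustrative, since CPython's bytes.find runs in C and a pure-Python scan is unlikely to beat it in wall-clock time.

def kmp_find(haystack: bytes, needle: bytes) -> int:
    # fail[i] = length of the longest proper prefix of needle[:i + 1]
    # that is also a suffix of needle[:i + 1]
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    # scan the haystack; the index into it never moves backwards
    k = 0
    for i, byte in enumerate(haystack):
        while k and byte != needle[k]:
            k = fail[k - 1]
        if byte == needle[k]:
            k += 1
        if k == len(needle):
            return i - len(needle) + 1    # position of the first occurrence
    return -1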
Are you willing to spend a significant time preprocessing the string?
If you are, what you can do is build a list of n-grams with offsets.
Suppose your alphabet is hex bytes and you are using 1-grams.
Then for 00-ff, you can create a dictionary that looks like this (perlese, sorry), where you walk down the string and build the @array_of_offsets from all points where that byte occurs. You can do this for arbitrary n-grams.
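A Python rendering of that dictionary for the 1-gram case (the layout and names are mine; the answer sketched it in Perl):

def build_1gram_offsets(haystack: bytes) -> list:
    offsets = [[] for _ in range(256)]    # one @array_of_offsets per byte value 00-ff
    for i, b in enumerate(haystack):
        offsets[b].append(i)
    return offsets

A search then only needs to try the offsets recorded for the pattern's first byte (or first n-gram).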
This provides a "start point for search" that you can use to walk.
Of course, the downside is that you have to preprocess the string, so that's your tradeoff.
edit:
The basic idea here is to match prefixes. This may bomb out badly if the information is super-similar, but if it has a fair amount of divergence between n-grams, you should be able to match prefixes pretty well.
Let's quantify divergence, since you've not discussed the kind of info you're analyzing. For the purposes of this algorithm, we can characterize divergence as a distance function: you need a decently high Hamming distance. If the Hamming distance between n-grams is, say, 1, the above idea won't work. But if it's n-1, the above algorithm will be much easier.
To improve on my algorithm, let's build an algorithm that does some successive elimination of possibilities:
We can invoke Shannon Entropy to define information of a given n-gram. Take your search string and successively build a prefix based upon the first m characters. When the entropy of the m-prefix is 'sufficiently high', use it later.
In a sense, this is like reversing Huffman encoding.
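A rough sketch of that prefix selection, assuming byte-level Shannon entropy; the threshold and the cap on prefix length are illustrative values, not taken from the answer.

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def choose_prefix(needle: bytes, threshold: float = 3.0, max_len: int = 16) -> bytes:
    # grow the m-prefix until it carries 'sufficiently high' entropy
    for m in range(2, max_len + 1):
        if shannon_entropy(needle[:m]) >= threshold:
            return needle[:m]
    return needle[:max_len]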
One efficient but complex way is full-text indexing with the Burrows-Wheeler transform. It involves performing a BWT on your source text, then using a small index on that to quickly find any substring in the text that matches your input pattern.
The time complexity of this algorithm is roughly O(n) with the length of the string you're matching - and independent of the length of the input string! Further, the size of the index is not much larger than the input data, and with compression can even be reduced below the size of the source text.
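A toy sketch of the idea. It uses a naive O(n^2 log n) suffix sort and linear-time rank queries, so it only illustrates the mechanics of BWT backward search; indexing a real 1 GB blob needs linear-time suffix-array construction and a compressed rank structure. The function names and the sentinel convention are mine, not the answer's.

SENTINEL = 0   # assumes this byte value can serve as a terminator for the demo

def build_fm_index(text: bytes):
    t = text + bytes([SENTINEL])
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # suffix array, built naively
    bwt = bytes(t[i - 1] for i in sa)                 # last column of the sorted rotations
    C = [0] * 257                      # C[c] = number of bytes in t strictly smaller than c
    for b in bwt:
        C[b + 1] += 1
    for c in range(1, 257):
        C[c] += C[c - 1]
    return bwt, C, sa

def fm_find_first(index, pattern: bytes) -> int:
    bwt, C, sa = index
    lo, hi = 0, len(bwt)               # suffix-array interval of the matches so far
    for b in reversed(pattern):        # backward search, last byte first
        lo = C[b] + bwt[:lo].count(b)  # rank queries (naive and slow here)
        hi = C[b] + bwt[:hi].count(b)
        if lo >= hi:
            return -1                  # pattern does not occur
    return min(sa[lo:hi])              # leftmost occurrence in the text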
With infinite memory, you can hash every 1k string along with its position in the 1 GB file.
With less than infinite memory, you will be bounded by how many memory pages you touch when searching.
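A literal sketch of that idea (the names are mine). Hashing every window from scratch costs on the order of n * 1024 byte operations up front, which is exactly where a rolling hash, as in the Rabin-Karp answer, pays off.

from collections import defaultdict

WIDTH = 1024

def index_every_window(haystack: bytes) -> dict:
    index = defaultdict(list)          # hash of each 1 K window -> starting positions
    for i in range(len(haystack) - WIDTH + 1):
        index[hash(haystack[i:i + WIDTH])].append(i)
    return index

def lookup(index: dict, haystack: bytes, needle: bytes) -> int:
    # needle is assumed to be exactly WIDTH bytes; different windows can share a
    # hash value, so confirm each candidate byte-for-byte
    for i in index.get(hash(needle), []):
        if haystack[i:i + WIDTH] == needle:
            return i                   # positions were appended in increasing order
    return -1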
I don't know definitively if the find() method for strings is faster than the search() method provided by Python's re (regular expressions) module, but there's only one way to find out. If you're just searching a string, what you want is this:
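A sketch of that call; re.escape is needed because an arbitrary 1 KB byte pattern will almost certainly contain regex metacharacters (kb_pattern and gb_string are placeholder names).

import re

match = re.search(re.escape(kb_pattern), gb_string)   # both arguments are bytes
first_offset = match.start() if match else -1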
However, if you really only want the first match, you might be better off using finditer(), which returns an iterator, and with such a large operation might actually perform better.
http://www.youtube.com/watch?v=V5hZoJ6uK-s
will be of most value to you. It's an MIT lecture on Dynamic Programming.
If the patterns are fairly random, you can precompute the locations of n-byte prefixes of the string.
Instead of going over all options for n-prefixes, just use the actual ones in the 1 GB string - there will be fewer than 1 Gig of those. Use as big a prefix as fits in your memory; I don't have 16 GB of RAM to check, but a prefix of 4 could work (at least in a memory-efficient data structure); if not, try 3 or even 2.
For a random 1 GB string and random 1 KB patterns, you should get a few tens of locations per prefix if you use 3-byte prefixes, but 4-byte prefixes should get you an average of 0 or 1, so lookup should be fast.
Precompute Locations
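A sketch of the precompute step, assuming 4-byte prefixes and a plain dict (a tighter structure would be needed to stay comfortably within the memory budget):

from collections import defaultdict

PREFIX_LEN = 4   # drop to 3 or 2 if this does not fit in memory

def precompute_locations(haystack: bytes) -> dict:
    locations = defaultdict(list)
    for i in range(len(haystack) - PREFIX_LEN + 1):
        locations[haystack[i:i + PREFIX_LEN]].append(i)   # offsets end up in increasing order
    return locations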
Look up a pattern
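And a matching sketch of the lookup: only the few offsets that share the pattern's prefix need a full comparison.

def find_pattern(haystack: bytes, locations: dict, pattern: bytes) -> int:
    for i in locations.get(pattern[:PREFIX_LEN], []):     # PREFIX_LEN as defined above
        if haystack[i:i + len(pattern)] == pattern:
            return i          # offsets were stored in increasing order: first occurrence
    return -1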
Someone hinted at a possible way to index this thing if you have abundant RAM (or possibly even disk/swap) available.
Imagine if you performed a simple 32-bit CRC on a 1 K block extending from each character in the original Gig string. This would result in 4 bytes of checksum data for each byte offset from the beginning of the data.
By itself this might give a modest improvement in search speed. The checksum of each 1 K search target could be checked against each CRC ... with each collision tested for a true match. That should still be a couple of orders of magnitude faster than a normal linear search.
That, obviously, costs us 4 GB of RAM for the CRC array (plus the original Gig for the original data and a little more overhead for the environment and our program).
If we have ~16 GB we could sort the checksums and store a list of offsets where each is found. That becomes an indexed search (an average of about 16 probes per target search ... worst case around 32 or 33; there might be a fence post there).
It's possible that a 16 GB file index would still give better performance than a linear checksum search and it would almost certainly be better than a linear raw search (unless you have extremely slow filesystems/storage).
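A sketch of that sorted-checksum index using zlib.crc32 (the names are mine). Note that computing a fresh CRC for every byte offset does roughly 1024 bytes of work per offset, so the preprocessing is far from free.

import bisect
import zlib

WIDTH = 1024

def build_crc_index(haystack: bytes) -> list:
    # (checksum, offset) pairs, sorted so equal checksums sit together
    entries = [(zlib.crc32(haystack[i:i + WIDTH]), i)
               for i in range(len(haystack) - WIDTH + 1)]
    entries.sort()
    return entries

def crc_find_first(haystack: bytes, entries: list, needle: bytes) -> int:
    target = zlib.crc32(needle)                     # needle is assumed to be WIDTH bytes
    j = bisect.bisect_left(entries, (target, -1))   # binary search into the sorted index
    while j < len(entries) and entries[j][0] == target:
        offset = entries[j][1]
        if haystack[offset:offset + WIDTH] == needle:   # resolve CRC collisions
            return offset                           # offsets ascend within equal checksums
        j += 1
    return -1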
(Adding): I should clarify that this strategy is only beneficial given that you've described a need to do many searches on the same one gigabyte data blob.
You might use a threaded approach to building the index (while reading it as well as having multiple threads performing the checksumming). You might also offload the indexing into separate processes or a cluster of nodes (particularly if you use a file-based index --- the ~16 GB option described above). With a simple 32-bit CRC you might be able to perform the checksums/indexing as fast as your reader thread can get the data (but we are talking about 1024 checksums for each 1 K of data so perhaps not).
You might further improve performance by coding a Python module in C for actually performing the search ... and/or possibly for performing the checksumming/indexing.
The development and testing of such C extensions entail other trade-offs, obviously enough. It sounds like this would have near zero re-usability.
In this problem, we have a 1-gigabyte string of arbitrary data named 1_gb_string, and we need to search for an infinite number of fixed-width 1-kilobyte patterns (1_kb_pattern) within this string. The goal is to find the first occurrence of each pattern efficiently, considering memory limitations of up to 16 GB.
To accomplish this, we can use the following approach:
Step 1: Define a function search_pattern_in_large_string that takes the pattern and large_string as input.
Step 2: Set the chunk size to read from the large string. In this case, we choose a chunk size of 1024 bytes (1 kilobyte).
Step 3: Open the large string file and iterate through it in chunks until the end of the file is reached.
Step 4: Use regular expressions (re) to search for the pattern within each chunk of the large string.
Step 5: If a match is found, return the matched pattern. This will provide the first occurrence of the pattern in the large string.
Step 6: If the pattern is not found after searching through the entire large string, return None.
Step 7: Now, you can use the search_pattern_in_large_string function by passing the pattern to search (pattern_to_search) and the file path of the large string (large_string_file_path). The function will return the first occurrence of the pattern, or None if the pattern is not found.
By implementing this approach, you can efficiently search for 1-kilobyte patterns within a 1-gigabyte or larger data string while considering memory limitations. The function reads the string in manageable chunks, utilizing regular expressions to find the pattern's first occurrence.