首次出现并行字符串匹配算法
坦白说,这是家庭作业。话虽这么说,它是极其开放的,并且对于如何开始思考这个问题(或一般的并行算法),我们的指导几乎为零。我想要正确方向的指针,但不是完整的解决方案。任何有帮助的读物也将是极好的。
我正在研究一种有效的方法,使用并行算法来匹配大量文本中模式的第一次出现。该模式是简单的字符匹配,不涉及正则表达式。我设法找到一种可能的方法来查找所有匹配项,但这需要我查看所有匹配项并找到第一个匹配项。
所以问题是,我能否更成功地在进程之间分解文本并以这种方式进行扫描?或者最好进行某种进程同步搜索,其中第 j 个进程搜索模式的第 j 个字符?如果所有进程都为其匹配返回 true,则进程将更改其在匹配所述模式中的位置并再次向上移动,继续直到所有字符都已匹配,然后返回第一个匹配的索引。
到目前为止,我所拥有的都是非常基本的,而且很可能不起作用。我不会实现这个,但任何指示将不胜感激。
对于 p 个处理器、长度为 t 的文本、长度为 L 的模式以及使用的 L 个处理器的上限:
for i=0 to t-l: for j=0 to p: processor j compares the text[i+j] to pattern[i+j] On false match: all processors terminate current comparison, i++ On true match by all processors: Iterate p characters at a time until L characters have been compared If all L comparisons return true: return i (position of pattern) Else: i++
To be up front, this is homework. That being said, it's extremely open ended and we've had almost zero guidance as to how to even begin thinking about this problem (or parallel algorithms in general). I'd like pointers in the right direction and not a full solution. Any reading that could help would be excellent as well.
I'm working on an efficient way to match the first occurrence of a pattern in a large amount of text using a parallel algorithm. The pattern is simple character matching, no regex involved. I've managed to come up with a possible way of finding all of the matches, but that then requires that I look through all of the matches and find the first one.
So the question is, will I have more success breaking the text up between processes and scanning that way? Or would it be best to have process-synchronized searching of some sort where the j'th process searches for the j'th character of the pattern? If then all processes return true for their match, the processes would change their position in matching said pattern and move up again, continuing until all characters have been matched and then returning the index of the first match.
What I have so far is extremely basic, and more than likely does not work. I won't be implementing this, but any pointers would be appreciated.
With p processors, a text of length t, and a pattern of length L, and a ceiling of L processors used:
for i=0 to t-l: for j=0 to p: processor j compares the text[i+j] to pattern[i+j] On false match: all processors terminate current comparison, i++ On true match by all processors: Iterate p characters at a time until L characters have been compared If all L comparisons return true: return i (position of pattern) Else: i++
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
给定长度为 L 的模式,并在 P 个处理器上搜索长度为 N 的字符串,我只需将字符串拆分到处理器上即可。每个处理器将占用一个长度为 N/P + L-1 的块,最后一个 L-1 与属于下一个处理器的字符串重叠。然后每个处理器将执行 boyer moore(两个预处理表将共享)。当每个进程完成时,它们会将结果返回给第一个处理器,该处理器维护一个表。
在所有进程都响应后(或者稍微考虑一下您可以提前逃脱),您将返回第一个匹配项。平均应该是 O(N/(L*P) + P)。
让第 i 个处理器匹配第 i 个字符的方法将需要太多的进程间通信开销。
编辑:我意识到你已经有了一个解决方案,并且正在找出一种方法,而不必找到所有解决方案。嗯,我真的不认为这种方法是必要的。您可以想出一些早期的转义条件,它们并不那么困难,但我认为它们一般不会提高您的性能(除非您对文本中匹配的分布有一些额外的了解)。
Given a pattern of length L, and searching in a string of length N over P processors I would just split the string over the processors. Each processor would take a chunk of length N/P + L-1, with the last L-1 overlapping the string belonging to the next processor. Then each processor would perform boyer moore (the two pre-processing tables would be shared). When each finishes, they will return the result to the first processor, which maintains a table
After all processes have responded (or with a bit of thought you can have an early escape), you return the first match. This should be on average O(N/(L*P) + P).
The approach of having the i'th processor matching the i'th character would require too much inter process communication overhead.
EDIT: I realize you already have a solution, and are figuring out a way without having to find all solutions. Well I don't really think this approach is necessary. You can come up with some early escape conditions, they aren't that difficult, but I don't think they'll improve your performance that much in general (unless you have some additional knowledge the distribution of matches in your text).
恐怕断了弦也不行。
一般来说,早期转义是很困难的,所以你最好将文本分成几块。
但是,让我们首先请 Herb Sutter 在 Dr Dobbs 上解释使用并行算法进行搜索。这个想法是利用分布的不均匀性来获得早期回报。当然,萨特对任何比赛都感兴趣,这不是眼前的问题,所以让我们适应一下。
这是我的想法,假设我们有:
N
p
的文本 处理器max
是一个块应包含的最大字符数,可能比模式长度 M 大一个数量级。现在,您想要的是将文本分割为
k
相等的块,其中k
是最小的,而size(chunk)
是最大但较差的至最大值
。然后,我们有一个经典的生产者-消费者模式:p 进程被提供文本块,每个进程在它接收到的块中查找模式。
早期的逃脱是通过旗帜来完成的。您可以设置在其中找到模式(及其位置)的块的索引,或者您可以仅设置一个布尔值,并将结果存储在进程本身中(在这种情况下,您必须遍历所有进程一旦停止)。要点是,每次请求一个块时,生产者都会检查该标志,如果发现匹配,则停止向进程提供数据(因为进程已按顺序获得了块)。
让我们举个例子,有 3 个处理器:
块
6
和8
都包含字符串。生产者首先将 1、2 和 3 喂给进程,然后每个进程将按照自己的节奏前进(这取决于搜索到的文本和模式的相似度)。
假设我们先在
8
中找到该模式,然后再在6
中找到该模式。然后正在处理7
的进程结束并尝试获取另一个块,生产者停止它 -->这是无关紧要的。然后,6
上的处理结束,并产生结果,因此我们知道第一个出现在6
中,并且我们知道了它的位置。关键的想法是你不想看全文!太浪费了!
I am afraid that breaking the string will not do.
Generally speaking, early escaping is difficult, so you'd be better off breaking the text in chunks.
But let's ask Herb Sutter to explain searching with parallel algorithms first on Dr Dobbs. The idea is to use the non-uniformity of the distribution to have an early return. Of course Sutter is interested in any match, which is not the problem at hand, so let's adapt.
Here is my idea, let's say we have:
N
p
Processorsmax
is the maximum number of characters a chunk should contain, probably an order of magnitude greater thanM
the length of the pattern.Now, what you want is to split your text into
k
equal chunks, wherek
is is minimal andsize(chunk)
is maximal yet inferior tomax
.Then, we have a classical
Producer-Consumer
pattern: thep
processes are feeded with the chunks of text, each process looking for the pattern in the chunk it receives.The early escape is done by having a flag. You can either set the index of the chunk in which you found the pattern (and its position), or you can just set a boolean, and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stop). The point is that each time a chunk is requested, the producer checks the flag, and stop feeding the processes if a match has been found (since the processes have been given the chunks in order).
Let's have an example, with 3 processors:
The chunks
6
and8
both contain the string.The producer will first feed 1, 2 and 3 to the processes, then each process will advance at its own rhythm (it depends on the similarity of the text searched and the pattern).
Let's say we find the pattern in
8
before we find it in6
. Then the process that was working on7
ends and tries to get another chunk, the producer stops it --> it would be irrelevant. Then the process working on6
ends, with a result, and thus we know that the first occurrence was in6
, and we have its position.The key idea is that you don't want to look at the whole text! It's wasteful!