Using parallel algorithms when reading documents
Possible Duplicate:
Improving performance of preprocessing large set of documents
Hi,
I have a document set containing about 100 documents. I have to preprocess each of these documents and compare them with each other. If I do this sequentially it will consume a huge amount of time, so I want to know some parallel algorithms that can be used, and how I can implement them using Java.
Regards,
nuwan
3 Answers
There is a lot of literature about detecting document similarity. You need to do a literature search and/or a web search for software / algorithms / techniques that match your requirements.
Simply replacing a brute-force sequential pair-wise comparison with a brute-force parallel pair-wise comparison is not the answer. With P processors that approach only gives you an O(P) speedup (at best), and you still have to deal with O(N^2 * S^2) work, where N is the number of documents and S is the average document size.
For a start, the classic way of finding similarities between two large text files involves breaking each file into lines, calculating hashes of each file's lines, sorting the hashes and comparing them. This process is O(S log S) ...
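As a rough illustration of that line-hashing idea (my own sketch, not part of the original answer), here is a minimal Java version; it uses String.hashCode and hash sets rather than explicit sorting, and assumes plain UTF-8 text files:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class LineHashSimilarity {

    // Hash every line of a file; a Set plays the role of the sorted hash list.
    static Set<Integer> lineHashes(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8).stream()
                    .map(String::hashCode)
                    .collect(Collectors.toSet());
    }

    // Jaccard-style similarity: shared line hashes / total distinct line hashes.
    static double similarity(Path a, Path b) throws IOException {
        Set<Integer> ha = lineHashes(a);
        Set<Integer> hb = lineHashes(b);
        Set<Integer> union = new HashSet<>(ha);
        union.addAll(hb);
        Set<Integer> intersection = new HashSet<>(ha);
        intersection.retainAll(hb);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }
}
```

Once each document's line hashes are computed (which parallelizes trivially, one document per task), a single pairwise comparison drops to roughly O(S) instead of O(S^2).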
If you have documents d1, d2, d3, d4 and you compare each document with every other document, it would be O(N^2). However, comparing d1 to d2 is the same as comparing d2 to d1, so you can optimize there. So basically you only need to compare d1-d2, d1-d3, d1-d4, d2-d3, d2-d4, d3-d4, which is N*(N-1)/2 comparisons: still O(N^2), but roughly half the work.
Perhaps start by building a map of all comparisons that need to be done. Then split that map into X equal-sized collections, where X is the number of processes you want to run. Finally, spin off that many threads (or farm the work out to that many servers), let them run, and then merge the results back together.
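As a rough sketch of that splitting-and-merging idea (my own illustration, not from the original answer), a fixed thread pool can take one task per unique pair; compare() below is a hypothetical placeholder for whatever similarity measure you end up using:

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PairwiseComparer {

    // Hypothetical placeholder: plug in the real document comparison here.
    static double compare(Path a, Path b) {
        return 0.0;
    }

    static List<double[]> compareAll(List<Path> docs, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<double[]>> futures = new ArrayList<>();

        // Submit only the unique pairs (i, j) with i < j: N*(N-1)/2 tasks in total.
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                final int a = i, b = j;
                Callable<double[]> task =
                        () -> new double[]{a, b, compare(docs.get(a), docs.get(b))};
                futures.add(pool.submit(task));
            }
        }

        // Merge the results back together as the workers finish.
        List<double[]> results = new ArrayList<>();
        for (Future<double[]> f : futures) {
            results.add(f.get());
        }
        pool.shutdown();
        return results;
    }
}
```

The thread pool effectively does the "split into X collections" step for you; with only ~100 documents (about 5000 pairs) this is plenty.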
If you need to preprocess each document individually (so the comparisons really don't matter at that point), then just break the problem up into as many processes as you want and distribute the work across them. Without really knowing what kind of preprocessing, comparison and document types you're dealing with, I can't go into much more detail than that.
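If the per-document preprocessing really is independent, a parallel stream is often all the work-splitting you need; preprocess() here is again a hypothetical placeholder:

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelPreprocessing {

    // Hypothetical per-document preprocessing (tokenizing, stemming, etc.).
    static List<String> preprocess(Path doc) {
        return List.of();
    }

    // Each document is independent, so the common fork-join pool can work on
    // several of them concurrently without any explicit thread management.
    static Map<Path, List<String>> preprocessAll(List<Path> docs) {
        return docs.parallelStream()
                   .collect(Collectors.toConcurrentMap(d -> d, ParallelPreprocessing::preprocess));
    }
}
```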
I'm assuming you're looking for similarities between documents rather than identical documents - if that were the case you could generate a checksum for each document in parallel, and comparing them would then be relatively easy.
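For the identical-documents case, this is roughly what the parallel checksum idea could look like (my own sketch; SHA-256 is an arbitrary choice, and HexFormat needs Java 17+):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ChecksumGrouping {

    static String sha256(Path file) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(Files.readAllBytes(file)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Checksums are computed in parallel; any group with more than one path
    // holds byte-identical documents.
    static Map<String, List<Path>> groupByChecksum(List<Path> docs) {
        return docs.parallelStream()
                   .collect(Collectors.groupingByConcurrent(ChecksumGrouping::sha256));
    }
}
```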
For similarities you could use a fingerprinting approach. I have a friend who uses this for looking for text reuse in a large corpus of documents. You can calculate the fingerprints for each document in parallel, then load the fingerprints and do the matching in memory, also in parallel.
Winnowing: Local Algorithms for Document Fingerprinting
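The selection rule from that paper can be sketched roughly as follows (only an illustration: String.hashCode stands in for the rolling hash the paper uses, and the k-gram length k and window size w are tuning parameters):

```java
import java.util.HashSet;
import java.util.Set;

public class Winnowing {

    // Hash every k-gram (substring of length k) of the text.
    static int[] kgramHashes(String text, int k) {
        int n = Math.max(text.length() - k + 1, 0);
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        return hashes;
    }

    // Winnowing: in every window of w consecutive k-gram hashes, keep the
    // minimum (the rightmost one on ties); the selected hashes form the fingerprint.
    static Set<Integer> fingerprint(String text, int k, int w) {
        int[] h = kgramHashes(text, k);
        Set<Integer> selected = new HashSet<>();
        for (int start = 0; start + w <= h.length; start++) {
            int minIdx = start;
            for (int i = start; i < start + w; i++) {
                if (h[i] <= h[minIdx]) {
                    minIdx = i; // rightmost minimum in the window
                }
            }
            selected.add(h[minIdx]);
        }
        return selected;
    }
}
```

Fingerprints for different documents can be computed in parallel (one document per task), and two documents that share many selected hashes are likely to share substantial text.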