Improving the performance of preprocessing a large document set
I am working on a project related to a plagiarism detection framework written in Java. My document set contains about 100 documents, and I have to preprocess them and store them in a suitable data structure. My main question is how to process a large set of documents efficiently while avoiding bottlenecks. In particular, I would like to know how to improve preprocessing performance.
Thanks
Regards
Nuwan
Comments (2)
You're a bit lacking on specifics there. Appropriate optimizations are going to depend upon things like the document format, the average document size, how you are processing them, and what sort of information you are storing in your data structure. Not knowing any of them, some general optimizations are:
Assuming that the pre-processing of a given document is independent of the pre-processing of any other document, and assuming you are running a multi-core CPU, then your workload is a good candidate for multi-threading. Allocate one thread per CPU core, and farm out jobs to your threads. Then you can process multiple documents in parallel.
More generally, do as much in memory as you can. Try to avoid reading from/writing to disk as much as possible. If you must write to disk, try to wait until you have all the data you want to write, and then write it all in a single batch.
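As a rough illustration of the "collect in memory, write once" advice, here is a minimal sketch. The class `BatchedIndexWriter`, the tab-separated "id, word count" line format, and the document ids are all made up for the example; the question does not say what the real data structure holds.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchedIndexWriter {

    // Hypothetical preprocessing result: one "docId<TAB>wordCount" line per document,
    // accumulated entirely in memory.
    static List<String> buildLines(Map<String, String> docs) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            String trimmed = e.getValue().trim();
            int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            lines.add(e.getKey() + "\t" + words);
        }
        return lines;
    }

    // Write everything with a single call instead of one small write per document.
    static void writeOnce(Path target, List<String> lines) throws IOException {
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "the quick brown fox");
        docs.put("doc2", "jumps over the lazy dog");

        Path out = Files.createTempFile("index", ".tsv");
        writeOnce(out, buildLines(docs));
        System.out.println("Wrote " + out);
    }
}
```

With only about 100 documents the whole result should fit comfortably in memory; for a much larger set you would probably flush in chunks rather than hold everything at once.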
You give very little information on which to make any good suggestions.
My default would be to process them using an executor backed by a thread pool with as many threads as your machine has cores, with each thread processing one document at a time.
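Something along these lines, as a sketch only. It assumes each document can be preprocessed independently, and the `preprocess` method (lowercase, then split on non-letters) is just a placeholder for whatever your real preprocessing does. `ParallelPreprocessor` and `preprocessAll` are names invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPreprocessor {

    // Placeholder preprocessing step (an assumption): lowercase and tokenize on non-letters.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Submits one task per document to a pool sized to the number of cores.
    static List<List<String>> preprocessAll(List<String> documents) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (String doc : documents) {
                tasks.add(() -> preprocess(doc));
            }
            // invokeAll blocks until all tasks finish and returns futures in task order.
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while preprocessing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Preprocessing task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The quick brown fox", "Jumps over the lazy dog");
        System.out.println(preprocessAll(docs));
    }
}
```

If the preprocessing turns out to be dominated by disk reads rather than CPU work, a pool of exactly one thread per core may not be the best size, so it is worth measuring before settling on a number.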