Advice needed on thread and process architecture for search engine software
The software is a classic search engine. One portion of the app is tasked with crawling/collecting data, and another takes that data and builds an index or database. The final portion handles queries from clients: it searches the data and returns the results.
The specific engine that I'm discussing is one where the data is frequently updated (at least once per minute) so the queries must always be operating on the latest data.
My question is simple. Should these three tasks be handled by three separate processes, or a single process with multiple threads dedicated to each?
The main reason for my question is the best way to partition memory. If the crawler has to update the data available to the indexer, and the indexer has to update the datasets for the query handler, would it make sense for them all to live in the same process and share one address space? Or would it be acceptable to have separate processes that use shared memory-mapped files?
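To make the single-process option concrete, here is a minimal sketch of one way an indexer thread could hand a freshly built index to query threads inside one address space. It assumes the index is rebuilt as a whole and then published; `Index`, `IndexHolder`, and `query` are hypothetical names invented for illustration, not part of any real engine.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical index type: term -> ids of the documents containing it.
using Index = std::map<std::string, std::vector<int>>;

// Holds the most recently published index. An indexer thread would call
// publish(); query threads would call snapshot(). Illustrative only.
class IndexHolder {
public:
    // Swap in a fully built index. The lock is held only for the pointer swap,
    // so a slow rebuild never blocks queries.
    void publish(Index next) {
        auto fresh = std::make_shared<const Index>(std::move(next));
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::move(fresh);
    }

    // Return the latest index. The shared_ptr keeps that version alive for
    // the caller even if a newer one is published while the query runs.
    std::shared_ptr<const Index> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(current_ != nullptr);  // starts as an empty index, never null
        return current_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const Index> current_ = std::make_shared<const Index>();
};

// Look up a term in whatever index is current at call time.
std::vector<int> query(const IndexHolder& holder, const std::string& term) {
    auto index = holder.snapshot();
    auto it = index->find(term);
    return it == index->end() ? std::vector<int>{} : it->second;
}
```

With this shape, each query sees either the old index or the new one in full, never a half-updated one. The same publish-then-swap idea could in principle be carried over to separate processes, though the hand-off there would go through a shared mapping or a message rather than a pointer.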
I am leaning towards separate processes so that each can live on a different machine, enabling clustering, distribution, etc. But in terms of raw speed for smaller datasets, would a consolidated approach be preferred?
The OS is Windows, the language is C++.
Comments (1)
I'm no expert, but I would lean towards the separate-processes approach, as that gives the best flexibility, ability to scale, ease of management (restarting one service wouldn't affect the others) and performance.
I'd also be tempted to consider different databases for the different tasks.
If you take the approach of having one component doing one job - and doing it well, then it makes sense to apply this principle to the DB as well.
How you do that depends on where you see the performance bottlenecks being.
I'm thinking along the lines of an initial collection area, perhaps a staging area (sorting, etc) and a final area dedicated to fast access and searching.
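As a rough illustration of that collection / staging / final split, here is a small in-memory sketch. It stands in for the separate databases with plain containers, and `Record`, `stage`, and `load` are hypothetical names; a real setup would do these steps with database tables and batch jobs.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record as it arrives in the collection area.
struct Record {
    std::string key;      // e.g. a document URL or id
    int version;          // higher means more recently collected
    std::string payload;  // the collected content
};

// Staging area: sort by key with the newest version first,
// then keep only one (the newest) record per key.
std::vector<Record> stage(std::vector<Record> collected) {
    std::sort(collected.begin(), collected.end(),
              [](const Record& a, const Record& b) {
                  if (a.key != b.key) return a.key < b.key;
                  return a.version > b.version;
              });
    auto last = std::unique(collected.begin(), collected.end(),
                            [](const Record& a, const Record& b) {
                                return a.key == b.key;
                            });
    collected.erase(last, collected.end());
    return collected;
}

// Final area: a structure dedicated to fast lookup by key.
std::unordered_map<std::string, Record> load(const std::vector<Record>& staged) {
    std::unordered_map<std::string, Record> final_area;
    final_area.reserve(staged.size());
    for (const auto& record : staged) {
        final_area.emplace(record.key, record);
    }
    return final_area;
}
```

The point is only that each area has one job: the collection area accepts raw writes, staging cleans and orders them in a batch, and the final area is shaped for reads.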
SQL-to-SQL batch processes / ETL would give the best performance, I guess.
Thinking it through - I'd build 3 separate applications that together form the solution. That would also allow you to use different technology for different tasks if you really wanted to, and it allows a more flexible maintenance path.