How do I perform a full-text search of flat files with Perl?
We have a Perl-based web application whose data originates from a vast repository of flat text files. Those flat files are placed into a directory on our system; we extensively parse them, inserting bits of information into a MySQL database, and subsequently move the files to their archived repository and permanent home (/www/website/archive/*.txt). Now, we don't parse every single bit of data from these flat files, and some of the more obscure data items never make it into the database.
The current requirement is for users to be able to perform a full-text search of the entire flat-file repository from a Perl-generated web page and get back a list of hits that they can then click to open the text files for review.
What is the most elegant, efficient, and non-CPU-intensive way to provide this search capability?
4 Answers
I'd recommend, in this order:
Suck the whole of every document into a MySQL table and use MySQL's full-text search and indexing features. I've never done it, but MySQL has always been able to handle more than I can throw at it.
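A minimal sketch of that approach, assuming a hypothetical archive_docs table; the table, column, and connection details are illustrative, and note that older MySQL versions only support FULLTEXT indexes on MyISAM tables:

    # Hypothetical schema (FULLTEXT requires MyISAM on older MySQL):
    #   CREATE TABLE archive_docs (
    #       id   INT AUTO_INCREMENT PRIMARY KEY,
    #       path VARCHAR(255) NOT NULL,
    #       body MEDIUMTEXT,
    #       FULLTEXT (body)
    #   ) ENGINE=MyISAM;
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:database=website', 'user', 'password',
        { RaiseError => 1 } );

    # Load each archived file into the table once, at archive time.
    my $insert = $dbh->prepare(
        'INSERT INTO archive_docs (path, body) VALUES (?, ?)' );
    for my $path ( glob '/www/website/archive/*.txt' ) {
        open my $fh, '<', $path or die "open $path: $!";
        my $text = do { local $/; <$fh> };    # slurp the whole file
        $insert->execute( $path, $text );
    }

    # Natural-language full-text query, ranked by relevance.
    my $hits = $dbh->selectall_arrayref(
        'SELECT path, MATCH(body) AGAINST (?) AS score
           FROM archive_docs
          WHERE MATCH(body) AGAINST (?)
          ORDER BY score DESC',
        {}, $ARGV[0], $ARGV[0],
    );
    print "$_->[0]\n" for @$hits;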
Swish-E still exists and is designed for building full-text indexes and allowing ranked results. I've been running it for a few years and it works pretty well.
You can use File::Find in your Perl code to chew through the repository like grep -r, but it will suck compared to one of the indexed options above. However, it will work, and might even surprise you :)
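For completeness, a sketch of that brute-force scan, assuming the search term is treated as a literal, case-insensitive string; it rereads every file on every query, which is exactly the CPU cost the indexed options avoid:

    use strict;
    use warnings;
    use File::Find;

    my $pattern = qr/\Q$ARGV[0]\E/i;    # literal, case-insensitive match
    my @hits;

    find(
        sub {
            return unless -f && /\.txt\z/;
            open my $fh, '<', $_ or return;
            while ( my $line = <$fh> ) {
                if ( $line =~ $pattern ) {
                    push @hits, $File::Find::name;  # full path for the hit list
                    last;                           # one match per file is enough
                }
            }
        },
        '/www/website/archive',
    );

    print "$_\n" for @hits;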
I recommend using a dedicated search engine to do your indexing and searches.
I haven't looked at search engines recently, but I used ht://dig a few years ago, and was happy with the results.
Update: It looks like ht://dig is a zombie project at this point. You may want to use another engine. Hyper Estraier, besides being unpronounceable, looks promising.
I second the recommendation to add an indexing machine. Consider Namazu from http://namazu.org. When I needed it, it looked easier to get started with than Swish-e or ht://dig, and I'm quite content with it.
If you don't want the overhead of an indexer, look at forking a grep/egrep. Once the text volume reaches multiple megabytes, this will be significantly faster than scanning solely in Perl, e.g.:
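A sketch of what that fork might look like, assuming a fixed-string, case-insensitive search; the quoting here is deliberately naive, so sanitize user input properly before handing it to a shell:

    use strict;
    use warnings;

    my $pattern = shift @ARGV;
    $pattern =~ s/'/'\\''/g;    # naive single-quote escaping for the shell

    # grep -lsiF prints only the names of matching files, suppresses
    # read errors, ignores case, and treats the pattern as a fixed string.
    open my $grep, '-|',
        "find /www/website/archive -name '*.txt' -print0 "
      . "| xargs -0 grep -lsiF -- '$pattern'"
        or die "cannot fork: $!";

    chomp( my @hits = <$grep> );
    close $grep;

    print "$_\n" for @hits;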
Bonus: use file name conventions like dates/tags/etc. to reduce the set of files to grep.
The clunky find ... | xargs ... is meant to work around the shell's size limits on wildcard expansion, which you might hit with big archives.
I see someone recommended Lucene/Plucene. Check out KinoSearch; I have been using it for a year or more on a Catalyst-based project and am very happy with the performance and ease of programming/maintenance.
The caveat on that page should be considered for your circumstance, but I can attest to the module's stability.
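A sketch of the indexing and query sides, based on the classic KinoSearch 0.1x synopsis; the field names and index location are illustrative, and the API changed in later releases (KinoSearch eventually evolved into Apache Lucy), so check the docs for your version:

    use strict;
    use warnings;
    use KinoSearch::InvIndexer;
    use KinoSearch::Searcher;
    use KinoSearch::Analysis::PolyAnalyzer;

    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

    # Index side: build the inverted index after files land in the archive.
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => '/www/website/invindex',
        create   => 1,
        analyzer => $analyzer,
    );
    $invindexer->spec_field( name => 'path' );
    $invindexer->spec_field( name => 'body' );

    for my $path ( glob '/www/website/archive/*.txt' ) {
        open my $fh, '<', $path or die "open $path: $!";
        my $doc = $invindexer->new_doc;
        $doc->set_value( path => $path );
        $doc->set_value( body => do { local $/; <$fh> } );  # slurp file contents
        $invindexer->add_doc($doc);
    }
    $invindexer->finish;

    # Search side: ranked hits come back with their stored fields.
    my $searcher = KinoSearch::Searcher->new(
        invindex => '/www/website/invindex',
        analyzer => $analyzer,
    );
    my $hits = $searcher->search( query => $ARGV[0] );
    while ( my $hit = $hits->fetch_hit_hashref ) {
        print "$hit->{path}\n";
    }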