PHP 中的多个关键字(100 到 1000)搜索(字符串搜索算法)
我在我的 PHP 项目中需要解决这个问题,其中一些关键字(从几百到几千,长度可能会有所不同)需要在大约 100-300 个字符长的字符串中搜索,有时长度较小,为 30-50 个字符。我可以预处理关键字以重新用于搜索字符串的新实例。我对 PHP 有点陌生,没有在 PHP 库中找到执行此操作的方法。经过一番搜索,我在 Aho Corasick 算法中找到了一些不错的候选算法,然后是 Sun Wu 和 Udi Manber 的改进,它似乎也被称为 agrep (或者是 agrep 的一部分): http://webglimpse.net/pubs/TR94-17.pdf
有拉宾·卡普,后缀树等也一样,但它们看起来不太合适,因为第一个是固定长度关键字,而后者似乎很通用,需要相当多的工作。
谁能告诉我在 php 中自己实现 Agrep/Sun Wu-Manber 是否是解决这个问题的好方法?还有其他反馈吗?
编辑:正如我在下面的评论中提到的,有数百个或更多不同的搜索关键字,因此正则表达式无济于事。所以这个回应没有帮助。
I have this problem to solve in my PHP project where some keywords (from a few hundreds to a few thousands, lengths can vary) need to be searched in a string about 100-300 characters long, sometimes of lesser length 30-50 chars. I can preprocess the keywords for reusing for new instances of search strings. I am kind of new to PHP and did not find a way to do this in the PHP library. Doing a bit of searching, I found a few good candidates in Aho Corasick algorithm and then this improvement by Sun Wu and Udi Manber, which also seems to be known as agrep (or is a part of agrep): http://webglimpse.net/pubs/TR94-17.pdf
There is Rabin Karp, Suffix Trees etc too but they did not look quite suitable as first was for fixed length keywords and latter seems quite generic and will need rather a lot of work.
Can anyone let me know if implementing the Agrep/Sun Wu-Manber on my own in php is a good way to solve this problem? Any other feedback?
EDIT: as I mentioned below in a comment, there are hundreds or more of distinct search keywords, so regex will not help. So that response is not helpful.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我认为你可以通过使用“Levenshtein distance”度量来解决这个问题。
来自维基百科;
另外,PHP 有一个 levenshtein() 方法。使用您的关键字列表作为数组 &可搜索字符串作为输入并迭代数组,并在每次迭代中使用 levenshtein() 进行匹配。
I think you can solve this problem by using "Levenshtein distance" metric.
From wikipedia;
Plus, PHP has a levenshtein() method. Use your keyword list as array & searchable string as input and iterate over your array and use levenshtein() in each iteration for matching.
从 PHP 5.5 开始,PHP 的
strtr
使用 Wu-Manbers 算法进行多模式匹配。请参阅 PHP git 存储库中的 提交 ccf15cf2有关实施的详细信息。根据我的经验,这是相当有效的。Aho-Corasick 算法的纯 PHP 实现可在此处获取:https://packagist.org/包/wikimedia/aho-corasick
As of PHP 5.5, PHP's
strtr
uses the Wu-Manbers algorithm for multi-pattern matching. See commit ccf15cf2 in the PHP git repository for details about the implementation. It is quite efficient, in my experience.A pure-PHP implementation of the Aho-Corasick algorithm is available here: https://packagist.org/packages/wikimedia/aho-corasick