Finding common phrases of 3-8 words in a body of text using PHP
I'm looking for a way to find common phrases within a body of text using PHP. If it's not possible in PHP, I'd be interested in other web languages that would help me do this.
Neither memory nor speed is an issue.
Right now I can easily find keywords, but I don't know how to go about searching for phrases.
Comments (6)
I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence count. Then it counts common sequences of those words with the specified parameters. It's old code and not commented, but maybe you'll find it useful.
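The script linked above is not reproduced here, so the following is only a minimal sketch of the general idea it describes: split the text into words, then count every run of consecutive words. The function name findCommonPhrases and its parameters ($minLen, $maxLen, $minCount) are hypothetical, and str_word_count is mainly suited to plain ASCII text.

```php
<?php
// Hypothetical helper (not the script referenced above): count every run
// of $minLen..$maxLen consecutive words and keep the ones that repeat.
function findCommonPhrases(string $text, int $minLen = 3, int $maxLen = 8, int $minCount = 2): array
{
    // Lowercase, then split into words; punctuation is dropped.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $counts = [];

    for ($start = 0; $start < $total; $start++) {
        for ($len = $minLen; $len <= $maxLen && $start + $len <= $total; $len++) {
            $phrase = implode(' ', array_slice($words, $start, $len));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // Keep only phrases seen at least $minCount times, most frequent first.
    $common = array_filter($counts, fn (int $n): bool => $n >= $minCount);
    arsort($common);
    return $common;
}

$text = 'The quick brown fox jumps. Later the quick brown fox sleeps.';
print_r(findCommonPhrases($text));
```

Since memory is not a concern, keeping a counter for every candidate phrase is acceptable. Note that sub-phrases of a repeated phrase are counted too, so "the quick brown" appears alongside "the quick brown fox".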
Using just PHP? The most straightforward approach I can come up with is:
I'm trash at formal CS, but I believe this is of n^2 complexity, specifically involving n(n-1)/2 comparisons in the worst case. I have no doubt there is some better way to do this, but you mentioned that efficiency is a non-issue, so this'll do.
Code follows (I used a function that was new to me, array_keys, which accepts a search parameter):
View it in action: http://ideone.com/htDSC
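The code for this answer did not survive on this page, so what follows is not the author's code. It is a rough sketch of one pairwise comparison in the same spirit, assuming the goal is to find runs of 3 to 8 words that occur at least twice. The function name findRepeatedRuns is hypothetical. It relies on array_keys with a search value, and it compares at most n(n-1)/2 pairs of positions.

```php
<?php
// Hypothetical sketch, not the answer's original code: compare every pair
// of positions holding the same word and extend the match word by word.
function findRepeatedRuns(string $text, int $minLen = 3, int $maxLen = 8): array
{
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $found = [];

    foreach (array_unique($words) as $word) {
        // array_keys() with a search value returns every index holding $word.
        $positions = array_keys($words, $word, true);
        $n = count($positions);
        for ($a = 0; $a < $n; $a++) {
            for ($b = $a + 1; $b < $n; $b++) {
                $i = $positions[$a];
                $j = $positions[$b];
                // Extend while the words agree, the two runs do not
                // overlap, and the run stays within $maxLen words.
                $len = 0;
                while ($len < $maxLen && $i + $len < $j && $j + $len < $total
                    && $words[$i + $len] === $words[$j + $len]) {
                    $len++;
                }
                if ($len >= $minLen) {
                    $found[implode(' ', array_slice($words, $i, $len))] = true;
                }
            }
        }
    }
    return array_keys($found);
}

print_r(findRepeatedRuns('The quick brown fox jumps. Later the quick brown fox sleeps.'));
```

This reports the longest run found from each pair of starting positions, so shorter phrases contained inside a longer repeated one are only listed when they start at a different word.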
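I think you should go for str_word_count, which (with its format argument set to 1) will give you an array of the words in the text.
Then you can use array_count_values(), which will give you the number of times each word occurs.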
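The example outputs from this answer are missing here, so this is a small illustration of the two calls on a made-up sentence. On their own they count single words; to count phrases you would first have to join adjacent words into candidate phrases.

```php
<?php
$text = 'the cat sat on the mat and the dog sat too';

// Format 1 makes str_word_count() return the words instead of a count.
$words = str_word_count($text, 1);

// Map each distinct word to the number of times it occurs.
$counts = array_count_values($words);

// Sort by frequency, highest first, keeping the word keys.
arsort($counts);

print_r($counts);
```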
An ugly solution, since you said ugly is ok, would be to search for the first word for any of your phrases. Then, once that word is found, check if the next word past it matches the next expected word in the phrase. This would be a loop that would keep going so long as the hits are positive until either a word is not present or the phrase is completed.
Simple, but exceedingly ugly and probably very, very slow.
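A minimal sketch of that loop, assuming you already have a list of phrases to look for. The function name countKnownPhrases is hypothetical, and matching is case-insensitive here only because both sides are lowercased first.

```php
<?php
// Hypothetical helper: count how often each known phrase occurs by finding
// its first word, then checking the following words one at a time.
function countKnownPhrases(string $text, array $phrases): array
{
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $hits = [];

    foreach ($phrases as $phrase) {
        $target = str_word_count(strtolower($phrase), 1);
        $hits[$phrase] = 0;
        if ($target === []) {
            continue; // nothing to match
        }
        for ($i = 0; $i < $total; $i++) {
            if ($words[$i] !== $target[0]) {
                continue; // first word does not match at this position
            }
            // Keep going while each next word matches the expected one.
            $k = 1;
            while ($k < count($target) && $i + $k < $total && $words[$i + $k] === $target[$k]) {
                $k++;
            }
            if ($k === count($target)) {
                $hits[$phrase]++; // the whole phrase was found
            }
        }
    }
    return $hits;
}

print_r(countKnownPhrases(
    'To be or not to be, that is the question. To be is to do.',
    ['to be', 'that is the question', 'not here']
));
```

This only works for phrases you know in advance; it does not discover new ones.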
Coming in late here, but since I stumbled upon this while looking to do a similar thing, I thought I'd share where I landed in 2019:
https://packagist.org/packages/yooper/php-text-analysis
This library made my task downright trivial. In my case, I had an array of search phrases that I wound up breaking into single terms, normalizing, and then turning into two- and three-word n-grams. Looping through the resulting n-grams, I was able to easily summarize the frequency of specific phrases.
Really cool library with a lot to offer.
If you want full-text search in HTML files, use Sphinx, a powerful search server.
Documentation is here