Find common phrases of 3-8 words in a body of text using PHP

Posted 2024-10-14 04:19:46 · 128 words · 2 views · 0 comments

I'm looking for a way to find common phrases within a body of text using PHP. If it's not possible in php, I'd be interested in other web languages that would help me complete this.

Neither memory nor speed is an issue.

Right now, I'm able to easily find keywords, but don't know how to go about searching phrases.

Comments (6)

草莓味的萝莉 2024-10-21 04:19:46

I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence count. Then it counts common sequences of those words with the specified parameters. It's old code and not commented, but maybe you'll find it useful.
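The script itself isn't reproduced here, so the following is only a rough sketch of the approach described (split into words, then count sequences of those words), not the author's code. The function name `count_phrases` and its `$min`/`$max` parameters are made up for illustration, and it assumes plain English text where lowercasing and splitting on non-alphanumeric characters is good enough.

```php
<?php
// Hypothetical helper (not from the original script): counts every run of
// $min..$max consecutive words and returns the runs seen more than once.
function count_phrases(string $text, int $min = 3, int $max = 8): array
{
    // Assumption: English text. Lowercase, then split on anything that is
    // not a letter, digit, or apostrophe, discarding empty pieces.
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    $counts = [];

    // Slide a window of each length over the word list and tally the phrases.
    for ($n = $min; $n <= $max; $n++) {
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrase = implode(' ', array_slice($words, $i, $n));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // Keep only phrases that repeat, most frequent first.
    $counts = array_filter($counts, fn($c) => $c > 1);
    arsort($counts);
    return $counts;
}
```

Calling `print_r(count_phrases(file_get_contents('mytext.txt')));` would list the repeated 3-8 word phrases. Note that this sketch lets phrases run across sentence boundaries; splitting into sentences first would avoid that.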

呆头 2024-10-21 04:19:46

Using just PHP? The most straightforward I can come up with is:

  • Add each phrase to an array
  • Get the first phrase from the array and remove it
  • Find the number of phrases that match it and remove those, keeping a count of matches
  • Push the phrase and the number of matches to a new array
  • Repeat until initial array is empty

I'm trash at formal CS, but I believe this is O(n^2), with n(n-1)/2 comparisons in the worst case (when no phrase repeats). I have no doubt there's a better way to do this, but you mentioned that efficiency is a non-issue, so this'll do.

Code follows (I used a function that was new to me: array_keys with a search parameter):

// assign the source text to $text
$text = file_get_contents('mytext.txt');

// there are other ways to do this, like preg_match_all,
// but this is computationally the simplest
$phrases = explode('.', $text);

// trim surrounding whitespace from each phrase
// if you're in PHP5, you can use a foreach loop here
$num_phrases = count($phrases);
for($i = 0; $i < $num_phrases; $i++) {
  $phrases[$i] = trim($phrases[$i]);
}

$counts = array();

while(count($phrases) > 0) {
  $p = array_shift($phrases);
  $keys = array_keys($phrases, $p);
  $c = count($keys);
  $counts[$p] = $c + 1;

  if($c > 0) {
    foreach($keys as $key) {
      unset($phrases[$key]);
    }
  }
}

print_r($counts);

View it in action: http://ideone.com/htDSC
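As for the "better way": one possible shortcut, sketched here under the assumption that exact string matches are all that's needed, is to let `array_count_values` do the tallying in a single pass. Because it uses a hash table, the work is roughly linear in the number of phrases. The function name `count_sentences` is made up for this example, and unlike the loop above it drops empty pieces (such as the one after the final period).

```php
<?php
// Hypothetical helper: tallies identical period-delimited phrases in one pass.
function count_sentences(string $text): array
{
    // Split on periods, trim whitespace, and drop empty pieces.
    $phrases = array_filter(array_map('trim', explode('.', $text)), 'strlen');

    // array_count_values maps each distinct phrase to its number of occurrences.
    $counts = array_count_values($phrases);

    // Most frequent phrases first, keeping the phrase => count association.
    arsort($counts);
    return $counts;
}
```

For example, `count_sentences("Hello there. How are you. Hello there.")` counts "Hello there" twice and "How are you" once.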

沦落红尘 2024-10-21 04:19:46

I think you should go for str_word_count:

$str = "Hello friend, you're
       looking          good today!";

print_r(str_word_count($str, 1));

will give

Array
(
    [0] => Hello
    [1] => friend
    [2] => you're
    [3] => looking
    [4] => good
    [5] => today
)

Then you can use array_count_values()

$array = array(1, "hello", 1, "world", "hello");
print_r(array_count_values($array));

which will give you

Array
(
    [1] => 2
    [hello] => 2
    [world] => 1
)
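
Chaining the two calls might look like the sketch below. The function name `word_frequencies` is an assumption made for this example. Note that on its own this only counts single words (keywords); for phrases, the word array would still need to be grouped into runs of consecutive words before counting.

```php
<?php
// Hypothetical helper: word => count, most common word first.
function word_frequencies(string $text): array
{
    // Format 1 makes str_word_count return the list of words it found.
    $words = str_word_count(strtolower($text), 1);

    // Tally how often each distinct word appears.
    $counts = array_count_values($words);

    // Sort by count, descending, keeping the word => count association.
    arsort($counts);
    return $counts;
}
```

For instance, `word_frequencies("Hello friend, hello world")` reports `hello` twice and the other two words once each.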

握住你手 2024-10-21 04:19:46

An ugly solution, since you said ugly is ok, would be to search for the first word for any of your phrases. Then, once that word is found, check if the next word past it matches the next expected word in the phrase. This would be a loop that would keep going so long as the hits are positive until either a word is not present or the phrase is completed.

Simple, but exceedingly ugly and probably very, very slow.
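
A minimal sketch of that loop, assuming the phrases to look for are already known and that case-insensitive word matching is acceptable. The function name `count_phrase_occurrences` is made up for illustration.

```php
<?php
// Hypothetical helper: counts how often $phrase appears in $text as a run
// of consecutive words (case-insensitive, overlapping matches included).
function count_phrase_occurrences(string $text, string $phrase): int
{
    $words  = str_word_count(strtolower($text), 1);
    $target = str_word_count(strtolower($phrase), 1);
    $total  = count($words);
    $len    = count($target);
    if ($len === 0) {
        return 0; // nothing to search for
    }

    $hits = 0;
    for ($i = 0; $i + $len <= $total; $i++) {
        // Only start checking when the first word of the phrase matches.
        if ($words[$i] !== $target[0]) {
            continue;
        }
        // Keep going while each following word matches the expected one.
        $j = 1;
        while ($j < $len && $words[$i + $j] === $target[$j]) {
            $j++;
        }
        if ($j === $len) {
            $hits++; // the whole phrase matched
        }
    }
    return $hits;
}
```

This has to be run once per phrase, which is where the slowness comes from when the list of phrases is long.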

玉环 2024-10-21 04:19:46

Coming in late here, but since I stumbled upon this while looking to do a similar thing, I thought I'd share where I landed in 2019:

https://packagist.org/packages/yooper/php-text-analysis

This library made my task downright trivial. In my case, I had an array of search phrases that I wound up breaking into single terms, normalizing, and then turning into two- and three-word n-grams. Looping through the resulting n-grams, I was able to easily summarize the frequency of specific phrases.

$words   = tokenize($searchPhraseText);
$words   = normalize_tokens($words);
$ngram2  = array_unique(ngrams($words, 2));
$ngram3  = array_unique(ngrams($words, 3));

Really cool library with a lot to offer.

酷炫老祖宗 2024-10-21 04:19:46

If you want full-text search in HTML files, use Sphinx, a powerful search server.
Documentation is here
