
Finding common phrases of 3-8 words in a body of text using PHP

Posted on 2024-10-14 04:19:46

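I'm looking for a way to find common phrases within a body of text using PHP. If that isn't possible in PHP, I'd be interested in other web languages that could help me accomplish this.

Neither memory nor speed is an issue.

Right now I can easily find keywords, but I don't know how to go about searching for phrases.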


草莓味的萝莉 2024-10-21 04:19:46

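I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence counts. Then it counts common sequences of those words, using the parameters you specify. It's old code and not commented, but maybe you'll find it useful.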
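The linked script is not reproduced on this page, so the following is only a minimal sketch of the approach described above, not the author's code. It assumes PHP 7.4 or later, and the function name `common_phrases` and its parameters are hypothetical. It slides a window of 3 to 8 words across the text and keeps the word sequences that occur more than once.

```php
<?php
// Count every run of $min..$max consecutive words and keep those that
// occur at least $minCount times. All names here are illustrative.
function common_phrases(string $text, int $min = 3, int $max = 8, int $minCount = 2): array
{
    // Lowercase, then split into words (format 1 returns the list of words).
    // Note: str_word_count() skips digits and punctuation.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $counts = [];

    for ($n = $min; $n <= $max; $n++) {
        // Slide a window of $n words across the word list.
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrase = implode(' ', array_slice($words, $i, $n));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // Keep only phrases seen at least $minCount times, most frequent first.
    $counts = array_filter($counts, fn ($c) => $c >= $minCount);
    arsort($counts);

    return $counts;
}

$text = "The quick brown fox jumps. Later the quick brown fox sleeps.";
print_r(common_phrases($text));
```

Because the window ignores sentence boundaries, a phrase can span a period; splitting into sentences first would avoid that if it matters for your text.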


呆头 2024-10-21 04:19:46

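Using just PHP? The most straightforward way I can come up with is:

Add each phrase to an array
Get the first phrase from the array and remove it
Find the number of phrases that match it and remove those, keeping a count of the matches
Push the phrase and its number of matches to a new array
Repeat until the initial array is empty

I'm no good at formal CS, but I believe this is of n^2 complexity, specifically involving n(n-1)/2 comparisons in the worst case. I have no doubt there is a better way to do this, but you mentioned that efficiency is a non-issue, so this will do.

Code follows (I used a function that was new to me, array_keys, which accepts a search parameter):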

// assign the source text to $text
$text = file_get_contents('mytext.txt');

// there are other ways to do this, like preg_match_all,
// but this is computationally the simplest
$phrases = explode('.', $text);

// trim whitespace and drop empty fragments
// (explode leaves an empty string after the final period)
$phrases = array_filter(array_map('trim', $phrases), 'strlen');

$counts = array();

while(count($phrases) > 0) {
  $p = array_shift($phrases);
  $keys = array_keys($phrases, $p);
  $c = count($keys);
  $counts[$p] = $c + 1;

  if($c > 0) {
    foreach($keys as $key) {
      unset($phrases[$key]);
    }
  }
}

print_r($counts);

See it in action: http://ideone.com/htDSC
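As a possible alternative to the shift-and-search loop, the same tally can be sketched with the built-in array_count_values(), which counts identical strings in a single pass. This swaps in a different technique from the one in the answer, and the sample text is a hypothetical stand-in for the contents of mytext.txt.

```php
<?php
// Hypothetical sample text standing in for the contents of mytext.txt.
$text = "It rains. It pours. It rains. The sun comes out. It rains.";

// Split on periods, trim whitespace, and drop empty fragments.
$phrases = array_filter(array_map('trim', explode('.', $text)), 'strlen');

// array_count_values() maps each distinct string to its number of occurrences.
$counts = array_count_values($phrases);

// Sort by count, highest first, keeping the phrase as the key.
arsort($counts);

print_r($counts);
```

Like the original, this treats each sentence as one phrase, so it only finds sentences that repeat verbatim.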


沦落红尘 2024-10-21 04:19:46

I think you should go for

str_word_count

$str = "Hello friend, you're
       looking          good today!";

print_r(str_word_count($str, 1));

which will give

Array
(
    [0] => Hello
    [1] => friend
    [2] => you're
    [3] => looking
    [4] => good
    [5] => today
)

Then you can use array_count_values()

$array = array(1, "hello", 1, "world", "hello");
print_r(array_count_values($array));

which will give you

Array
(
    [1] => 2
    [hello] => 2
    [world] => 1
)
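Put together, the two calls might look like the sketch below. The sample sentence is extended from the one above purely for illustration, and lowercasing is an added assumption so that "Hello" and "hello" are counted as the same word.

```php
<?php
$str = "Hello friend, you're looking good today! Hello again, friend.";

// Lowercase first so differently capitalised words are counted together.
$words = str_word_count(strtolower($str), 1);

// Map each distinct word to the number of times it appears.
$frequencies = array_count_values($words);

// Most frequent words first.
arsort($frequencies);

print_r($frequencies);
```

Note that this counts single words only. To count phrases, the word list would first have to be grouped into runs of consecutive words.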


握住你手 2024-10-21 04:19:46

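An ugly solution, since you said ugly is OK, would be to search for the first word of any of your phrases. Once that word is found, check whether the word after it matches the next expected word in the phrase. This would be a loop that keeps going as long as the hits are positive, until either a word doesn't match or the phrase is complete.

Simple, but exceedingly ugly and probably very, very slow.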
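A rough sketch of that loop, assuming the text has already been split into a lowercase word array and that the phrase to look for is known in advance. The function name `count_phrase` is hypothetical.

```php
<?php
// Count how often a known phrase occurs by matching it word by word.
function count_phrase(array $words, array $phrase): int
{
    $n = count($phrase);
    if ($n === 0) {
        return 0;
    }

    $hits = 0;
    $last = count($words) - $n; // last index where the phrase could still fit

    for ($i = 0; $i <= $last; $i++) {
        // Only continue when the first word of the phrase matches.
        if ($words[$i] !== $phrase[0]) {
            continue;
        }
        // Keep comparing the following words until one differs or the phrase ends.
        $j = 1;
        while ($j < $n && $words[$i + $j] === $phrase[$j]) {
            $j++;
        }
        if ($j === $n) {
            $hits++;
        }
    }

    return $hits;
}

$words = str_word_count(strtolower("A stitch in time saves nine, and a stitch in time is cheap."), 1);
echo count_phrase($words, ['a', 'stitch', 'in', 'time']), "\n"; // prints 2
```

This only works when the phrases are known up front, which is why it does not by itself discover which phrases are common.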


玉环 2024-10-21 04:19:46

Coming in late here, but since I stumbled upon this while looking to do a similar thing, I thought I'd share where I landed in 2019:

https://packagist.org/packages/yooper/php-text-analysis

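This library made my task downright trivial. In my case, I had an array of search phrases that I wound up breaking into single terms, normalizing, and then turning into two-word and three-word ngrams. Looping through the resulting ngrams, I was able to easily summarize the frequency of specific phrases.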

$words   = tokenize($searchPhraseText);
$words   = normalize_tokens($words);
$ngram2  = array_unique(ngrams($words, 2));
$ngram3  = array_unique(ngrams($words, 3));

Really cool library with a lot to offer.
