PHP: most frequently used words in a text

Posted on 2024-09-08 07:31:16


I found the code below on Stack Overflow, and it works well for finding the most common words in a string. But can I exclude common words such as "a", "if", "you", "have", etc. from the count? Or would I have to remove those elements after counting? How would I do this? Thanks in advance.

<?php

$text = "A very nice to tot to text. Something nice to think about if you're into text.";


$words = str_word_count($text, 1); 

$frequency = array_count_values($words);

arsort($frequency);

echo '<pre>';
print_r($frequency);
echo '</pre>';
?>


自演自醉 2024-09-15 07:31:16

This function extracts the most common words from a string. It takes three parameters: the string, an array of stop words, and the number of keywords to return. Load the stop words from a text file with file(), which reads the file into an array with one entry per line:

$stop_words = file('stop_words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
// Inside a class, call $this->extract_common_words($text, $stop_words) instead.
$keywords = extract_common_words($text, $stop_words);

You can use this stop_words.txt file as your main stop-word list, or create your own.

function extract_common_words($string, $stop_words, $max_count = 5) {
      $string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into one space
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z -]/', '', $string); // keep only letters, spaces and dashes
      $string = strtolower($string); // make it lowercase

      preg_match_all('/\b.*?\b/i', $string, $match_words);
      $match_words = $match_words[0];

      foreach ( $match_words as $key => $item ) {
          // drop empty matches, stop words, and words of three characters or fewer
          if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
              unset($match_words[$key]);
          }
      }

      $word_count = str_word_count( implode(" ", $match_words) , 1);
      $frequency = array_count_values($word_count);
      arsort($frequency); // most frequent first

      $keywords = array_slice($frequency, 0, $max_count);
      return $keywords;
}
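
The function compares lowercased words against the stop-word array with in_array(), so entries in the file only match if they are lowercase and free of stray whitespace. A minimal sketch of a loader that normalises the file first is shown here; load_stop_words() is a hypothetical helper that is not part of the answer, the temporary file merely stands in for stop_words.txt, and the arrow functions assume PHP 7.4 or later.

```php
<?php
// Hypothetical helper (not from the answer): load and normalise a stop-word file.
function load_stop_words(string $path): array {
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return []; // file missing or unreadable: fall back to no stop words
    }
    // Trim stray whitespace and lowercase so entries match lowercased input words.
    $lines = array_map(fn($w) => strtolower(trim($w)), $lines);
    // Drop entries that were only whitespace, remove duplicates, and reindex.
    return array_values(array_unique(array_filter($lines, fn($w) => $w !== '')));
}

// Demo with a temporary file standing in for stop_words.txt.
$path = tempnam(sys_get_temp_dir(), 'stop');
file_put_contents($path, "a\nIf \n\nyou\nhave\n");
$stop_words = load_stop_words($path);
unlink($path);

print_r($stop_words);
```

The resulting array can be passed as the second argument of extract_common_words().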


梨涡 2024-09-15 07:31:16

Here is my solution using built-in PHP functions:

most_frequent_words: finds the most frequent word(s) in a string

function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string); // Make the string lowercase

    $words = str_word_count($string, 1); // Array of all the words found in the string
    $words = array_diff($words, $stop_words); // Remove stop words from the array
    $words = array_count_values($words); // Count the occurrences of each word

    arsort($words); // Sort by count, highest first

    return array_slice($words, 0, $limit); // Return at most $limit words
}

Returns an array of the words that appear most frequently in the string, keyed by word with the count as the value.

Parameters:

string $string - The input string.

array $stop_words (optional) - Words to exclude from the result. Defaults to an empty array.

int $limit (optional) - Maximum number of words to return. Defaults to 5.
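
As a usage sketch, the function can be applied to the sentence from the question. The function body is repeated so the example runs on its own; the stop-word list is an illustrative assumption, not something given in the answer.

```php
<?php
// Same logic as the function in the answer, repeated so the example is self-contained.
function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $words = str_word_count(strtolower($string), 1); // lowercase, then split into words
    $words = array_diff($words, $stop_words);        // drop stop words before counting
    $words = array_count_values($words);             // word => number of occurrences
    arsort($words);                                  // highest count first
    return array_slice($words, 0, $limit);           // keep at most $limit entries
}

$text = "A very nice to tot to text. Something nice to think about if you're into text.";
// Illustrative stop-word list (an assumption for this example).
$stop_words = ['a', 'if', 'to', 'about', 'into', "you're"];

$top = most_frequent_words($text, $stop_words, 2);
print_r($top); // "nice" and "text", each with a count of 2
```

Because the input is lowercased inside the function, the stop words should be lowercase as well.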


相思故 2024-09-15 07:31:16

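There is no additional parameter or native PHP function that lets you pass in words to exclude. So I would just use what you have and ignore a custom set of words in what str_word_count returns.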
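
One way to ignore a custom set of words after counting is to remove them from the frequency array by key. The sketch below assumes a small, illustrative stop-word list and uses array_diff_key() with array_flip(); it is one possible approach rather than the only one.

```php
<?php
$text = "A very nice to tot to text. Something nice to think about if you're into text.";
// Illustrative stop-word list (an assumption for this example).
$stop_words = ['a', 'if', 'to', 'about', 'into', "you're"];

// Count every word first, exactly as in the question (plus lowercasing).
$frequency = array_count_values(str_word_count(strtolower($text), 1));

// Then remove the stop words: array_flip() turns the list into keys,
// and array_diff_key() drops every counted word whose key appears there.
$frequency = array_diff_key($frequency, array_flip($stop_words));

arsort($frequency); // highest count first
print_r($frequency);
```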


时光匆匆的小流年 2024-09-15 07:31:16

You can do this easily by using array_diff():

$words = array("if", "you", "do", "this", 'I', 'do', 'that');
$stopwords = array("a", "you", "if");

print_r(array_diff($words, $stopwords));

gives

 Array
(
    [2] => do
    [3] => this
    [4] => I
    [5] => do
    [6] => that
)

Note that array_diff() compares strings case-sensitively, so you have to handle upper and lower case yourself. The easiest way is to convert the text to lowercase beforehand.
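
A minimal sketch of that lowercase step, using a word list similar to the one above but with a capitalised "If" that would otherwise slip past the filter. The array_values() call is an optional addition to reindex the result.

```php
<?php
$words = array("If", "you", "do", "this", "I", "do", "that");
$stopwords = array("a", "you", "if");

// Lowercase every word first so "If" matches the stop word "if".
$words = array_map('strtolower', $words);

// array_diff() keeps the original keys; array_values() reindexes from 0.
$filtered = array_values(array_diff($words, $stopwords));

print_r($filtered);
```

Without the array_map() call, "If" would remain in the result because it does not match the lowercase "if" in the stop-word list.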
