PHP library/class to count words in various languages?

Posted on 2024-09-03 14:19:30

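At some point in the near future I will need to implement a cross-language word count or, if that is not possible, a cross-language character count.

By word count I mean an accurate count of the words contained in a given text, taking the language of the text into account. The language of the text is set by the user and will be assumed to be correct.

By character count I mean a count of the "possibly in a word" characters contained in a given text, using the same language information described above.

I would much prefer the former count, but I am aware of the difficulties involved. I also know the latter count is much easier, but I still want the former if at all possible.

I'd love it if I only had to look at English, but I need to consider every language here: Chinese, Korean, English, Arabic, Hindi, and so on.

I would like to know whether Stack Overflow has any leads on where to start looking for an existing product or method to do this in PHP, as I am a good lazy programmer*

A simple test shows that str_word_count combined with setlocale does not work, and neither does the function from the str_word_count page on php.net.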

*http://blogscoped.com/archive/2005-08-24-n14.html

看轻我的陪伴 2024-09-10 14:19:30

Counting chars is easy:

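echo strlen('一个有十的字符的句子'); // 30 (WRONG! strlen counts bytes, not characters)
echo strlen(utf8_decode('一个有十的字符的句子')); // 10 (utf8_decode is deprecated as of PHP 8.2)
echo mb_strlen('一个有十的字符的句子', 'UTF-8'); // 10 (requires the mbstring extension)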
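The question asks for a count of characters that could be part of a word, not of all characters. A minimal sketch of that idea, assuming "possibly in a word" means Unicode letters, combining marks, and digits (that definition and the helper name `count_word_chars` are assumptions, not something stated in the thread):

```php
<?php
// Hypothetical helper: counts code points that are letters (\p{L}),
// combining marks (\p{M}), or digits (\p{N}). Spaces and punctuation are skipped.
function count_word_chars(string $text): int
{
    // preg_match_all returns the number of matches, or false on error
    // (for example, when $text is not valid UTF-8).
    $n = preg_match_all('/[\p{L}\p{M}\p{N}]/u', $text);
    return $n === false ? 0 : $n;
}

echo count_word_chars('一个有十的字符的句子'), "\n"; // 10
echo count_word_chars('Hello, world!'), "\n";        // 10
```

Including `\p{M}` means that combining marks, which scripts such as Devanagari rely on, are counted as separate code points. Whether that matches what a reader would call a "character" depends on the language.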

Counting words is where things get tricky, especially for Chinese, Japanese, and other languages that don't use spaces (or other common "word boundary" characters) as word separators. I don't speak Chinese and don't understand how word counting works in it, so you'll have to educate me a bit: what makes a word in these languages? Is it a specific character or set of characters? I remember reading about how hard it was to identify Japanese words in T9 input, but I can't find it anymore.

The following should correctly return the number of words in languages that use spaces or punctuation characters as word separators:

// -1 means "no limit"; passing null for the limit is deprecated since PHP 8.1
count(preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY));
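For text that has no spaces between words, the regex above has nothing to split on. One option that this answer does not mention is the word break iterator from PHP's intl extension, which wraps ICU and uses dictionaries to segment Chinese, Japanese, and similar scripts. The sketch below assumes the intl extension is installed; the function name `count_words_icu` is hypothetical, and the segmentation ICU produces for a given language may not match what a native reader would call a word.

```php
<?php
// Hypothetical helper: counts segments that ICU classifies as words or numbers.
// Requires the intl extension.
function count_words_icu(string $text, string $locale): int
{
    $it = IntlBreakIterator::createWordInstance($locale);
    $it->setText($text);

    $count = 0;
    $it->first(); // position on the first boundary (offset 0)
    while ($it->next() !== IntlBreakIterator::DONE) {
        // The rule status describes the segment that ends at the current boundary.
        // Values below WORD_NONE_LIMIT are spaces, punctuation, and other non-words.
        if ($it->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $count++;
        }
    }
    return $count;
}

echo count_words_icu('Hello, world!', 'en'), "\n";
echo count_words_icu('一个有十的字符的句子', 'zh'), "\n";
```

The result for the Chinese sentence depends on the dictionary shipped with the installed ICU version, so no expected output is given for it.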

追风人 2024-09-10 14:19:30

A quick trick, if you only want an approximate rather than exact word count, is:

<?php echo count(explode(' ',$string)); ?>

It works by splitting on spaces, so it applies to any language that puts spaces between words. I have used this in a translator script. Again, it will not give the exact number of words, only an approximate count for a paragraph.
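One reason the count is only approximate is that `explode` treats every single space as a separator, so repeated or trailing spaces produce empty pieces and newlines are not treated as separators at all. A small comparison, with hypothetical helper names, against splitting on runs of whitespace:

```php
<?php
// Hypothetical helpers, for comparison only.
function count_by_explode(string $s): int
{
    return count(explode(' ', $s));
}

function count_by_whitespace(string $s): int
{
    // Split on runs of whitespace and drop empty pieces.
    $parts = preg_split('/\s+/u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? 0 : count($parts);
}

$text = "one  two\nthree ";
echo count_by_explode($text), "\n";    // 4: "one", "", "two\nthree", ""
echo count_by_whitespace($text), "\n"; // 3: "one", "two", "three"
```

Neither version helps with languages that do not separate words with whitespace.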

很糊涂小朋友 2024-09-10 14:19:30

Well, try:

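<?php
// Counts space-separated tokens that contain at least one Latin letter or digit.
function count_words($str) {
    $words = 0;
    // The ereg*() functions were removed in PHP 7; the PCRE functions replace them.
    $str = preg_replace('/ +/', ' ', $str);
    $tokens = explode(' ', $str);
    foreach ($tokens as $token) {
        // The u modifier makes the accented ranges match as UTF-8 characters.
        if (preg_match('/[0-9A-Za-zÀ-ÖØ-öø-ÿ]/u', $token)) {
            $words++;
        }
    }
    return $words;
}
echo count_words('This is the second one , it will count wrong as well" , it will count 12 instead of 11 because the comma is counted too.');
?>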
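The character class in this function only covers Latin letters and digits, so tokens written in Arabic, Hindi, or other scripts are never counted. A sketch of the same token test using Unicode properties instead (the name `count_words_unicode` is hypothetical, and this is a variation on the answer, not the author's code):

```php
<?php
// Hypothetical variant: a token counts as a word if it contains
// at least one letter (\p{L}) or digit (\p{N}) from any script.
function count_words_unicode(string $str): int
{
    $tokens = preg_split('/\s+/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return 0; // e.g. invalid UTF-8 input
    }

    $words = 0;
    foreach ($tokens as $token) {
        if (preg_match('/[\p{L}\p{N}]/u', $token) === 1) {
            $words++;
        }
    }
    return $words;
}

echo count_words_unicode('नमस्ते दुनिया , 42'), "\n"; // 3: the lone comma is skipped
```

Like the original, this still relies on whitespace between words, so it does not address Chinese or Japanese text.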
