PHP library/class for counting words in various languages?

Posted on 2024-09-03 14:19:30

Some time in the near future I will need to implement a cross-language word count, or if that is not possible, a cross-language character count.

By word count I mean an accurate count of the words contained within the given text, taking the language of the text into account. The language of the text is set by a user, and will be assumed to be correct.

By character count I mean a count of the "possibly in a word" characters contained within the given text, with the same language information described above.

I would much prefer the former count, but I am aware of the difficulties involved. I am also aware that the latter count is much easier, but I would still prefer the former if at all possible.

I'd love it if I just had to look at English, but I need to consider every language here: Chinese, Korean, English, Arabic, Hindi, and so on.

I would like to know if Stack Overflow has any leads on where to start looking for an existing product / method to do this in PHP, as I am a good lazy programmer*

A simple test showing how str_word_count with setlocale doesn't work, and a function from php.net's str_word_count page.

*http://blogoscoped.com/archive/2005-08-24-n14.html

Comments (3)

看轻我的陪伴 2024-09-10 14:19:30

Counting chars is easy:

echo strlen('一个有十的字符的句子'); // 30 (WRONG!)
echo strlen(utf8_decode('一个有十的字符的句子')); // 10
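
On current PHP versions utf8_decode() is deprecated (as of PHP 8.2), so a sketch along the following lines may be preferable. It assumes the mbstring extension is loaded and that the input is valid UTF-8. The helper names utf8_char_count and word_char_count are made up for illustration. The second helper is one possible reading of the "possibly in a word" count from the question: it counts only Unicode letters, combining marks and digits.

```php
<?php
// Hypothetical helper: number of Unicode code points in a UTF-8 string.
// Assumes the mbstring extension is available.
function utf8_char_count(string $text): int
{
    return mb_strlen($text, 'UTF-8');
}

// Hypothetical helper: counts only characters that could be part of a word
// (letters \p{L}, combining marks \p{M}, digits \p{N}); spaces and
// punctuation are skipped.
function word_char_count(string $text): int
{
    $n = preg_match_all('/[\p{L}\p{M}\p{N}]/u', $text);
    return $n === false ? 0 : $n;
}

echo utf8_char_count('一个有十的字符的句子'), "\n"; // 10
echo word_char_count('Hello, world!'), "\n";       // 10
```

Note that both helpers count code points, so a letter followed by a combining mark counts as two.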

Counting words is where things start to get tricky, especially for Chinese, Japanese and other languages that don't use spaces (or other common "word boundary" characters) as word separators. I don't speak Chinese and I don't understand how word counting works in Chinese, so you'll have to educate me a bit: what makes a word in these languages? Is it a specific char or set of chars? I remember reading something about how hard it was to identify Japanese words in T9 writing, but I can't find it anymore.
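
For languages without spaces, one possible starting point is the word break iterator from PHP's intl extension, which wraps ICU. The sketch below assumes intl is installed. The function name intl_word_count and the $locale parameter are illustrative only. ICU segments Chinese and Japanese text with a dictionary, so the result is a heuristic rather than a linguistically exact count.

```php
<?php
// Sketch: count words with ICU's word break iterator (requires ext-intl).
// $locale is the user-supplied language tag, e.g. 'zh', 'ja' or 'en'.
function intl_word_count(string $text, string $locale): int
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($text);

    $count = 0;
    foreach ($iterator->getPartsIterator() as $part) {
        // The rule status describes the segment that was just produced.
        // Values below WORD_NONE_LIMIT mean spaces or punctuation, not words.
        if ($iterator->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $count++;
        }
    }
    return $count;
}

echo intl_word_count('Hello, world!', 'en'), "\n";
```

Because the segmentation depends on the ICU version and its dictionaries, counts for Chinese or Japanese text can differ between servers.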

The following should correctly return the number of words in languages that use spaces or punctuation chars as word separators:

count(preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY));

追风人 2024-09-10 14:19:30

If you only want an approximate rather than an exact word count, a quick trick is:

<?php echo count(explode(' ',$string)); ?>

It works by splitting the text on spaces, whatever the language. I have used this for a translator script. Again, it will not give the exact word count, only an approximate number of words in a paragraph.
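
The plain explode() call counts an empty element for every extra space, and it returns 1 for text that contains no spaces at all. A slightly more tolerant sketch of the same idea splits on runs of whitespace instead. The helper name approx_word_count is made up for illustration.

```php
<?php
// Hypothetical helper: approximate word count by splitting on runs of whitespace.
function approx_word_count(string $text): int
{
    $parts = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? 0 : count($parts);
}

echo count(explode(' ', 'two  words')), "\n";       // 3, the double space yields an empty element
echo approx_word_count('two  words'), "\n";         // 2
echo approx_word_count('一个有十的字符的句子'), "\n"; // 1, there are no spaces to split on
```

The last line shows why this stays an approximation: it only works for languages that separate words with whitespace.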

很糊涂小朋友 2024-09-10 14:19:30

Well, try:

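<?php
function count_words($str) {
    $words = 0;
    // Collapse runs of spaces so that explode() yields no empty tokens between words.
    // preg_replace() stands in for eregi_replace(), which was removed in PHP 7.
    $str = preg_replace('/ +/', ' ', $str);
    $array = explode(' ', $str);
    foreach ($array as $token) {
        // Count a token only if it contains a digit or a Latin letter,
        // so a stand-alone comma is not counted as a word.
        if (preg_match('/[0-9A-Za-zÀ-ÖØ-öø-ÿ]/u', $token)) {
            $words++;
        }
    }
    return $words;
}
echo count_words('This is the second one , it will count wrong as well" , it will count 12 instead of 11 because the comma is counted too.');
?>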