What causes the differences between word counters?

Posted 2024-11-29 07:43:39

I created a word counter in Ruby as a little exercise in learning Ruby.

I've used the word counters on JavaScriptKit.com and WordCountTool.com as well as the one in OpenOffice Writer.

Some text produced the following results:

OpenOffice: 458 words
JavaScriptKit: 453 words
WordCountTool: 455 words
Mine: 461 words

My question is this: Why do the counts differ for the same exact excerpt across all counters?

What problems in a script might cause an inaccurate, but still close, count?

What are some ways I could improve upon my script so that it's more accurate?


沉睡月亮 2024-12-06 07:43:39

You're really asking for a definition of a "word", which for counting purposes could mean very different things. Let's take your original post as an example.

The most simplistic counting tool would be:

text.split.count                      #=> 111

Yet what if you had written "Why do the counts differ/change for the same[...]"? Clearly "differ/change" is two words, so we should probably count forward slashes as word delimiters. Likewise, forgetting to put a space between a full stop and the next word doesn't make them one word, so let's include full stops as delimiters too (the pattern below also splits on question marks). I can't be bothered to check whether something is a URL, though, so each of the websites you mention will have to count as two words:

text.split(/[\s\.\/\?]+/).count       #=> 113

OK, that's cool, but numbers are not technically words, and if they were spoken, 458 would be "four hundred and fifty eight", which is actually 5 words. So let's discount them too:

text.split(/[\s\.\/\?0-9]+/).count    #=> 109
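
To see the three definitions side by side, here is a minimal sketch that wraps each split in a method and runs it on a short made-up sentence. The method names and the sample string are assumptions for illustration; they are not the asker's script or excerpt.

```ruby
# Three possible definitions of a "word", mirroring the three splits above.
# Method names are hypothetical, chosen only for this illustration.

# Words are runs of non-whitespace.
def count_whitespace(text)
  text.split.count
end

# Full stops, slashes and question marks also separate words.
def count_punctuation(text)
  text.split(/[\s.\/?]+/).count
end

# Digits are treated as separators too, so numbers are not counted.
def count_no_digits(text)
  text.split(/[\s.\/?0-9]+/).count
end

# Made-up sample with a slash, a missing space after a full stop,
# a domain name and a number.
sample = "Counts differ/change.See example.com for 458 words?"

puts count_whitespace(sample)   # 6
puts count_punctuation(sample)  # 9
puts count_no_digits(sample)    # 8
```

The same sentence yields 6, 9 or 8 words depending only on which characters count as separators.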

You get the idea. The results you got differed by only 8 words, so clearly the tools' definitions of a word are not all that different. Word counts are only ever a rough guide, so don't worry about the discrepancies.
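
If a predictable number matters more than matching any particular tool, one option is to write the definition down as a pattern and count matches instead of splitting. The sketch below is one such choice, not the rule any of the tools mentioned is known to use: it assumes a word is a run of ASCII letters that may contain internal apostrophes or hyphens, and the method name count_words is hypothetical.

```ruby
# Count tokens that match an explicit definition of a word:
# ASCII letters, optionally joined by single apostrophes or hyphens.
# This rule is an assumption for illustration; numbers are ignored.
def count_words(text)
  text.scan(/[A-Za-z]+(?:['-][A-Za-z]+)*/).count
end

puts count_words("I've used a well-known counter: 458 words.")  # 6
```

Here "I've" and "well-known" each count once and "458" is skipped. A different pattern would give a different, equally defensible count.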
