What causes the differences between word counters?

Posted on 2024-11-29 07:43:39

I created a word counter in Ruby as a little exercise in learning Ruby.

I've used the word counters on JavaScriptKit.com and WordCountTool.com as well as the one in OpenOffice Writer.

Some text produced the following results:

OpenOffice: 458 words
JavaScriptKit: 453 words
WordCountTool: 455 words
Mine: 461 words

My question is this: Why do the counts differ for the same exact excerpt across all counters?

What are problems in a script that might cause an inaccurate, but still close count?

What are some ways I could improve upon my script so that it's more accurate?

Comments (2)

沉睡月亮 2024-12-06 07:43:39

You're really asking for a definition of a "word", which for counting purposes could mean very different things. Let's take your original post as an example.

The most simplistic counting tool would be:

text.split.count                      #=> 111

Yet what if you had put "Why do the counts differ/change for the same[...]"? Well, clearly "differ/change" is two words, so we should probably count forward slashes as word delimiters. In fact, just because I forgot to put a space between a full stop and the next word, doesn't make them the same word, so let's include full stops as delimiters too. Yet I can't be bothered to check whether it's a URL, so those websites you mention will have to count as two words:

text.split(/[\s\.\/\?]+/).count       #=> 113

OK, that's cool, but numbers are not technically words, and if they were spoken, 458 would be "four hundred and fifty eight", which is actually 5 words. So let's discount them too:

text.split(/[\s\.\/\?0-9]+/).count    #=> 109

You get the idea. The results you got only differed by 8 words - so clearly their definitions of a word are not all that different. But word counts are only ever a rough guide, so don't worry about the discrepancies.
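
To see the effect side by side, here is a minimal sketch that runs the three definitions above on one string. The method names and the sample sentence are made up for illustration, and the counts it prints apply to that sentence only, not to your excerpt. The `reject(&:empty?)` call is an addition of mine: it drops the empty string that `split` returns when the text starts with a delimiter.

```ruby
# Definition 1: a word is anything separated by whitespace.
def count_by_whitespace(text)
  text.split.count
end

# Definition 2: full stops, slashes and question marks also separate words.
def count_with_punctuation(text)
  text.split(/[\s\.\/\?]+/).reject(&:empty?).count
end

# Definition 3: as above, and digits are treated as separators too,
# so standalone numbers are not counted as words.
def count_without_numbers(text)
  text.split(/[\s\.\/\?0-9]+/).reject(&:empty?).count
end

# Hypothetical sample: contains a slash, a missing space after "?",
# a number, and a domain name.
sample = "Why do counts differ/change?It costs 458 dollars at example.com."

puts count_by_whitespace(sample)
puts count_with_punctuation(sample)
puts count_without_numbers(sample)
```

Three reasonable definitions give three different totals for the same sentence, which is the same kind of spread you saw between the tools.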

沉溺在你眼里的海 2024-12-06 07:43:39

You'll get different results depending on what the author of the word counter (WC) has decided counts as a 'word'. Depending on the counter, certain types of punctuation may be treated as word separators, as well as whitespace, newlines, etc.

Some info from the Wikipedia article on word count: http://en.wikipedia.org/wiki/Word_count

Different word counting programs may give varying results, depending
on the definition of "word"
