What causes the differences between different word counters?
I created a word counter in Ruby as a little exercise in learning Ruby.
I've used the word counters on JavaScriptKit.com and WordCountTool.com as well as the one in Open Office Writer.
The same text produced the following results:
OpenOffice: 458 words
JavaScriptKit: 453 words
WordCountTool: 455 words
Mine: 461 words
My question is this: Why do the counts differ for the same exact excerpt across all counters?
What are problems in a script that might cause an inaccurate, but still close count?
What are some ways I could improve upon my script so that it's more accurate?
Comments (2)
You're really asking for a definition of a "word", which for counting purposes could mean very different things. Let's take your original post as an example.
The most simplistic counting tool would be
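A minimal sketch of such a tool in Ruby, assuming it does nothing more than split on whitespace (the method name `naive_word_count` is made up for illustration):

```ruby
# Hypothetical helper: every run of non-whitespace characters is one "word".
def naive_word_count(text)
  # String#split with no arguments splits on runs of whitespace
  # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
  text.split.length
end

puts naive_word_count("Why do the counts differ for the same exact excerpt?")
```

Under this definition anything glued together without a space, such as "differ/change", is a single word.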
Yet what if you had put
"Why do the counts differ/change for the same[...]"
? Well, clearly "differ/change" is two words, so we should probably count forward slashes as word delimiters. In fact, just because I forgot to put a space between a full stop and the next word doesn't make them the same word, so let's include full stops as delimiters too. Yet I can't be bothered to check whether something is a URL, so those websites you mention will have to count as two words each.

OK, that's cool, but actually numbers are not technically words, and if they were spoken, 458 would be "four hundred and fifty eight", which is actually 5 words. So let's discount them too.
You get the idea. The results you got only differed by 8 words - so clearly their definitions of a word are not all that different. But word counts are only ever a rough guide, so don't worry about the discrepancies.
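To make the point concrete, here is a rough Ruby sketch that runs three of the definitions discussed above over one sentence. The sample text, the `COUNTERS` hash, and the rule names are all invented for illustration; none of them is claimed to match what OpenOffice or either website actually does.

```ruby
# Made-up sample with a URL-like token, a number, a missing space
# after a full stop, and a slash-joined pair of words.
SAMPLE = "I tried JavaScriptKit.com and got 453 words.Why do the counts differ/change?"

# Each lambda is one possible (assumed) definition of a "word".
COUNTERS = {
  # Split on whitespace only.
  whitespace_only: ->(t) { t.split.length },
  # Also treat forward slashes and full stops as delimiters.
  slash_and_period: ->(t) { t.split(%r{[\s/.]+}).reject(&:empty?).length },
  # Count only runs of letters (allowing inner apostrophes), so numbers are dropped.
  letters_only: ->(t) { t.scan(/[[:alpha:]]+(?:'[[:alpha:]]+)*/).length }
}

COUNTERS.each do |name, counter|
  puts "#{name}: #{counter.call(SAMPLE)}"
end
# whitespace_only: 11
# slash_and_period: 14
# letters_only: 13
```

Three reasonable rules give three different totals for a single sentence, which is the same kind of small spread seen in the question.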
You'll get different results depending on what the author of the word counter has decided counts as a 'word'. Depending on the counter, certain types of punctuation may be treated as word separators, as well as whitespace, newlines, etc.
Some info from the Wikipedia article on word count: http://en.wikipedia.org/wiki/Word_count