How does the entropy of a string of English text signify low quality?
Jeff Atwood recently tweeted a link to a CodeReview post in which he asked whether the community could improve his "calculating entropy of a string" code snippet. He explained, "We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality."
The gist of his method seems to be that counting the number of unique characters in a string signifies its entropy (code taken from PieterG's answer):
int uniqueCharacterCount = string.Distinct().Count();
I don't understand how the unique character count signifies entropy of a string, and how the entropy of a string signifies low quality. I was wondering if someone with more knowledge in this area could explain what Mr. Atwood is trying to accomplish.
Thanks!
The confusion seems to be from the idea that this is used to block posts from being posted - it's not.
It is just one of several algorithms used to find possible low-quality posts, displayed on the low quality posts tab (requires 10k rep) of the moderator tools. Actual humans still need to look at the post.
The idea is to catch posts like
~~~~~~No.~~~~~~
or
FUUUUUUUU------
not to catch all low-quality posts.
As for "How does the unique character-count signify entropy?" - it doesn't, really. The most upvoted answers completely miss the point.
See https://codereview.stackexchange.com/questions/868#878 and https://codereview.stackexchange.com/questions/868#926
String 'aaaaaaaaaaaaaaaaaaaaaaaaaaa' has very low entropy, and is rather meaningless.
String 'blah blah blah blah blah blah blah blah' has a bit higher entropy, but is still rather silly and can be a part of an attack.
A post or a comment that has entropy comparable to these strings is probably not appropriate; it can't contain any meaningful message, even a spam link. Such a post can be just filtered out or warrant an additional captcha.
Let's look at the Wikipedia entry on Entropy (information theory):
And specifically with English information:
In other words, it's not simply that low entropy is bad and high entropy is good, or vice versa - there is an optimal entropy range.
The Shannon entropy H(P) is a property of the probability distribution P of a random variable X.
In the case of a string, a rudimentary approach is to treat it as a bag of characters. The frequency counts then provide an approximation of the probability distribution P of a randomly chosen character in the string.
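The bag-of-characters estimate described above can be sketched in a few lines. This is an illustration in Python rather than the C# of the snippet in the question, and the function name `char_entropy` is made up for this example; it is not the code that Stack Overflow runs.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the empirical
    character distribution of `text` (bag-of-characters model)."""
    if not text:
        return 0.0  # assumption: treat the empty string as zero entropy
    counts = Counter(text)          # frequency count of each character
    total = len(text)
    # H(P) = sum over characters of p * log2(1 / p), with p = n / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    for sample in ("a" * 27, "blah " * 7 + "blah"):
        print(repr(sample), len(set(sample)), round(char_entropy(sample), 2))
```

Under this model a string made of a single repeated character scores 0 bits per character, while the "blah blah ..." string from the earlier answer comes out at roughly 2.3 bits per character. Note that this measure ignores character order entirely.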
If we were to simply count the number of unique characters in a string, this would correlate with the entropy of a uniform distribution over the unique characters that appear in that string. The greater the number of unique characters, the greater that entropy would be.
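To make the link between the unique character count and entropy concrete, here is the standard calculation for a uniform distribution. The symbols k (number of unique characters) and p_i (probability of the i-th character) are introduced only for this illustration.

```latex
% Uniform distribution over k unique characters: p_i = 1/k for i = 1, ..., k
H(P) = -\sum_{i=1}^{k} p_i \log_2 p_i
     = -\sum_{i=1}^{k} \frac{1}{k} \log_2 \frac{1}{k}
     = \log_2 k
```

So the unique character count k determines only log2(k), which is the largest entropy any distribution over those k characters can have. A string whose characters occur with unequal frequencies has a lower entropy than this bound.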
However, Jeff Atwood's (and BlueRaja's) subsequent code contributions are better measures, as they take into account the other possible distributions that a string, still thought of as a bag of (not necessarily unique) characters, represents.
Building on Rex M's answer ... it would make more sense to look for strings where the 'character entropy' fell outside the 1.0 - 1.5 range, as possible 'low quality strings.'
Not exactly an answer to your question, but Wikipedia has this explanation of entropy: