How does the entropy of a string of English text signify low quality?

Published 2024-10-18 18:00:37


Jeff Atwood recently tweeted a link to a CodeReview post where he wanted to know if the community could improve his "calculating entropy of a string" code snippet. He explained, "We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality."

The gist of his method seemed to be that if you count the number of unique characters in a string, that signifies entropy (code taken from PieterG's answer):

int uniqueCharacterCount = string.Distinct().Count();

I don't understand how the unique character count signifies entropy of a string, and how the entropy of a string signifies low quality. I was wondering if someone with more knowledge in this area could explain what Mr. Atwood is trying to accomplish.

Thanks!


Comments (5)

转瞬即逝 2024-10-25 18:00:37


The confusion seems to be from the idea that this is used to block posts from being posted - it's not.

It is just one of several algorithms used to find possible low-quality posts, displayed on the low quality posts tab (requires 10k rep) of the moderator tools. Actual humans still need to look at the post.

The idea is to catch posts like ~~~~~~No.~~~~~~ or FUUUUUUUU------, not to catch all low-quality posts.


As for "How does the unique character-count signify entropy?" - it doesn't, really. The most upvoted answers completely miss the point.

See https://codereview.stackexchange.com/questions/868#878 and https://codereview.stackexchange.com/questions/868#926
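The distinct-character count from the question makes this concrete: the junk strings above use only a handful of distinct characters, while ordinary English uses many more. A minimal Python sketch (illustrative only, not Stack Overflow's actual C# code):

```python
# Distinct-character count: the same quantity as string.Distinct().Count()
# in the question's C# snippet. Junk posts score very low.
def unique_character_count(s: str) -> int:
    return len(set(s))

print(unique_character_count("~~~~~~No.~~~~~~"))   # 4 distinct characters
print(unique_character_count("FUUUUUUUU------"))   # 3 distinct characters
print(unique_character_count("How do I reverse a linked list in C#?"))
```

Even an ordinary one-sentence question comfortably clears both junk examples, which is why a simple threshold on this count is a usable first-pass signal for human reviewers.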

面如桃花 2024-10-25 18:00:37


String 'aaaaaaaaaaaaaaaaaaaaaaaaaaa' has very low entropy, and is rather meaningless.

String 'blah blah blah blah blah blah blah blah' has a bit higher entropy, but is still rather silly and can be a part of an attack.

A post or a comment with entropy comparable to these strings is probably not appropriate; it cannot contain any meaningful message, not even a spam link. Such a post can simply be filtered out, or warrant an additional captcha.
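These comparisons can be made concrete with a bag-of-characters Shannon entropy, H = -Σ p·log₂(p) over the character frequencies. A Python sketch (the strings are just the examples above):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Per-character Shannon entropy, H = -sum(p * log2(p)),
    treating the string as a bag of characters."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

# One repeated symbol: 0 bits per character, completely predictable.
print(shannon_entropy("aaaaaaaaaaaaaaaaaaaaaaaaaaa"))
# Five distinct symbols, heavily repeated: about 2.3 bits per character.
print(shannon_entropy("blah blah blah blah blah blah blah blah"))
```

The second value is higher but still well below what a varied English sentence produces, which is the intuition behind using entropy as a quality signal.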

夏有森光若流苏 2024-10-25 18:00:37


Let's look at the Wikipedia entry on Entropy (information theory):

In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message...

And specifically with English information:

The entropy rate of English text is between 1.0 and 1.5 bits per letter, or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on human experiments.

In other words, it's not simply that low entropy is bad and high entropy is good, or vice versa - there is an optimal entropy range.

桃扇骨 2024-10-25 18:00:37


The Shannon entropy H(P) is a property of the probability distribution P of a random variable X.

A rudimentary way of treating a string is as a bag of characters. In that case, the character frequency counts give an approximation of the probability distribution P of a randomly chosen character from the string.

If we simply count the number of unique characters in a string, that count corresponds to the entropy of a uniform distribution over the unique characters appearing in the string: the greater the number of unique characters, the greater the entropy.

However, Jeff Atwood's (and BlueRaja's) subsequent code contributions are better measures, as they take into account the other possible distributions that a string, still thought of as a bag of (not necessarily unique) characters, can represent.

Building on Rex M's answer ... it would make more sense to look for strings where the 'character entropy' fell outside the 1.0 - 1.5 range, as possible 'low quality strings.'
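One caveat when implementing this: Shannon's 1.0-1.5 bits per letter is an entropy *rate* that accounts for context between letters, while a bag-of-characters estimate of English text typically comes out closer to 4 bits per character, so a practical filter would need its own thresholds. A hypothetical Python sketch (the LOW/HIGH band is an illustrative assumption, not a value from the question):

```python
import math
from collections import Counter

# Illustrative thresholds only (assumptions, not Stack Overflow's values).
# A bag-of-characters estimate for English runs near 4 bits/char; Shannon's
# 1.0-1.5 bits/letter is an entropy rate and is not directly comparable.
LOW, HIGH = 2.0, 5.0

def character_entropy(s: str) -> float:
    """Bag-of-characters Shannon entropy, in bits per character."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def looks_low_quality(s: str) -> bool:
    """Flag strings whose entropy falls outside an assumed
    'normal English' band, for human review."""
    return not (LOW <= character_entropy(s) <= HIGH)

print(looks_low_quality("~~~~~~No.~~~~~~"))                       # flagged
print(looks_low_quality("This is an ordinary English sentence.")) # not flagged
```

As the first answer notes, a flag like this would only queue a post for human review, not block it.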

〆凄凉。 2024-10-25 18:00:37


Not Exactly an answer to your question but, Wikipedia has this explanation of Entropy:

Entropy is a measure of disorder, or more precisely unpredictability.
For example, a series of coin tosses with a fair coin has maximum entropy,
since there is no way to predict what will come next. A string of coin
tosses with a two-headed coin has zero entropy, since the coin will always
come up heads. Most collections of data in the real world lie somewhere
in between.

English text has fairly low entropy. In other words, it is fairly predictable.
Even if we don't know exactly what is going to come next, we can be fairly
certain that, for example, there will be many more e's than z's, or that
the combination 'qu' will be much more common than any other combination
with a 'q' in it and the combination 'th' will be more common than any
of them. Uncompressed, English text has about one bit of entropy for
each byte (eight bits) of message.
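That last figure (about one bit of entropy per byte) can be loosely sanity-checked with a general-purpose compressor, since a compressor cannot squeeze output below the entropy of its input. A Python sketch using zlib; the repetitive sample text is an illustrative assumption, so the resulting ratio reflects that repetition as much as English itself:

```python
import zlib

# Repetitive English-like text compresses far below its raw size, consistent
# with English carrying much less than 8 bits of entropy per byte. (zlib
# won't reach Shannon's ~1 bit/letter estimate, but it shows the gap.)
text = ("Even if we do not know exactly what is going to come next, "
        "we can be fairly certain that there will be many more e's than z's. ") * 20
raw = text.encode("utf-8")
compressed = zlib.compress(raw, level=9)
bits_per_byte = 8 * len(compressed) / len(raw)
print(f"{bits_per_byte:.2f} bits per byte after compression")
```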
