Detecting random keyboard mashing, considering QWERTY keyboard layout
The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY
keyboard layout".
Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh
Is there any software that does this already (preferably free and open source) ?
If not, is there an active FOSS project whose goal is to achieve this?
If not, how would you suggest to implement such a software?
Comments (5)
If two letters of a bigram in the analyzed text are close in QWERTY terms but the bigram has near-zero statistical frequency in English (like the pairs "fg" or "cd"), then there is a chance that random keyboard hits are involved. If more such pairs are found, the chance increases greatly.
If you want to take into account the use of both hands for bashing, then test letters that are separated by another letter for QWERTY closeness, but test two bigrams (or even the trigram) for frequency. For example, in the text "flsjf" you would check F and S for QWERTY distance, but the bigrams FL and LS (or the trigram FLS) for frequency.
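A minimal sketch of this heuristic in Python, assuming a hand-picked list of rare-but-QWERTY-adjacent bigrams and horizontal-only adjacency; a real implementation would use a full English bigram-frequency table and a proper key-distance map:

```python
# Flag text whose letter pairs are QWERTY-adjacent yet statistically rare
# in English. The adjacency map (horizontal neighbors only, a
# simplification) and the rare-bigram list are illustrative assumptions.

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def qwerty_neighbors(ch):
    """Letters horizontally adjacent to ch on a QWERTY keyboard."""
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return set(row[max(0, i - 1):i] + row[i + 1:i + 2])
    return set()

# Tiny stand-in for a real English bigram-frequency table: pairs that are
# close on the keyboard yet almost never occur inside English words.
RARE_ADJACENT_BIGRAMS = {"fg", "gf", "cd", "dc", "sd", "jk", "kj", "oi", "uy"}

def mash_score(text):
    """Fraction of letter bigrams that are QWERTY-adjacent and rare in English."""
    letters = [c for c in text.lower() if c.isalpha()]
    pairs = list(zip(letters, letters[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs
               if b in qwerty_neighbors(a) and a + b in RARE_ADJACENT_BIGRAMS)
    return hits / len(pairs)
```

On the example from the question, mash_score picks up repeated pairs like "oi", while ordinary English text scores zero against this small rare-bigram list.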
Consider the empirical distribution of two-letter sequences, i.e. "the probability of having letter a given that it follows letter b"; all these probabilities fill a table of size 27x27 (treating space as a letter).
Now compare this with historical data from a bunch of English/French/whatever texts. Use the Kullback–Leibler divergence for the comparison.
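A sketch of this comparison; for simplicity it uses the joint distribution over letter pairs rather than the conditional one, with add-constant smoothing so the divergence stays finite (the function names and smoothing constant are illustrative):

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27 symbols, space counted as a letter

def bigram_distribution(text, smoothing=1e-6):
    """Empirical distribution over the 27x27 letter-pair table, smoothed
    so that no cell is exactly zero (keeps the KL divergence finite)."""
    cleaned = "".join(c if c in ALPHABET else " " for c in text.lower())
    counts = Counter(zip(cleaned, cleaned[1:]))
    total = sum(counts.values()) + smoothing * len(ALPHABET) ** 2
    return {(a, b): (counts.get((a, b), 0) + smoothing) / total
            for a in ALPHABET for b in ALPHABET}

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)
```

In use, q would be built from a large reference corpus and p from the suspect text; a high divergence suggests the text does not follow the language's bigram statistics.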
Most keyboard mashing tends to be on the home row, in my experience. It would be reasonably simple to check whether a high proportion of the characters used are "asdfjkl;".
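This check is a few lines of Python (the function name and the space-skipping choice are my own):

```python
HOME_ROW = set("asdfghjkl;")  # QWERTY home-row keys

def home_row_fraction(text):
    """Fraction of non-space characters that sit on the QWERTY home row."""
    chars = [c for c in text.lower() if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in HOME_ROW for c in chars) / len(chars)
```

A threshold on this fraction (tuned on real data) would then separate home-row mashing from normal text.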
Taking an approach based on keyboard layout will provide a good indicator. With a QWERTY layout you will find that around 52% of the letters in any given text come from the top line of keyboard characters, about 32% from the middle line, and 14% from the bottom line. While this varies slightly from one language to another, there remains a very clear pattern which can be detected. Use the same methodology to discover patterns in other keyboard layouts, then make sure you detect the layout used before checking any entered text for gibberish. Even though the pattern is clear, it is best to use this method as one indicator only, since it works best on longer passages. Other indicators, such as non-alphanumeric characters mixed with alphanumeric ones, text length, etc., can, when weighted together, provide a pretty good overall indication of gibberish entry.
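A sketch of this indicator, taking the row percentages quoted above as the expected English profile (the deviation measure and names are my own assumptions, and any threshold would need tuning):

```python
ROWS = {
    "top": set("qwertyuiop"),
    "middle": set("asdfghjkl"),
    "bottom": set("zxcvbnm"),
}

# Expected shares for English text on QWERTY, as claimed in the answer above.
EXPECTED = {"top": 0.52, "middle": 0.32, "bottom": 0.14}

def row_distribution(text):
    """Share of letters falling on each QWERTY row."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return {row: 0.0 for row in ROWS}
    return {row: sum(c in keys for c in letters) / len(letters)
            for row, keys in ROWS.items()}

def row_deviation(text):
    """Total absolute deviation from the expected English row shares;
    larger values suggest gibberish (only meaningful for longer texts)."""
    dist = row_distribution(text)
    return sum(abs(dist[row] - EXPECTED[row]) for row in ROWS)
```

Home-row mashing concentrates everything on the middle row, so its deviation is far larger than that of ordinary English text.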
Fredley's answer can be extended to a grammar that would construct words from nearby letters.
For example, "asasasasasdf" could be generated with a grammar that connects "as", "sa", "sd" and "df". Such a grammar, expanded to all letters on the keyboard (connecting letters that are next to each other), could, after parsing, give you a measure of how much of a text can be generated with this 'gibberish' grammar.
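The grammar idea can be approximated without a full parser: a word is derivable by the 'gibberish' grammar exactly when every consecutive letter pair is keyboard-adjacent. A sketch, restricted to same-row horizontal adjacency for brevity (diagonal adjacencies across rows are ignored):

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def adjacent(a, b):
    """True if a and b sit next to each other in the same QWERTY row."""
    for row in QWERTY_ROWS:
        i, j = row.find(a), row.find(b)
        if i != -1 and j != -1 and abs(i - j) == 1:
            return True
    return False

def generable(word):
    """True if the 'gibberish' grammar can derive word, i.e. every
    consecutive letter pair is keyboard-adjacent."""
    w = word.lower()
    return len(w) > 1 and all(adjacent(a, b) for a, b in zip(w, w[1:]))

def gibberish_coverage(text):
    """Fraction of words the gibberish grammar can generate."""
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        return 0.0
    return sum(generable(w) for w in words) / len(words)
```

On the example above, "asasasasasdf" is fully derivable, while a normal English word like "hello" is not.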
Caveat: of course, any text discussing such a grammar and listing examples of 'gibberish' text would score significantly higher than a regular spell-checked text.
Do note that the example approach would not catch vandalism in the form of 'h4x0r rulezzzzz!!!!!'.
Another approach here (which can be integrated with the above method) would be to statistically analyze a corpus of vandalized text and try to get common words in vandalized texts.
EDIT:
Since you are assuming QWERTY, I guess we could assume English, too?
What about KISS: run the text through an English spell checker and, if it fails miserably, conclude that it is probably gibberish. (The question is, why would you want to distinguish quickly typed gibberish from random nonsense, or, for that matter, from very badly spelled text?)
Alternatively, if other keyboard layouts (Dvorak, anyone?) and languages are to be considered, then maybe run the text through all available language spell checkers and proceed from there (this would give you language auto-detection, too).
This would not be a very efficient method, but it could be used as a baseline test.
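A baseline sketch of this KISS approach. The tiny word list here is a stand-in: a real implementation would load a full dictionary (e.g. /usr/share/dict/words) or call a spell-checking library, and the 0.8 threshold is an arbitrary assumption:

```python
# Stand-in dictionary; replace with a real word list in practice.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
              "a", "is", "this", "text"}

def misspelled_fraction(text, dictionary=DICTIONARY):
    """Share of words that the dictionary lookup rejects."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w not in dictionary for w in words) / len(words)

def looks_like_gibberish(text, threshold=0.8):
    """Conclude gibberish if the spell check 'fails miserably'."""
    return misspelled_fraction(text) >= threshold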
Note:
In the long run I imagine that vandals would adapt and start vandalizing with, for example, excerpts from other Wikipedia pages, which would ultimately be hard to detect automatically as vandalism. (OK, existing texts could be checksummed and a flag raised on duplicates, but if the text came from some other source that would ultimately be hard.)