Markov chains are an (almost standard) way to generate random gibberish which looks intelligent to the untrained eye. How would you go about identifying Markov-generated text from human-written text?
It would be awesome if the resources you point to are Python friendly.
One simple approach would be to have a large group of humans read the input text for you and see if the text makes sense. I'm only half-joking; this is a tricky problem.
I believe this to be a hard problem, because Markov-chain-generated text is going to have a lot of the same properties as real human text in terms of word frequency and simple relationships between the ordering of words.
The differences between real text and text generated by a Markov chain are in higher-level rules of grammar and in semantic meaning, which are hard to encode programmatically. The other problem is that Markov chains are good enough at generating text that they sometimes come up with grammatically and semantically correct statements.

As an example, here's an aphorism from the kantmachine:
Today, he would feel convinced that the human will is free; to-morrow, considering the indissoluble chain of nature, he would look on freedom as a mere illusion and declare nature to be all-in-all.
While this string was written by a computer program, it's hard to say that a human would never say this.
I think that unless you can give us more specific details about the computer- and human-generated text that expose more obvious differences, it will be difficult to solve this using computer programming.
You could use a "brute force" approach, whereby you compare the generated language to data collected on n-grams of higher order than the Markov model that generated it.
i.e., if the language was generated with a 2nd-order Markov model, up to 3-grams are going to have the correct frequencies, but 4-grams probably won't.
You can get up to 5-gram frequencies from Google's public n-gram dataset. It's huge though - 24G compressed - and you need to get it by post on DVDs from the LDC.
EDIT: Added some implementation details
The n-grams have already been counted, so you just need to store the counts (or frequencies) in a way that's quick to search. A properly indexed database, or perhaps a Lucene index should work.
Given a piece of text, scan across it and look up the frequency of each 5-gram in your database, and see where it ranks compared to other 5-grams that start with the same 4 words.
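As a rough sketch of that lookup step, assuming the reference counts have already been loaded into an in-memory dict `fivegram_counts` mapping 5-gram tuples to counts (a stand-in for the indexed database; the names here are purely illustrative):

```python
from collections import defaultdict

def rank_fivegrams(tokens, fivegram_counts):
    """For each 5-gram in `tokens`, report its reference count and where that
    count ranks among all reference 5-grams sharing the same 4-word prefix
    (rank 1 = the most frequent continuation of that prefix)."""
    # Index the reference counts by their 4-word prefix once, up front.
    by_prefix = defaultdict(list)
    for gram, count in fivegram_counts.items():
        by_prefix[gram[:4]].append(count)

    results = []
    for i in range(len(tokens) - 4):
        gram = tuple(tokens[i:i + 5])
        count = fivegram_counts.get(gram, 0)
        siblings = by_prefix.get(gram[:4], [])
        rank = 1 + sum(1 for c in siblings if c > count)
        results.append((gram, count, rank, len(siblings)))
    return results
```

A text full of 5-grams that are absent from the reference data, or that rank poorly among continuations of their own 4-word prefix, is a good candidate for having come from a lower-order generator.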
Practically, a bigger obstacle might be the licensing terms of the dataset. Using it for a commercial app might be prohibited.
I suggest a generalization of Evan's answer: make a Markov model of your own and train it with a big chunk of the (very large) sample you're given, reserving the rest of the sample as "test data". Now, see how well the model you've trained does on the test data, e.g. with a chi-square test that will suggest situations in which the "fit is TOO good" (suggesting the test data is indeed generated by this model) as well as ones in which the fit is very bad (suggesting an error in model structure -- an over-trained model with the wrong structure does a notoriously bad job in such cases).
Of course there are still many issues for calibration, such as the structure of the model -- are you suspecting a simple model based on N-tuples of words and little more, or a more sophisticated one with grammar states and the like? Fortunately you can calibrate things pretty well by using large corpora of known-to-be-natural text and also ones you generate yourself with models of various structures.
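A minimal sketch of that idea, assuming a simple first-order word-bigram model and a hand-rolled Pearson chi-square statistic (the function names are just for illustration):

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """First-order (word-bigram) Markov model: next-word counts per word."""
    model = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        model[cur][nxt] += 1
    return model

def chi_square_fit(model, test_tokens):
    """Pearson chi-square statistic comparing observed next-word counts in the
    held-out text against the counts the trained model predicts, pooled over
    all contexts the model knows. Returns (statistic, degrees_of_freedom)."""
    observed = defaultdict(Counter)
    for cur, nxt in zip(test_tokens, test_tokens[1:]):
        observed[cur][nxt] += 1

    stat, dof = 0.0, 0
    for cur, obs in observed.items():
        if cur not in model:
            continue  # context never seen in training; ignored in this sketch
        row = model[cur]
        row_total = sum(row.values())
        # Only score continuations the model gives nonzero probability to.
        n = sum(obs.get(w, 0) for w in row)
        if n == 0:
            continue
        for w, c in row.items():
            expected = n * c / row_total
            stat += (obs.get(w, 0) - expected) ** 2 / expected
            dof += 1
        dof -= 1  # one constraint per context: its counts sum to n
    return stat, max(dof, 1)
```

Dividing the statistic by the degrees of freedom gives a rough yardstick: values far below 1 are the "fit is TOO good" case, values far above 1 suggest the test text was not produced by a model of this structure.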
A different approach is to use nltk to parse the sentences you're given -- a small number of mis-parses is to be expected even in natural text (as humans are imperfect and so is the parser -- it may not know that word X can be used as a verb and only classify it as a noun, etc, etc), but most Markov models (unless they're modeling essentially the same grammar structure your parser happens to be using -- and you can use several parsers to try and counteract that!-) will cause VASTLY more mis-parses than even dyslexic humans. Again, calibrate that on natural vs synthetic texts, and you'll see what I mean!-)
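To make the parsing idea concrete, here's a hedged sketch using nltk with a deliberately tiny toy grammar; a real run would need a broad-coverage grammar or an off-the-shelf parser, and only the relative mis-parse rates between known-natural and suspected-synthetic text are meaningful:

```python
import nltk

# A deliberately tiny toy grammar for illustration only.
TOY_GRAMMAR = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP | 'I'
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat' | 'park'
V -> 'saw' | 'walked'
P -> 'in'
""")

def misparse_rate(sentences, grammar=TOY_GRAMMAR):
    """Fraction of tokenized sentences that yield no parse at all. The absolute
    number means little; the point is to compare it between known-natural text
    and suspected-synthetic text parsed with the same grammar."""
    parser = nltk.ChartParser(grammar)
    failures = 0
    for tokens in sentences:
        try:
            if not list(parser.parse(tokens)):
                failures += 1
        except ValueError:  # a token the grammar doesn't cover at all
            failures += 1
    return failures / max(len(sentences), 1)

# e.g. misparse_rate([['the', 'dog', 'saw', 'a', 'cat'],
#                     ['cat', 'the', 'in', 'walked']])
```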
If you had several large Markov-generated texts, you could possibly determine that they were machine-generated by comparing the word frequencies across the samples. Since Markov chains depend on constant word probabilities, the proportions of any given word should be roughly equal from sample to sample.
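One way to put a number on "roughly equal proportions" is a simple total-variation distance between word-frequency profiles; the helper names below are illustrative:

```python
from collections import Counter

def word_freq_profile(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def profile_distance(tokens_a, tokens_b):
    """Total-variation distance between two word-frequency profiles
    (0 = identical proportions, 1 = completely disjoint vocabularies)."""
    pa, pb = word_freq_profile(tokens_a), word_freq_profile(tokens_b)
    words = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(w, 0) - pb.get(w, 0)) for w in words)
```

Samples drawn from the same Markov chain should sit unusually close together by this measure compared to independent pieces of human writing of similar length.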
You could write a program which generates Markovian transition probabilities from any sequence of symbols and then calculates the entropy rate of the Markov matrix (see http://en.wikipedia.org/wiki/Entropy_rate#Entropy_rates_for_Markov_chains). This is basically an estimate of how easily the text could be predicted using just the Markov chain (higher entropy means harder to predict). Therefore I would think that the lower the entropy of the Markov matrix is, the more likely it is that the sample of text is controlled by a Markov matrix. If you have questions on how to write this code, I happen to have a Python program on my computer which does exactly this, so I can help you out.
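That program isn't shown here, but a minimal sketch of the calculation it describes might look like this, fitting a first-order chain to the token sequence and using the empirical state frequencies as a stand-in for the stationary distribution:

```python
import math
from collections import Counter, defaultdict

def entropy_rate(tokens):
    """Fit a first-order Markov chain to `tokens` and return its entropy rate
    H = -sum_i mu_i sum_j P_ij log2(P_ij), in bits per symbol, using the
    empirical frequency of each state in place of the true stationary
    distribution."""
    transitions = defaultdict(Counter)
    state_counts = Counter()
    for cur, nxt in zip(tokens, tokens[1:]):
        transitions[cur][nxt] += 1
        state_counts[cur] += 1

    total = sum(state_counts.values())
    rate = 0.0
    for cur, nexts in transitions.items():
        mu = state_counts[cur] / total        # empirical weight of state i
        row_total = sum(nexts.values())
        for count in nexts.values():
            p = count / row_total             # transition probability P_ij
            rate -= mu * p * math.log2(p)
    return rate
```

To call a sample synthetic you'd still want to compare its entropy rate against values measured on known human writing of similar length and vocabulary.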
Crowdsourcing. Use Mechanical Turk and get a number of humans to vote on this. There are even some libraries to help you pull this off.

There's also a blog post from O'Reilly Radar with tips on using Mechanical Turk to get your work done.