Absolutely it is possible; indeed, the record of success in identifying an author from a text, or from some portion of it, is impressive.
A couple of representative studies (warning: links are to pdf files):
Quantitative Analysis of Literary Styles
Stylogenetics: Clustering-based stylistic analysis of literary corpora
To aid your web-search, this discipline is often called Stylometry (and occasionally, Stylogenetics).
So the two most important questions are, I suppose: which classifiers are useful for this purpose, and what data is fed to the classifier?
What I still find surprising is how little data is required to achieve very accurate classification. Often the data is just a word frequency list. (A directory of word frequency lists is available online here.)
For instance, one data set widely used in machine learning, and available from a number of places on the Web, consists of texts by four authors: Shakespeare, Jane Austen, Jack London, and Milton. These works were divided into 872 pieces (corresponding roughly to chapters), or about 220 substantial pieces of text for each of the four authors; each piece becomes a single data point in the data set. Next, a word-frequency scan was performed on each text, and the 70 most common words were kept for the study; the rest of the frequency-scan results were discarded. Here are the first 20 of that 70-word list.
Each data point, then, is just the count of each of those 70 words in one of the 872 chapters.
Each of these data points is one instance of the author's literary fingerprint.
The final item in each data point is an integer (1-4) identifying which of the four authors wrote that text.
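As a rough illustration of the layout described above, here is a minimal sketch of how such data points could be built. The function names, the tokenizer, and the toy texts are all hypothetical, and it assumes the most common words are ranked across the whole corpus; this is not the code used to produce the actual data set.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_dataset(pieces, vocab_size=70):
    """pieces: list of (text, author_id) pairs, one per chapter-sized piece.

    Returns (vocab, rows). Each row holds the count of every vocabulary
    word in one piece, followed by the integer author label as the final item.
    """
    token_lists = [tokenize(text) for text, _ in pieces]

    # Frequency scan over the whole corpus (an assumption); keep only
    # the most common words and discard the rest.
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    vocab = [word for word, _ in totals.most_common(vocab_size)]

    rows = []
    for tokens, (_, author_id) in zip(token_lists, pieces):
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[w] for w in vocab] + [author_id])
    return vocab, rows

# Toy stand-in texts, not the real corpus; vocab_size is shrunk to 3.
pieces = [
    ("the sea and the ship and the storm", 1),
    ("a truth of the heart and of the mind", 2),
]
vocab, rows = build_dataset(pieces, vocab_size=3)
print(vocab)
print(rows)
```

With the real corpus you would pass the 872 pieces and leave `vocab_size` at 70, giving 872 rows of 70 counts plus a label.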
Recently, I ran this dataset through a simple unsupervised ML algorithm, and the results were very good: almost complete separation of the four classes. You can see them in my answer to a previous Stack Overflow question, which was about text classification with ML in general rather than author identification.
So what other algorithms are used? Apparently, most supervised machine learning algorithms can successfully resolve this kind of data. Among these, multi-layer perceptrons (MLPs, a kind of neural network) are often used (Author Attribution Using Neural Networks is one such frequently cited study).
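To give a feel for the supervised setting, here is a small sketch using a nearest-centroid classifier. Note that this is a deliberately simpler stand-in, not an MLP and not the method of the cited study; the function names and the toy counts are made up for illustration. Counts are converted to relative frequencies so that chapter length does not dominate the distance.

```python
import math
from collections import defaultdict

def to_frequencies(counts):
    """Turn raw word counts into relative frequencies."""
    total = sum(counts) or 1
    return [c / total for c in counts]

def fit_centroids(rows):
    """rows: lists of word counts with the author label as the final item.

    Returns one mean frequency vector (centroid) per author.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[-1]].append(to_frequencies(row[:-1]))
    return {
        author: [sum(col) / len(vectors) for col in zip(*vectors)]
        for author, vectors in grouped.items()
    }

def predict(centroids, counts):
    """Assign the text to the author whose centroid is nearest."""
    vector = to_frequencies(counts)
    return min(centroids, key=lambda a: math.dist(vector, centroids[a]))

# Toy training rows: three word counts, then the author label (hypothetical).
train = [
    [9, 1, 2, 1],
    [8, 2, 1, 1],
    [2, 7, 3, 2],
    [1, 9, 2, 2],
]
centroids = fit_centroids(train)
print(predict(centroids, [18, 3, 3]))
print(predict(centroids, [2, 15, 5]))
```

On real data you would hold out some of the 872 pieces for testing rather than scoring on the texts used to fit the centroids.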
You might start with a visit to the Apache Mahout web site. There is a vast literature on classification and clustering. Essentially, you want to run a clustering algorithm and then hope that the clusters it finds correspond to 'which writer'.
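The "cluster, then hope" idea can be sketched in a few lines. This is plain k-means written from scratch with a deterministic farthest-point start, not Mahout's implementation; the points and author labels are invented toy values. The author labels are never shown to the clustering step; they are used only afterwards to measure how well the clusters line up with the writers.

```python
import math
from collections import Counter

def kmeans(points, k, iterations=20):
    """Plain k-means; returns a cluster index for every point."""
    # Deterministic farthest-point initialisation: start from the first
    # point, then repeatedly add the point farthest from the chosen centres.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        assignment = [
            min(range(k), key=lambda i: math.dist(p, centres[i]))
            for p in points
        ]
        # Move each centre to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def purity(assignment, authors):
    """Fraction of points that share their cluster's majority author."""
    correct = 0
    for cluster in set(assignment):
        labels = [a for a, c in zip(authors, assignment) if c == cluster]
        correct += Counter(labels).most_common(1)[0][1]
    return correct / len(authors)

# Toy word-frequency vectors for two hypothetical writers.
points = [[0.75, 0.08], [0.73, 0.18], [0.70, 0.12],
          [0.17, 0.58], [0.08, 0.75], [0.12, 0.66]]
authors = [1, 1, 1, 2, 2, 2]

assignment = kmeans(points, k=2)
print(assignment, purity(assignment, authors))
```

A purity near 1.0 means 'which writer' really does determine the clusters; a low value means the algorithm grouped the texts by something else.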