我查看了描述的智能网络算法(第55页)一个有趣的算法 - 称为 DocRank - 用于为商业文档(即没有链接(如 PDF、MS Word 文档等)。简而言之,它分析集合中每个文档之间的术语频率交集。
其他人是否可以识别其他地方描述的有趣算法,或者想要在这里分享一些新颖的算法,以应用于这些类型的文档以改进搜索结果?
请放弃涉及点击跟踪或其他不有关分析实际文档的操作的答案。
I've looked at Algorithms of the Intelligent Web that describes (page 55) an interesting algorithm - called DocRank - for creating a PageRank like score for business documents (i.e. documents without links like PDF, MS Word documents, etc...). In short it analyzes term frequency intersection between each document in a collection.
Can anyone else identify interesting algorithms described elsewhere, or wants to share something novel here, to apply against these types of documents to improve search results?
Please forgo answers involving things like click tracking or other actions NOT about analyzing the actual documents.
发布评论
评论(4)
第一种技术:逐步相似性
我可以提供一个例子——我实际上已经针对真实数据进行了测试/验证。如果您要收集多种技术并沿两个轴对它们进行排名 - 固有的复杂性或易于实施以及性能(分辨率或预测准确性),则该技术在第一个轴上会很高,并且在接近中间的位置第二;这是一种简单而有效的技术,但与最先进的技术相比可能表现不佳。
我们发现低频关键字交集与文档读者/查看者之间的相似性相结合是对文档的相当强的预测因素文档的内容。换句话说:如果两个文档具有一组相似的非常低频术语(例如,特定于领域的术语,如“决策流形”等)并且它们具有相似的入站流量概况,则该组合强烈证明相似性的文件。
相关详细信息:
第一个过滤器:低频术语。我们解析了大量文档以获得每个文档的术语频率。我们使用这个词频频谱作为“指纹”,这是常见的,但我们应用了逆加权,因此常见术语(“a”、“of”、“the”)在相似性度量中只占很少的数量,而罕见术语很重要(这很常见,您可能知道)。
试图以此为基础来确定两个文档是否相似是有问题的;例如,两个文档可能共享与 MMO 相关的罕见术语列表,但这些文档仍然不相似,因为一个文档针对玩 MMO,另一个文档针对设计它们。
第二个过滤器:读者。显然我们不知道谁阅读了这些文档,因此我们从流量来源推断读者人数。您可以在上面的示例中看到这有什么帮助。 MMO 播放器网站/文档的入站流量反映了内容,同样适用于针对 MMO 设计的文档。
第二种技术:核主成分分析 (kPCA)
kPCA 是无监督技术(在数据之前从数据中删除类标签)被传入)。该技术的核心只是基于特征向量的矩阵分解(在本例中为协方差矩阵)。该技术通过核技巧处理非线性,核技巧只是将数据映射到更高维的特征空间,然后在该空间中执行 PCA。在 Python/NumPy/SciPy 中大约有 25 行代码。
这些数据是从文学作品的非常简单的文本解析中收集的——特别是这四位作家的大部分已出版作品:莎士比亚、简·奥斯汀、杰克·伦敦、弥尔顿。 (我相信,尽管我不确定,普通大学生所参加的课程中,他们被分配阅读这些作者的小说。)
该数据集在机器学习中广泛使用,并且可以从网络上的许多地方获得。
所以这些作品被分为872篇(大致相当于小说的章节);换句话说,四位作者各有大约 220 篇不同的实质性文本。
接下来对合并的语料库文本进行词频扫描,并选择70个最常见单词进行研究,频率扫描的其余结果被丢弃。
这 70 个单词是:
这些成为字段(列)名称。最后,准备了对应于 872 个文本的一个数据行(来自截断的词频扫描)。以下是这些数据点之一:
总而言之,数据由 70 个维度组成(每个维度是这四位作者之一的给定文本中特定单词的频率或总数。
同样,尽管此数据主要用于对于监督分类(类标签的存在是有原因的),我使用的技术是无监督——换句话说,我从未向算法显示类标签。 kPCA 算法完全不知道这四个不同的簇(如下图所示)对应什么,也不知道每个簇之间有何不同——该算法甚至不知道我的数据由多少个组(类)组成。给它数据,它根据固有的顺序将其非常整齐地划分为四个不同的组:
同样,我在这里使用的算法是 kPCA。使用 Python、NumPy 和 Matplotlib,生成这些结果的脚本大约有 80 行代码——用于 IO、数据处理、应用 kPCA 和绘制结果。
不多,但对于一个 SO 帖子来说太多了。无论如何,任何想要此代码的人都可以从我的存储库中获取它。同时,每个 python 包中还有一个用 python + numpy 编码的完整的、记录良好的 kPCA 算法(全部可从 mloss.org):shogun(“大规模机器学习工具箱”),“sdpy (一组针对计算机视觉和机器学习的模块)和 mlpy(“PYthon 中的机器学习”)。
First Technique: step-wise similarity
I can offer one example--that i've actually tested/validated against real data. If you were to gather a number of techniques and rank them along two axes--inherent complexity or ease of implementation AND performance (resolution, or predictive accuracy), this technique would be high on the first axis, and somewhere near the middle on the second; a simple and effective technique, but which might underperform against state-of-the-art techniques.
We found that the combination of low-frequency keyword intersection combined with similarity among readers/viewers of a document, is a fairly strong predictor of the document's content. Put another way: if two documents have the a similar set of very low-frequency terms (e.g., domain-specific terms, like 'decision manifold', etc.) and they have similar inbound traffic profiles, that combination is strongly probative of similarity of the documents.
The relevant details:
first filter: low-frequency terms. We parsed a large set of documents to get the term frequency for each. We used this word frequency spectrum as a 'fingerprint', which is common, but we applied an inverse weighting, so that the common terms ('a', 'of', 'the') counted very little in the similarity measure, whereas the rare terms counted a lot (this is quite common, as you probably know).
Trying to determine whether two documents were similar based on this along was problematic; for instance, two documents might share a list of rare terms relating to MMOs, and still the documents weren't similar because one is directed to playing MMOs and the other to designing them.
second filter: readers. Obviously we don't know who had read these documents, so we inferred readership from traffic sources. You can see how that helped in the example above. The inbound traffic for the MMO player site/document reflected the content, likewise for the document directed to MMO design.
Second Technique: kernel Principal Component Analysis (kPCA)
kPCA is unsupervised technique (the class labels are removed from the data before the data is passed in). At the heart of the technique is just an eigenvector-based decomposition of a matrix (in this case a covariance matrix). This technique handles non-linearity via the kernel trick, which just maps the data to a higher dimensional features space then performs the PCA in that space. In Python/NumPy/SciPy it is about 25 lines of code.
The data is collected from very simple text parsing of literary works--in particular, most of the published works of these four authors: Shakespeare, Jane Austen, Jack London, Milton. (I believe, though i'm not certain, that normal college students take course in which they are assigned to read novels by these authors.)
This dataset is widely used in ML and is available from a number of places on the Web.
So these works were divided into 872 pieces (corresponding roughly to chapters in novels); in other words, about 220 different substantial pieces of text for each of the four authors.
Next a word-frequency scan was performed on the combined corpus text, and the 70 most common words were selected for the study, the remainder of the results of the frequency scan were discarded.
Those 70 words were:
These became the field (column) names. Finally, one data row corresponding to the 872 texts was prepared (from the truncated word frequency scan). Here's one of those data points:
In sum, the data is comprised of 70 dimensions (each dimension is the frequency or total count of a particular word, in a given text of one of these four authors.
Again, although this data is primarily used for supervised classification (the class labels are there for a reason), the technique i used was unsupervised--put another way, i never showed the class labels to the algorithm. The kPCA algorithm has absolutely no idea what those four different clusters (shown in the plot below) correspond to nor how each differs from the other--the algorithm did not even know how many groups (classes) that data was comprised of. I just gave it the data, and it partitioned it very neatly into four distinct groups based on an inherent ordering.
The results:
Again, the algorithm i used here is kPCA. Using Python, NumPy, and Matplotlib, the script that produced these results is about 80 lines of code--for the IO, data processing, applying kPCA, and plotting the result.
Not much, but too much for an SO post. In any event, anyone who wants this code can get it from my repo. At the same time, there is also a complete, well-documented kPCA algorithm coded in python + numpy in each of these python packages (all available from mloss.org): shogun ('A Large Scale Machine Learning Toolbox'), 'sdpy (a set of modules directed to computer vision and machine learning), and mlpy ('Machine Learning in PYthon').
另一个有趣的算法 - TextRank - 听起来与 DocRank 非常相似原始问题中引用。基于图、无监督、迭代直至收敛。
Java 实现。
Another interesting algorithm - TextRank - sounds very similar to DocRank referenced in original question. Graph based, unsupervised, iterate until convergence.
Java implementation.
我对此主题做了一些额外的研究,并找到了 Okapi BM25 算法的维基百科条目。它还有一个考虑文档结构的后继 BM25F,但这似乎与 HTML/XML 更相关。
BM25 包含:
的长度最后,维基百科条目链接到 Lucene 实现。
与上面 @Doug 的答案相比,这似乎是一个更复杂的算法来实现。
I've done some additional research on the topic and found the Wikipedia entry for the Okapi BM25 algorithm. It also has a successor BM25F that takes document structure into account, but this appears to be more relevant to HTML/XML.
BM25 Incorporates:
Finally, the Wikipedia entry links to a Lucene implementation.
Compared to @Doug's answers above this appears to be a more complex algorithm to implement.
这里有一些排名算法,尽管我还没有看到任何实现。
Here are some algorithms for ranking, though I haven't seen any implementations yet.