LSA - Latent Semantic Analysis - How do I code it in PHP?
I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.
Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to choose?
I don't want to use any external libraries. I already have an implementation of the Singular Value Decomposition (SVD).
- Extract all words from the given text.
- Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
- Build up a matrix: the columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weights (see the sketch after this list).
- Do the Singular Value Decomposition (SVD).
- Use the values in the matrix S (SVD) to do the dimension reduction (how?).
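For steps 1–3, a minimal PHP sketch could look like the following (naive tokenization on non-alphanumeric characters, no stemming; the helper names `tokenize`, `buildTermDocumentMatrix`, and `tfIdf` are illustrative, not from any library):

```php
<?php
// Tokenize a text into lowercase word tokens (very naive: split on non-alphanumerics).
function tokenize(string $text): array {
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Build a term-document matrix of raw counts: rows = unique words, columns = documents.
function buildTermDocumentMatrix(array $documents): array {
    $vocab  = [];  // word => row index
    $perDoc = [];  // per-document word counts
    foreach ($documents as $doc) {
        $counts = array_count_values(tokenize($doc));
        foreach ($counts as $word => $n) {
            if (!isset($vocab[$word])) {
                $vocab[$word] = count($vocab);
            }
        }
        $perDoc[] = $counts;
    }
    $matrix = array_fill(0, count($vocab), array_fill(0, count($documents), 0.0));
    foreach ($perDoc as $j => $counts) {
        foreach ($counts as $word => $n) {
            $matrix[$vocab[$word]][$j] = (float) $n;
        }
    }
    return [$matrix, $vocab];
}

// Optional: re-weight the raw counts with tf-idf, idf = log(N / document frequency).
function tfIdf(array $matrix): array {
    $numDocs = count($matrix[0] ?? []);
    foreach ($matrix as $i => $row) {
        $df  = count(array_filter($row));      // documents containing this word
        $idf = log($numDocs / max(1, $df));
        foreach ($row as $j => $count) {
            $matrix[$i][$j] = $count * $idf;
        }
    }
    return $matrix;
}
```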
I hope you can help me. Thank you very much in advance!
Comments (4)
LSA links:
Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.
Assumptions:
M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tfidf or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tfidf).
Then the dimensionality reduction.... the actual LSA paper suggests that a good approximation for the basis is to keep enough vectors so that their singular values add up to more than 50% of the total of the singular values.
More succinctly... (pseudocode)
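A rough PHP rendering of that heuristic (assuming `$sigma` holds the singular values sorted largest first; `chooseRank` is just an illustrative name):

```php
<?php
// Keep the smallest number of dimensions whose singular values sum to
// more than 50% of the total of the singular values.
function chooseRank(array $sigma): int {
    $total = array_sum($sigma);
    $kept  = 0.0;
    foreach ($sigma as $i => $value) {   // $sigma is sorted in descending order
        $kept += $value;
        if ($kept > 0.5 * $total) {
            return $i + 1;               // number of dimensions (ii) to keep
        }
    }
    return count($sigma);
}
```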
This will return the rank of the new basis, which was min(d,w) before, and we'll now approximate with {ii}.
(here, ' -> prime, not transpose)
We create new matrices: U',Sigma', V', with sizes w x ii, ii x ii, and ii x d.
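A sketch of that truncation step (assuming your SVD returns `$U` as a w x r array of rows, `$sigma` as a length-r array, and `$Vt` as an r x d array of rows, so that M ≈ U * diag(sigma) * Vt; the function name is made up):

```php
<?php
// Keep only the first $ii singular directions:
// U' is w x ii, Sigma' is ii x ii (stored here as a length-ii vector), V' is ii x d.
function truncateSvd(array $U, array $sigma, array $Vt, int $ii): array {
    $Uprime = [];
    foreach ($U as $row) {
        $Uprime[] = array_slice($row, 0, $ii);  // first ii columns of U
    }
    $sigmaPrime = array_slice($sigma, 0, $ii);  // first ii singular values
    $VtPrime    = array_slice($Vt, 0, $ii);     // first ii rows of V^T
    return [$Uprime, $sigmaPrime, $VtPrime];
}
```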
That's the essence of the LSA algorithm.
This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than a simple tf-idf is a matter of some debate.
To me, LSA performs poorly on real-world data sets because of polysemy and data sets with too many topics. Its mathematical/probabilistic basis is unsound (it assumes normal-ish (Gaussian) distributions, which don't make sense for word counts).
Your mileage will definitely vary.
Tagging using LSA (one method!)
Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic
By hand, look over the U' matrix, and come up with terms that describe each "topic". For example, if the biggest parts of that vector were "Bronx, Yankees, Manhattan", then "New York City" might be a good term for it. Keep these in an associative array or list. This step should be reasonable since the number of vectors will be finite.
Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then give their "topics" as computed in the previous step.
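One way to sketch that last step in PHP (assuming `$Uprime` is the w x ii matrix from above and `$v1` is the document's length-w word vector in the same word order; some formulations also scale the result by the inverse of Sigma'):

```php
<?php
// Project a document's word vector into topic space (scores = v1 . U')
// and return the indices of the strongest topics.
function topTopics(array $v1, array $Uprime, int $top = 3): array {
    $ii = count($Uprime[0]);
    $scores = array_fill(0, $ii, 0.0);
    foreach ($Uprime as $w => $row) {
        foreach ($row as $k => $value) {
            $scores[$k] += $v1[$w] * $value;
        }
    }
    arsort($scores);                               // strongest topics first
    return array_slice(array_keys($scores), 0, $top);
}
```

The returned topic indices can then be looked up in the hand-built topic list from the previous step.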
This answer isn't directly to the poster's question, but to the meta question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the line of autotagging. If they really mean NER, then this response is hogwash :)
Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:
By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of hand-tagged items, even if it's only 50 or 100, will be a good basis for comparing whatever the autogenerated methods below get you.
Dimensionality reduction, using LSA, topic models (Latent Dirichlet Allocation), and the like.... I've had really poor luck with LSA on real-world data sets and I'm unsatisfied with its statistical basis. I find LDA much better; it has an incredible mailing list with the best thinking on how to assign topics to texts.
Simple heuristics... if you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out all the common words (stop words) and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence, and see where that gets you. If the texts are all in English, then do part-of-speech analysis on the whole shebang, and see what that gets you. With structured items, like news reports, LSA and other order-independent methods (tf-idf) throw out a lot of information.
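A crude sketch of that heuristic in PHP (no POS tagger, so it just keeps the most frequent non-stop-words from the first two sentences; `quickTags` and the regexes are my own, not from any library):

```php
<?php
// First-sentence heuristic: take the first two sentences, drop stop words,
// and return the 3 most frequent remaining words as candidate tags.
function quickTags(string $item, array $stopWords, int $top = 3): array {
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($item));
    $lead   = implode(' ', array_slice($sentences, 0, 2));
    $words  = preg_split('/[^a-z0-9]+/', strtolower($lead), -1, PREG_SPLIT_NO_EMPTY);
    $words  = array_diff($words, $stopWords);
    $counts = array_count_values($words);
    arsort($counts);
    return array_slice(array_keys($counts), 0, $top);
}

// Example:
// quickTags('Yankees beat Red Sox in the Bronx. Fans celebrate in Manhattan.', ['the', 'in']);
```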
Good luck!
(if you like this answer, maybe retag the question to fit it)
That all looks right, up to the last step. The usual notation for SVD is that it returns three matrices, A = USV*. S is a diagonal matrix (meaning all zeros off the diagonal) that, in this case, basically gives a measure of how much of the original data each dimension captures. The numbers ("singular values") will decrease, and you can look for a drop-off to decide how many dimensions are useful. Otherwise, you'll just want to choose an arbitrary number N for how many dimensions to take.
Here I get a little fuzzy. The coordinates of the terms (words) in the reduced-dimension space are in either U or V, I think depending on whether they are in the rows or columns of the input matrix. Offhand, I think the coordinates for the words will be the rows of U, i.e. the first row of U corresponds to the first row of the input matrix, i.e. the first word. Then you just take the first N columns of that row as the word's coordinates in the reduced space.
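In PHP terms, that amounts to something like this (assuming `$U` is an array of rows in the same word order as the input matrix; the helper name is illustrative):

```php
<?php
// Reduced-space coordinates of word $wordIndex: the first $n entries of its row in U.
function wordCoordinates(array $U, int $wordIndex, int $n): array {
    return array_slice($U[$wordIndex], 0, $n);
}
```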
HTH
Update:
This process so far doesn't tell you exactly how to pick out tags. I've never heard of anyone using LSI to choose tags (a machine learning algorithm might be more suited to the task, like, say, decision trees). LSI tells you whether two words are similar. That's a long way from assigning tags.
There are two tasks: (a) what is the set of tags to use? and (b) how do you choose the best three tags? I don't have much of a sense of how LSI is going to help you answer (a). You can choose the set of tags by hand. But if you're using LSI, the tags probably should be words that occur in the documents. Then for (b), you want to pick out the tags that are closest to words found in the document. You could experiment with a few ways of implementing that. Choose the three tags that are closest to any word in the document, where closeness is measured by the cosine similarity (see Wikipedia) between the tag's coordinate (its row in U) and the word's coordinate (its row in U).
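A sketch of (b) in PHP (the function names are mine; `$tagRows` and `$wordRows` are assumed to hold reduced coordinate vectors, i.e. truncated rows of U, keyed by tag and word respectively):

```php
<?php
// Cosine similarity between two coordinate vectors (e.g. two truncated rows of U).
function cosineSimilarity(array $a, array $b): float {
    $dot = 0.0; $na = 0.0; $nb = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
        $na  += $x * $x;
        $nb  += $b[$i] * $b[$i];
    }
    return ($na > 0 && $nb > 0) ? $dot / (sqrt($na) * sqrt($nb)) : 0.0;
}

// Pick the 3 tags whose coordinates are closest to any word occurring in the document.
function bestTags(array $tagRows, array $wordRows, int $top = 3): array {
    $scores = [];
    foreach ($tagRows as $tag => $tagVec) {
        $best = -1.0;
        foreach ($wordRows as $wordVec) {
            $best = max($best, cosineSimilarity($tagVec, $wordVec));
        }
        $scores[$tag] = $best;
    }
    arsort($scores);
    return array_keys(array_slice($scores, 0, $top, true));
}
```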
There is an additional SO thread on the perils of doing this all in PHP at link text.
Specifically, there is a link there to this paper on Latent Semantic Mapping, which describes how to get the resultant "topics" for a text.