What is the role of Latent Semantic Analysis in developing a search engine?
I am trying to develop a music-focused search engine for my final year project. I have been doing some research on Latent Semantic Analysis and how it works on the Internet. I am having trouble understanding where LSI sits exactly in the whole system of a search engine.
Should it be used after a web crawler has finished looking up web pages?
I don't know much about music retrieval, but in text retrieval, LSA is only relevant if the search engine is making use of the vector space model of information retrieval. Most common search engines, such as those built on Lucene, break each document up into words (tokens), remove stop words, and put the rest of them into the index, each usually associated with a term weight indicating the importance of the term within the document.
Now the list of (token, weight) pairs can be viewed as a vector representing the document. If you combine all of these vectors into a huge matrix and apply the LSA algorithm to that (after crawling and tokenising, but before indexing), you can use the result of the LSA algorithm to transform the vectors of all documents before indexing them.
Note that in the original vectors, the tokens represented the dimensions of the vector space. LSA will give you a new set of dimensions, and you'll have to index those (e.g. in the form of auto-generated integers) instead of the tokens.
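A minimal sketch of the pipeline described above, using scikit-learn as one possible implementation (an assumption; the documents and component count are made up for illustration). TF-IDF gives the (token, weight) vectors, and truncated SVD is the standard way LSA is computed in practice:

```python
# Sketch only: scikit-learn's TfidfVectorizer + TruncatedSVD stand in
# for "tokenise, weight, then apply LSA before indexing".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus (hypothetical stand-in for crawled pages).
docs = [
    "the quick brown fox",
    "jazz and blues records",
    "classical music concert recordings",
    "rock music and jazz music",
]

# Tokenise, remove stop words, and weight terms: one row per document,
# one column per token -- these are the original (token, weight) vectors.
vectorizer = TfidfVectorizer(stop_words="english")
term_doc = vectorizer.fit_transform(docs)

# Apply LSA (truncated SVD) to map the token dimensions onto a smaller
# set of latent dimensions.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(term_doc)

# doc_vectors still has one row per document, but only 2 columns now.
# It is these latent dimensions (e.g. keyed by auto-generated integers)
# that would go into the index, instead of the tokens.
print(doc_vectors.shape)  # (4, 2)
```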
Furthermore, you will have to transform the query into a vector of (token,weight) pairs, too, and then apply the LSA-based transformation to that vector as well.
I am unsure if anybody actually does all of this in any real-life text retrieval engine. One problem is that performing the LSA algorithm on the matrix of all document vectors consumes a lot of time and memory. Another problem is handling updates, i.e. when a new document is added, or an existing one changes. Ideally, you'd recompute the matrix, re-run LSA, and then modify all existing document vectors and re-generate the entire index. Not exactly scalable.