在 Ruby 中使用 LSA 转换从文档集中发现同义词

发布于 2024-10-27 04:27:33 字数 1694 浏览 10 评论 0原文

将 LSA 变换应用于文档数组后,如何使用它来生成同义词?例如,我有以下示例文档:

D1 = Mobilization
D2 = 反光路面
D3 = 交通维护
D4 = 特别绕行
D5 = 车道商业材料

            D1    D2    D3    D4    D5    
commerci[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
 special[ +0.00 +0.00 +0.00 +1.00 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +1.00 +0.00 ]  
 mainten[ +0.00 +0.00 +1.00 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +1.00 +0.00 +0.00 ]  
 reflect[ +0.00 +1.00 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +1.00 +0.00 +0.00 +0.00 ]  
  mobil [ +1.00 +0.00 +0.00 +0.00 +0.00 ]  

应用 TFIDF 变换

            D1    D2    D3    D4    D5  
commerci[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
 special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
 mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
  mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]  

应用 LSA 变换

            D1    D2    D3    D4    D5  
commerci[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
 special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
 mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
  mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]  

After applying the LSA transform to a document array, how can this be used to generate synonyms? For instance, I have the following sample documents:

D1 = Mobilization
D2 = Reflective Pavement
D3 = Maintenance of Traffic
D4 = Special Detour
D5 = Commercial Materials for Driveway

            D1    D2    D3    D4    D5    
commerci[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
 special[ +0.00 +0.00 +0.00 +1.00 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +1.00 +0.00 ]  
 mainten[ +0.00 +0.00 +1.00 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +1.00 +0.00 +0.00 ]  
 reflect[ +0.00 +1.00 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +1.00 +0.00 +0.00 +0.00 ]  
  mobil [ +1.00 +0.00 +0.00 +0.00 +0.00 ]  

Applying TFIDF transform

            D1    D2    D3    D4    D5  
commerci[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
 special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
 mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
  mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]  

Applying LSA transform

            D1    D2    D3    D4    D5  
commerci[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
 special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
 mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
  mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]  

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

若相惜即相离 2024-11-03 04:27:33

首先,这个例子行不通。其背后的原理是,单词在相似上下文中出现的频率越高,它们在含义上的相关性就越强。因此,输入文档之间需要有一些重叠。段落长度的文档是理想的(因为它们具有合理的字数,并且每个段落往往有一个主题)。

要了解 LSA 如何用于同义词识别,您首先需要了解单词出现的向量空间表示(您获得的第一个矩阵)如何用于同义词识别。这是因为您可以计算这个高维向量空间中两个项目之间的距离作为它们相似性的度量(假设它是它们一起出现的频率的度量)。 LSA 的神奇之处在于,它重新调整了向量空间的维度,因此,不会同时出现但出现在相似上下文中的项目会通过相似维度的相互折叠而聚集在一起。

TFIDF 加权函数的想法是通过为在语料库的较小子集中出现较多的单词赋予较高的权重,并为到处使用的单词赋予较低的权重来突出文档之间的差异。 更彻底的解释。

“LSA”变换实际上是一个奇异的变换-值分解 (SVD) – 传统上潜在语义分析或潜在语义索引是指 TFIDF 与 SVD 的组合 – 它用于减少向量空间的维度,或者换句话说,它将列数减少为更小的,更简洁的描述(如上所述)。

因此,要了解问题的核心:您可以通过对两个相应的向量(行)应用距离函数来判断单词的相似程度。有多种距离函数可供选择,最常用的是 余弦距离 (测量两个向量之间的角度)。

希望这能让事情变得更清楚。

Firstly, this example won't work. The principle behind it is that the more frequently words occur in similar contexts, the more related they are in meaning. Therefore there needs to be some overlap between the input documents. Paragraph length documents are ideal (since they have a reasonable number of words and there tends to be a single topic per paragraph).

To understand how LSA is useful for synonym recognition, you need to first understand how a vector space representation (the first matrix you've got there) of words occurrences is useful for synonym recognition in the first place. This is because you can calculate the distance between two items in this high dimensionality vector space as a measure of their similarity (given that it is a measure of how often they occur together). The Magic of LSA is that it reshuffles the dimensions of the vector space, so that items that don't occur together but occur in similar contexts are brought together by a collapsing of similar dimensions into each other.

The idea of the TFIDF weighting function is to highlight the differences between documents, by giving higher weightings to words that appear more in a smaller subset of of the corpus, and lower weightings to words that are used everywhere. A more thorough explanation.

The "LSA" transformation is actually a singular-value decomposition (SVD) – conventionally Latent Semantic Analysis or Latent Semantic Indexing refers to the combination TFIDF with SVD – and it serves to reduce the dimensionally of the vector space, or in other words, it reduces the number of columns into a smaller, more concise description (as described above).

So to get the the nub of your question: you can tell how similar to words are by applying a distance function to the two corresponding vectors (rows). There are several distance functions to choose from by the most commonly used is the cosine distance (which measures the angle between the two vectors).

Hope this makes things clearer.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文