Text retrieval using R

Posted 2024-09-29 23:43:55 · 82 words · 5 views · 0 comments

I have been using R's text mining package (tm), and it's really a great tool. However, I have not found any retrieval support; perhaps there is functionality I am missing.
How can a simple vector space model (VSM) be implemented using R's text mining package?

Comments (2)

鼻尖触碰 2024-10-06 23:43:55
# Sample R commands in support of my previous answer
library(fortunes)
library(tm)
# Use the first ten fortunes as the document collection
sentences <- unlist(lapply(1:10, function(i) as.character(fortune(i)$quote)))
# VectorSource avoids the column-name requirements of DataframeSource in recent tm versions
dsc <- VCorpus(VectorSource(sentences))
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = TRUE))
# Terms() returns the vocabulary (older tm versions exposed this as Dictionary())
dictC <- Terms(dtm)
# The query below is created from words in fortune(1) and fortune(2)
newQry <- "lets stand up and be counted seems to work undocumented"
newQryC <- VCorpus(VectorSource(newQry))
# Map the query onto the collection's dictionary so both matrices share the same columns
dtmNewQry <- DocumentTermMatrix(newQryC, control = list(weighting = weightTf, stopwords = TRUE, dictionary = dictC))
# Keep only the dictionary terms that actually occur in the query
qryCounts <- as.matrix(dtmNewQry)[1, ]
dictQry <- names(qryCounts)[qryCounts != 0]
# Below does a naive similarity (number of features in common)
apply(dtm, 1, function(x, y = dictQry) length(intersect(names(x)[x != 0], y)))
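A raw overlap count tends to favor long documents, simply because they contain more terms. One possible way to normalize it is the Jaccard coefficient (shared terms divided by all distinct terms). The sketch below is base R only; `jaccard`, `doc_terms`, and `qry_terms` are hypothetical names and toy data for illustration, not part of tm:

```r
# Jaccard similarity between two sets of terms:
# size of the intersection divided by size of the union
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  u <- length(union(a, b))
  if (u == 0) return(0)  # two empty sets: define the similarity as 0
  length(intersect(a, b)) / u
}

# Toy term sets standing in for the non-zero terms of each document
doc_terms <- list(
  d1 = c("stand", "counted", "lets"),
  d2 = c("undocumented", "work", "seems", "feature"),
  d3 = c("regression", "model")
)
qry_terms <- c("lets", "stand", "counted", "seems", "work", "undocumented")

# Score every document against the query, best match first
scores <- vapply(doc_terms, jaccard, numeric(1), b = qry_terms)
sort(scores, decreasing = TRUE)
```

With a real corpus, the per-document term sets could be taken from the non-zero columns of each row of the document-term matrix.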
请爱~陌生人 2024-10-06 23:43:55

Assuming VSM = Vector Space Model, you can go about a simple retrieval system in the following manner:

  • Create a Document Term Matrix of your collection/corpus
  • Create a function for your similarity measure (Jaccard, Euclidean, etc.). Several packages already provide these functions; RSiteSearch should help in finding them.
  • Convert your query to a Document Term Matrix with a single row, mapped using the same dictionary as in the first step.
  • Compute the similarity between the query and each row of the matrix from the first step.
  • Rank the results and choose the top n.
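
The steps above can be sketched end to end in base R. This is only an illustration under stated assumptions: it uses cosine similarity (a common choice for vector space models, though not one named in this answer), raw term frequencies, and a toy collection. `tokenize`, `to_vector`, `cosine`, and the sample documents are all hypothetical names and data:

```r
# Toy collection; in practice these would be your own documents
docs <- c(
  d1 = "the cat sat on the mat",
  d2 = "the dog chased the cat",
  d3 = "stock markets fell sharply today"
)
query <- "cat on a mat"

# Lowercase and split on anything that is not a letter
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

# Step 1: document-term matrix of raw term frequencies
doc_tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(doc_tokens)))
to_vector <- function(tokens) {
  as.numeric(table(factor(tokens, levels = vocab)))
}
dtm <- t(vapply(doc_tokens, to_vector, numeric(length(vocab))))
colnames(dtm) <- vocab

# Step 2: similarity measure (cosine of the angle between two vectors)
cosine <- function(x, y) {
  denom <- sqrt(sum(x^2)) * sqrt(sum(y^2))
  if (denom == 0) return(0)
  sum(x * y) / denom
}

# Step 3: map the query onto the same vocabulary (unknown words are dropped)
q_vec <- to_vector(tokenize(query))

# Step 4: similarity of the query to every document
scores <- apply(dtm, 1, cosine, y = q_vec)

# Step 5: rank the results and keep the top n
top_n <- 2
ranked <- head(sort(scores, decreasing = TRUE), top_n)
ranked
```

Swapping in a different measure only requires replacing `cosine`; with tm, the matrix from `DocumentTermMatrix` (converted with `as.matrix`) could take the place of the hand-built `dtm`.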

A non-R method is to use a GIN index on a text column (rows are documents) of a table in PostgreSQL. Using the tsvector full-text querying methods, you can have a very fast retrieval system.
