Including short tokens in a tm DocumentTermMatrix

Posted 2025-01-03 12:12:38 · 698 characters · 2 views · 0 comments

EDIT: This was an issue with objects in the workspace conflicting and causing unexpected behavior.

I am trying to create a DocumentTermMatrix from a document using the following code. The document contains many 1- and 2-character tokens. However, even with the minimum word length set to 1 character, the resulting matrix contains 699 documents and 0 terms.

library(tm)
data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header=FALSE)
data <- data[-1]

training_data <- as.vector(apply(as.matrix(data, mode="character"),1,paste,collapse=" "))
corpus <- Corpus(VectorSource(training_data))

matrix <- DocumentTermMatrix(corpus,control=list(wordLengths=c(1,Inf)))

Can anyone shed some light on why no terms are created despite there being many 1- and 2-character tokens in the data? Here is one sample data entry:

" 4  8  8  5  4 5 10  4  1 4"

Comments (1)

偷得浮生 2025-01-10 12:12:38

I ran exactly what you posted in the latest versions of R and tm on a Windows 7 machine, and it produced the result you were looking for (see below). I'd try clearing your workspace, exiting R, and/or rebooting.

> library(tm)
> data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header=FALSE)
> data <- data[-1]
> 
> training_data <- as.vector(apply(as.matrix(data, mode="character"),1,paste,collapse=" "))
> corpus <- Corpus(VectorSource(training_data))
> 
> matrix <- DocumentTermMatrix(corpus,control=list(wordLengths=c(1,Inf)))
> matrix
A document-term matrix (699 documents, 11 terms)

Non-/sparse entries: 2899/4790
Sparsity           : 62%
Maximal term length: 2 
Weighting          : term frequency (tf)
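
For the "clearing your workspace" step, here is a minimal sketch in base R. It is only an illustration: the helper masked_base_names() and the scratch environment are hypothetical names introduced here, and the sketch does not identify which object caused the problem in the question. It lists objects in an environment whose names also exist in base R, then removes everything from that environment.

```r
# Hypothetical helper (not part of tm or base R): return the names in `env`
# that also exist in base R, i.e. candidates for accidental masking.
masked_base_names <- function(env = globalenv()) {
  intersect(ls(env), ls(baseenv()))
}

# Demonstrate on a scratch environment so the real workspace is not touched.
scratch <- new.env()
assign("paste", function(...) "", envir = scratch)      # shadows base::paste
assign("training_data", character(0), envir = scratch)  # harmless name

# Report the names that collide with base R.
print(masked_base_names(scratch))

# Clear the scratch environment, the equivalent of rm(list = ls())
# for the global workspace.
rm(list = ls(scratch), envir = scratch)
print(ls(scratch))
```

In an interactive session, calling masked_base_names() with its default argument inspects the global workspace, and rm(list = ls()) clears it. Note that rm() does not unload attached packages, so restarting R, as suggested above, is the more thorough option.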