Text files and text mining in R... how to load the data

Posted 2024-12-12 09:13:46

I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words.

I don't understand the documentation on how to load a text file and to create the necessary objects to start using features such as....

stemDocument(x, language = map_IETF(Language(x)))

So assume that this is my doc "this is a test for R load"

How do I load the data for text processing and to create the object x?
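
A minimal sketch of one way to get from that string to an object that stemDocument() accepts, assuming a reasonably recent tm plus the SnowballC stemmer (when the language is passed directly, the map_IETF() call from the docs is not needed):

library(tm)
library(SnowballC) # supplies the stemmer behind stemDocument

x <- Corpus(VectorSource("this is a test for R load")) # one-document corpus
x <- tm_map(x, stemDocument, language = "english")     # stem every document
inspect(x)                                             # view the stemmed result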

6 Answers

万水千山粽是情ミ 2024-12-19 09:13:46

Like @richiemorrisroe, I found this poorly documented. Here's how I get my text in for use with the tm package and make a document-term matrix:

library(tm)        # load the text mining library
library(SnowballC) # needed by stemDocument below
setwd('F:/My Documents/My texts') # set R's working directory to near where my files are
a <- Corpus(DirSource("/My Documents/My texts"), readerControl = list(language = "lat")) # the exact folder where my text file(s) are, for analysis with tm
summary(a) # check what went in
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, content_transformer(tolower)) # newer tm versions need content_transformer() around base functions like tolower
a <- tm_map(a, removeWords, stopwords("english")) # stopword list shipped with tm, e.g. at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
a <- tm_map(a, stemDocument, language = "english")
adtm <- DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75) # drop terms missing from more than 75% of documents

In this case you don't need to specify the exact file name. So long as it's the only file in the directory referred to in the Corpus(DirSource(...)) line, the tm functions will use it. I do it this way because I never had any success specifying the file name there.
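
If you do want to point tm at a single named file, DirSource() has a pattern argument (a regular expression applied to the directory listing); a sketch, with mytext.txt standing in for your own file name:

library(tm)

# only files matching the pattern are picked up from the directory
src <- DirSource("F:/My Documents/My texts", pattern = "^mytext\\.txt$")
a <- Corpus(src, readerControl = list(language = "en"))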

If anyone can suggest how to get text into the lda package I'd be most grateful. I haven't been able to work that out at all.
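
For what it's worth, a sketch of one possible route, untested here: the lda package's lexicalize() helper builds its documents/vocab format from raw strings (the toy data and hyperparameters below are placeholders), and topicmodels::LDA() also accepts a DocumentTermMatrix like adtm directly.

library(lda)

# lexicalize() converts a character vector of documents into the
# list-of-matrices format the lda samplers expect, plus a vocabulary
docs <- c("this is a test for R load", "another tiny document") # toy stand-in data
lex <- lexicalize(docs, lower = TRUE)

set.seed(1)
fit <- lda.collapsed.gibbs.sampler(lex$documents, K = 2, vocab = lex$vocab,
                                   num.iterations = 25, alpha = 0.1, eta = 0.1)
top.topic.words(fit$topics, 5) # most probable words per topic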

零時差 2024-12-19 09:13:46

Can't you just use the function readPlain from the same library? Or you could just use the more common scan function.

mydoc.txt <- scan("./mydoc.txt", what = "character")
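
To carry on from scan() to a tm object, one sketch: scan() returns one element per whitespace-separated word, so the words are pasted back together before building a one-document corpus.

library(tm)

words <- scan("./mydoc.txt", what = "character") # one element per word
doc <- paste(words, collapse = " ")              # reassemble the document
x <- Corpus(VectorSource(doc))                   # corpus containing one document
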
向地狱狂奔 2024-12-19 09:13:46

I actually found this quite tricky to begin with, so here's a more comprehensive explanation.

First, you need to set up a source for your text documents. I found that the easiest way (especially if you plan on adding more documents) is to create a directory source that will read all of your files in.

source <- DirSource("yourdirectoryname/") #input path for documents
YourCorpus <- Corpus(source, readerControl=list(reader=readPlain)) #load in documents

You can then apply the stemDocument function to your corpus via tm_map, sketched below. HTH.
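
A sketch of that last step, assuming the SnowballC package (which supplies the stemmer behind stemDocument) is installed:

library(SnowballC)

YourCorpus <- tm_map(YourCorpus, stemDocument, language = "english")
inspect(YourCorpus) # check the stemmed documents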

水波映月 2024-12-19 09:13:46

I believe what you want to do is read an individual file into a corpus and then have it treat the different rows in the text file as different observations.

See if this gives you what you want:

text <- read.delim("this is a test for R load.txt", sep = "\t")
text_corpus <- Corpus(VectorSource(text), readerControl = list(language = "en"))

This assumes that the file "this is a test for R load.txt" contains a single column holding the text data.

Here the "text_corpus" is the object that you are looking for.

Hope this helps.
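
To verify what ended up in the corpus, a quick check along these lines:

length(text_corpus)     # number of documents, one per row read in
inspect(text_corpus[1]) # look at the first document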

抠脚大汉 2024-12-19 09:13:46

Here's my solution for a text file with one line per observation. The latest vignette on tm (Feb 2017) gives more detail.

text <- read.delim(textFileName, header = FALSE, sep = "\n", stringsAsFactors = FALSE)
colnames(text) <- c("MyCol")
docs <- text$MyCol
a <- VCorpus(VectorSource(docs))
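
Base R's readLines() reaches the same place with one less step, since it already returns one string per line (a sketch):

docs <- readLines(textFileName)  # one element per line/observation
a <- VCorpus(VectorSource(docs)) # one corpus document per line
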
窝囊感情。 2024-12-19 09:13:46

The following assumes you have a directory of text files from which you want to create a bag of words.

The only change that needs to be made is to replace

path = "C:\\windows\\path\\to\\text\\files\\"

with your own directory path.

library(tidyverse)
library(tidytext)

# create a data frame listing all files to be analyzed
all_txts <- list.files(path = "C:\\windows\\path\\to\\text\\files\\",  # path can be relative or absolute
                       pattern = "\\.txt$",  # escaped dot: select only files ending in .txt
                       full.names = TRUE)    # return the file path as well as the name

# create a data frame with one word per line
my_corpus <- map_dfr(all_txts, ~ tibble(txt = read_file(.x)) %>%   # read in each file in list
                      mutate(filename = basename(.x)) %>%   # add the file name as a new column
                      unnest_tokens(word, txt))   # split each word out as a separate row

# count the total # of rows/words in your corpus
my_corpus %>%
  summarize(number_rows = n())

# group and count by "filename" field and sort descending
my_corpus %>%
  group_by(filename) %>%
  summarize(number_rows = n()) %>%
  arrange(desc(number_rows))

# remove stop words
my_corpus2 <- my_corpus %>%
  anti_join(stop_words)

# repeat the count after stop words are removed
my_corpus2 %>%
  group_by(filename) %>%
  summarize(number_rows = n()) %>%
  arrange(desc(number_rows))
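
If you need a classic document-term matrix from here (for tm or topicmodels, say), tidytext can cast the tidy counts back out; a sketch:

# count each word per file, then cast the counts to a DocumentTermMatrix
adtm <- my_corpus2 %>%
  count(filename, word) %>%
  cast_dtm(filename, word, n)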