Text files and text mining in R: how to load the data
I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words.
I don't understand the documentation on how to load a text file and create the necessary objects to start using features such as
stemDocument(x, language = map_IETF(Language(x)))
So assume that this is my doc "this is a test for R load"
How do I load the data for text processing and to create the object x?
Like @richiemorrisroe I found this poorly documented. Here's how I get my text in to use with the tm package and make the document term matrix:
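A minimal sketch of that workflow; the path on line 3 is a placeholder, and the cleaning steps are optional choices of mine rather than anything this answer prescribes:

```r
library(tm)
# line 3: point tm at a directory; every plain-text file in it is read in
corpus <- Corpus(DirSource("C:/path/to/your/text/files"))
# optional normalisation before building the matrix
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
# rows = documents, columns = terms
dtm <- DocumentTermMatrix(corpus)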
In this case you don't need to specify the exact file name. So long as it's the only one in the directory referred to in line 3, it will be used by the tm functions. I do it this way because I have not had any success in specifying the file name in line 3.
If anyone can suggest how to get text into the lda package I'd be most grateful. I haven't been able to work that out at all.
Can't you just use the readPlain function from the same library? Or you could just use the more common scan function.
I actually found this quite tricky to begin with, so here's a more comprehensive explanation.
First, you need to set up a source for your text documents. I found that the easiest way (especially if you plan on adding more documents) is to create a directory source that will read all of your files in.
You can then apply the stemDocument function to your Corpus. HTH.
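Those two steps might look like this; the directory path is a placeholder, and stemming assumes the SnowballC package is installed:

```r
library(tm)  # stemDocument relies on the SnowballC package
# a directory source reads every file in the folder (placeholder path)
docs <- Corpus(DirSource("C:/path/to/docs"))
# stem every document in the corpus
docs <- tm_map(docs, stemDocument, language = "english")
```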
I believe what you wanted to do was read an individual file into a corpus and then make it treat the different rows in the text file as different observations.
See if this gives you what you want:
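One way to sketch that, using the filename from the question and assuming the file sits in the working directory:

```r
library(tm)
# each line of the file becomes one document ("observation") in the corpus
text <- readLines("this is a test for R load.txt")
text_corpus <- Corpus(VectorSource(text))
```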
This is assuming that the file "this is a test for R load.txt" has only one column which has the text data.
Here the "text_corpus" is the object that you are looking for.
Hope this helps.
Here's my solution for a text file with a line per observation. The latest vignette on tm (February 2017) gives more detail.
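A sketch of the line-per-observation case; the filename is hypothetical:

```r
library(tm)
# one line per observation (hypothetical filename)
obs <- readLines("my_data.txt")
corpus <- VCorpus(VectorSource(obs))
dtm <- DocumentTermMatrix(corpus)
```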
The following assumes you have a directory of text files from which you want to create a bag of words.
The only change that needs to be made is to replace
path = "C:\\windows\\path\\to\\text\\files\\"
with your directory path.
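Such a directory-to-bag-of-words run might be sketched as follows; the path is the placeholder above, and the lowercasing and stop-word steps are my own choices:

```r
library(tm)
path <- "C:\\windows\\path\\to\\text\\files\\"  # replace with your directory
corpus <- VCorpus(DirSource(path))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
# bag of words: total count of each term across all documents
bag <- colSums(as.matrix(dtm))
```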