R text mining: counting how many times specific words appear in a corpus?
I have seen this question answered in other languages but not in R.
[Specifically for R text mining] I have a set of frequent phrases obtained from a corpus. Now I would like to search for the number of times these phrases appear in another corpus.
Is there a way to do this in the tm package (or another related package)?
For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple thousand sub-texts. I want to find out how many times each phrase in tags has appeared in CorpusB.
As always, I appreciate all your help!
Ain't perfect but this should get you started.
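(The code from this answer isn't preserved in this capture. As a bare-bones illustration of one way to get started, base R's gregexpr can count fixed-string phrase occurrences; all object names below are illustrative.)

```r
# Illustrative data: "tags" are phrases from CorpusA, docs_b stands in
# for the documents of CorpusB as a plain character vector
tags   <- c("quick brown", "lazy dog")
docs_b <- c("a lazy dog sleeps", "the lazy dog and the quick brown fox")

# gregexpr() returns match start positions per text, or -1 when there
# is no match, so counting positive positions counts occurrences
count_hits <- function(phrase, texts) {
  m <- gregexpr(phrase, texts, fixed = TRUE)
  sum(vapply(m, function(x) sum(x > 0), numeric(1)))
}

vapply(tags, count_hits, numeric(1), texts = docs_b)
#> quick brown    lazy dog
#>           1           2
```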
If I understand correctly, here's how the tm package could be used for this. Some reproducible data...
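(The original data-setup code isn't shown in this capture; a minimal sketch, assuming two toy corpora built from character vectors. Names like corpus_a and docs_b are illustrative.)

```r
library(tm)

# Two toy corpora: corpus_a is where the frequent terms ("tags") come
# from, corpus_b is the corpus we want to count them in
docs_a <- c("the quick brown fox jumps over the lazy dog",
            "the quick brown fox is quick")
docs_b <- c("a lazy dog sleeps all day",
            "the fox was quick and the dog was lazy",
            "nothing to see here")

corpus_a <- Corpus(VectorSource(docs_a))
corpus_b <- Corpus(VectorSource(docs_b))
```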
Now remove stopwords, numbers, punctuation, etc.
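(A sketch of this cleaning step using standard tm transformations:)

```r
# Apply the same preprocessing to both corpora; content_transformer()
# wraps plain functions such as tolower for use with tm_map()
clean <- function(corp) {
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  tm_map(corp, stripWhitespace)
}

corpus_a <- clean(corpus_a)
corpus_b <- clean(corpus_b)
```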
Convert processed corpora to term document matrix:
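(For example:)

```r
tdm_a <- TermDocumentMatrix(corpus_a)
tdm_b <- TermDocumentMatrix(corpus_b)
```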
Get the most frequently occurring words in the first corpus:
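(One way to do this, taking the top terms by total count; the cutoff of five is arbitrary:)

```r
# Sum each term's counts across all documents of CorpusA and sort
freq_a <- sort(rowSums(as.matrix(tdm_a)), decreasing = TRUE)
tags   <- names(freq_a)[seq_len(min(5, length(freq_a)))]
```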
Here are the key lines that should do the trick. Find out how many times those tags occur in the other TDM:
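(A sketch of those key lines, assuming the tags are single terms so they can be looked up as rows of the term-document matrix; terms absent from CorpusB get a count of zero:)

```r
m_b <- as.matrix(tdm_b)

# For each tag, sum its row of the CorpusB term-document matrix
counts <- sapply(tags, function(tag)
  if (tag %in% rownames(m_b)) sum(m_b[tag, ]) else 0)

counts
```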
This is how I'd approach the problem now:
## Now the code:
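(The code from this answer isn't preserved in this capture. As a hedged sketch of a more recent approach, the quanteda package can count multi-word phrases directly: kwic() with phrase() finds every occurrence of a multi-token pattern. The package choice and all names here are assumptions, not the original author's code.)

```r
library(quanteda)

# CorpusB as tokens; tags are multi-word phrases from CorpusA
toks <- tokens(c("a lazy dog sleeps",
                 "the lazy dog and the quick brown fox"))
tags <- c("quick brown", "lazy dog")

# kwic() lists each occurrence of a pattern; phrase() makes a
# multi-word pattern match as a sequence of tokens
sapply(tags, function(tag) nrow(kwic(toks, pattern = phrase(tag))))
#> quick brown    lazy dog
#>           1           2
```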