Finding 2 & 3 word Phrases Using R TM Package
I am trying to find code that actually works to find the most frequently used two- and three-word phrases in the R text mining package (maybe there is another package for it that I do not know). I have been trying to use the tokenizer, but seem to have no luck.
If you worked on a similar situation in the past, could you post a code that is tested and actually works? Thank you so much!
7 Answers
You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the tau package installed it's fairly straightforward. Here, n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in the RTextTools package, which further simplifies things. This returns a DocumentTermMatrix class for use with the tm package.
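The original answer's code block was lost in extraction; a minimal sketch of the approach it describes, with placeholder sample documents, would look like this (tau's textcnt() with method = "string" counts word n-grams, and the names of its result are the phrases themselves):

```r
library(tm)
library(tau)

# Custom tokenizer: textcnt() counts n-word phrases; we keep only the
# phrase strings (the row names) so tm can use them as terms.
tokenize_ngrams <- function(x, n = 3) {
  rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))
}

# Placeholder corpus for illustration
corpus <- VCorpus(VectorSource(c("this is a test", "this is another test")))

dtm <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))
inspect(dtm)
```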
This is covered in part 5 of the FAQ of the tm package.
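That FAQ entry builds a bigram tokenizer from the words() and ngrams() helpers in the NLP package (loaded with tm); roughly:

```r
library(tm)   # attaches NLP, which provides words() and ngrams()

# Split a document into words, form 2-grams, and paste each back into a phrase
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

corpus <- VCorpus(VectorSource(c("I am trying to find frequent phrases")))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(tdm)
```

Changing the 2 to 3 (or to 2:3) gives trigrams as well.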
This is my own made-up creation for different purposes, but I think it may be applicable to your needs too.
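The answer's own code was not preserved here, so the following is only a base-R sketch of the same idea: split each document into words, slide a window of n words across them, and tabulate the resulting phrases (the function name top_ngrams is made up for this sketch):

```r
top_ngrams <- function(texts, n = 2, k = 10) {
  phrases <- unlist(lapply(texts, function(txt) {
    w <- unlist(strsplit(tolower(txt), "[^a-z']+"))
    w <- w[nzchar(w)]                      # drop empty tokens
    if (length(w) < n) return(character(0))
    # slide a window of n words and paste each window into one phrase
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  head(sort(table(phrases), decreasing = TRUE), k)
}

top_ngrams(c("the quick brown fox", "the quick red fox"), n = 2)
```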
The corpus library has a function called term_stats that does what you want. Here, count is the number of appearances, and support is the number of documents containing the term.
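A short sketch with placeholder text; term_stats() takes an ngrams argument to request multi-word phrases:

```r
library(corpus)

# Placeholder documents for illustration
text <- c("I love R text mining",
          "text mining in R is fun",
          "I love text mining")

# Request two- and three-word phrases; the result has term, count,
# and support columns, sorted by frequency.
stats <- term_stats(text, ngrams = 2:3)
head(stats)
```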
I had a similar problem using the tm and ngram packages. After debugging mclapply, I saw there were problems with documents of fewer than 2 words, which failed with an error, so I added a filter to remove documents with a low word count before tokenizing. My tokenize function is built on the ngram package; you can test the function on a sample string and then finally build the document-term matrix.
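The individual snippets from that answer (the filter, the tokenizer, the test, and the final step) were not preserved, so this is a sketch of the pieces assuming the ngram package's wordcount(), ngram(), and get.ngrams() functions:

```r
library(tm)
library(ngram)

# Placeholder corpus; the one-word document would break ngram()
corpus <- VCorpus(VectorSource(c("single",
                                 "two words here",
                                 "a longer test document")))

# Filter: drop documents with fewer than 2 words
corpus <- tm_filter(corpus, function(doc) wordcount(content(doc)) >= 2)

# Tokenizer built on ngram(); get.ngrams() returns the phrases as strings
bigram_tokenizer <- function(x) {
  get.ngrams(ngram(as.character(x), n = 2))
}

# Test the tokenizer on a plain string
bigram_tokenizer("this is a small test")

# And finally, build the matrix
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = bigram_tokenizer))
inspect(dtm)
```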
Try the tidytext package.

Suppose I have a dataframe CommentData that contains a comment column, and I want to find occurrences of two words together. The approach is to create tokens, remove stop words that don't help the analysis (e.g. the, an, to, etc.), and then count the occurrences of these words. You then use the unite function to combine the individual words and record their occurrence.
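A sketch of that pipeline with a hypothetical CommentData frame (only the comment column is assumed):

```r
library(dplyr)
library(tidytext)
library(tidyr)

# Hypothetical input data
CommentData <- data.frame(
  comment = c("great product and fast delivery",
              "fast delivery but poor packaging"),
  stringsAsFactors = FALSE
)

CommentData %>%
  # token = "ngrams", n = 2 produces overlapping two-word tokens
  unnest_tokens(bigram, comment, token = "ngrams", n = 2) %>%
  # split so each half can be checked against the stop-word list
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  # recombine the words into a single bigram column with its count
  unite(bigram, word1, word2, sep = " ")
```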
Try this code. The bigrams-or-trigrams section is the part you could reuse. Hope this helps.
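The full code was not preserved, so this is a sketch of the bigram/trigram part using RWeka's NGramTokenizer, where Weka_control(min = 2, max = 3) requests both phrase lengths at once (the sample documents are placeholders):

```r
library(tm)
library(RWeka)

docs <- VCorpus(VectorSource(c("text mining is fun",
                               "r text mining package")))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)

# Bigrams and trigrams from one tokenizer
BiTriTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))

tdm <- TermDocumentMatrix(docs, control = list(tokenize = BiTriTokenizer))

# Most frequent phrases across the corpus
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq)
```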