How do I use stemDocument in the R tm (text mining) package?
I am trying to stem a corpus using stemDocument from the R tm package, which calls Java.
I tried the example from the tm manual:
library(tm)
data("crude")
crude[[1]]
stemDocument(crude[[1]])
and get the following error:
Could not initialize the GenericProperitiesCreator. This exception was produced:
java.lang.NullPointerException
Any help appreciated. I know nothing about Java.
Thanks
Comments (3)
Good question, did you work it out?
I get the same error with only the code that you have. But if you follow the example from the start (i.e. from the heading "Transformations" on p. 1), creating a corpus and converting it to a plain text document, then you avoid the Java error. I guess the code example in the manual assumes you've already done those two steps.
That said, when I inspect the results, there's no actual stemming... I can't even get @user813966's simple example of stemDocument to do any stemming. I'm looking at the Rstem and Snowball packages instead.
In the meantime, the Python package NLTK is my stemming tool.
Update: I got the stemDocument function working by adding
language = "english"
to the call. So the complete answer to your question is to follow all the steps for getting your text into R as described for the tm package. You'll also need rJava to make stemDocument work (and, if you're on Windows, to set the JAVA_HOME environment variable to the directory containing the jre directory).
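For reference, here is a minimal sketch of passing the language explicitly. It assumes a reasonably recent tm, where as far as I know stemming goes through the SnowballC package rather than Java; the sample words are made up for illustration.

```r
library(tm)  # assumption: tm and its stemming backend (SnowballC) are installed

# Made-up sample words, just to see whether any stemming happens at all.
sample_words <- c("running", "stemming", "documents")
stemmed_words <- stemDocument(sample_words, language = "english")
print(stemmed_words)

# The same call on the first document of the example corpus from the question.
data("crude")
stemmed_doc <- stemDocument(crude[[1]], language = "english")

# Apply it to every document in the corpus.
stemmed_corpus <- tm_map(crude, stemDocument, language = "english")
```

Comparing as.character(crude[[1]]) with as.character(stemmed_doc) is a quick way to confirm that the text actually changed.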
I had the same error. I solved it by adding the Snowball .jar and the corresponding /words repository of stem words to my classpath:
C:\Users\xxx.xxx\Documents\R\win-library\2.12\Snowball\java
This was recommended here: http://weka.wikispaces.com/Stemmers
I still have the following error but it works fine now:
The Snowball stemmer (snowball.jar) cannot find the weka.jar file.
On your computer, you need to search for a file called weka.jar. On my Linux system, it is located in
Then, in your R code, add lines similar to these at the top:
This sets the Java CLASSPATH for the R session to the weka.jar file from above. Your existing classpath will be reset, though. If you had entries there that you still need, you can try adding them back.
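If you want to keep your existing entries, one option is to append to CLASSPATH instead of overwriting it. This is only a sketch: append_classpath is a helper name I made up, the jar path is a placeholder for wherever weka.jar lives on your machine, and it assumes the variable is set before the JVM is started (i.e. before loading rJava or Snowball).

```r
# Hypothetical helper: return `current` with `jar` appended, using the
# platform's path separator (";" on Windows, ":" elsewhere).
append_classpath <- function(jar, current = Sys.getenv("CLASSPATH")) {
  if (nzchar(current)) {
    paste(current, jar, sep = .Platform$path.sep)
  } else {
    jar
  }
}

weka_jar <- "/path/to/weka.jar"  # placeholder, replace with the real location
Sys.setenv(CLASSPATH = append_classpath(weka_jar))
Sys.getenv("CLASSPATH")  # check the result
```

Passing the current value as a default argument keeps the helper easy to check on its own, without touching the real environment.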