How do I use stemDocument from the R tm (text mining) package?

Posted 2024-12-07 07:09:05

I am trying to stem a corpus using stemDocument from the R tm package, which calls Java.
I tried the example from the tm manual:

library("tm")
data("crude")
crude[[1]]
stemDocument(crude[[1]])

and get the following error:

Could not initialize the GenericPropertiesCreator. This exception was produced:
java.lang.NullPointerException

Any help appreciated. I know nothing about Java.

Thanks

柠檬 2024-12-14 07:09:05

Good question, did you work it out?

I get the same error with only the code that you have. But if you follow the example from the start (i.e. from the heading 'Transformations' on p. 1), creating a corpus and converting it to a plain text document, you avoid the Java error. I guess the code example in the manual assumes you've already done those two steps.

That said, when I inspect the results, there's no actual stemming... I can't even get @user813966's simple example of stemDocument to do any stemming. I'm looking at the RStem and Snowball packages instead.

In the meantime, the Python package NLTK is my stemming tool.

Update: I got the stemDocument function working by adding language = "english" as follows:

a <- tm_map(a, stemDocument, language = "english")

So the complete answer to your question is to follow all the steps for getting your text into R as described in the tm documentation. You'll also need rJava (and, if you're working on Windows, to set the JAVA_HOME environment variable to the directory containing the jre directory) to make stemDocument work.
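For reference, here is a minimal sketch of the whole sequence (build a corpus, then stem it with an explicit language). It assumes a recent tm release, where stemming is delegated to the SnowballC package rather than Java, so both tm and SnowballC need to be installed. The sample text and the names `texts`, `corpus`, `stemmed` and `result` are illustrative only.

```r
library(tm)  # assumption: recent tm, with the SnowballC package installed

# Illustrative input; punctuation is left out so every token is a bare word
texts <- c("running documents connections")

# Build a corpus of plain text documents from the character vector
corpus <- VCorpus(VectorSource(texts))

# Stem every document, naming the language explicitly
stemmed <- tm_map(corpus, stemDocument, language = "english")

# Pull the stemmed text of the first document back out
result <- content(stemmed[[1]])
print(result)
```

If the printed text is identical to the input, it may be worth checking that SnowballC is installed and that the documents really are plain text documents.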

幽蝶幻影 2024-12-14 07:09:05

I had the same error. I solved it by adding the Snowball .jar and the corresponding /words repository of stem words to my classpath:
C:\Users\xxx.xxx\Documents\R\win-library\2.12\Snowball\java

This was recommended here: http://weka.wikispaces.com/Stemmers

I still get the following warnings, but it works fine now:

Trying to add database driver (JDBC): RmiJdbc.RJDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): jdbc.idbDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.gjt.mm.mysql.Driver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): com.mckoi.JDBCDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.hsqldb.jdbcDriver - Warning, not in CLASSPATH?
[KnowledgeFlow] Loading properties and plugins...
[KnowledgeFlow] Initializing KF...

中性美 2024-12-14 07:09:05

The Snowball stemmer (snowball.jar) cannot find the weka.jar file.

On your computer, you need to search for a file called weka.jar. On my Linux system, it is located at

/usr/local/lib/R/site-library/RWekajars/java/weka.jar
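If weka.jar comes from the RWekajars R package, as the path above suggests, R can probably locate it without a manual search. The following is only a sketch under that assumption; `weka_path` is an illustrative name.

```r
# Ask R where the installed RWekajars package keeps java/weka.jar.
# system.file() returns an empty string if the package or file is missing.
weka_path <- system.file("java", "weka.jar", package = "RWekajars")

if (nzchar(weka_path)) {
  message("Found weka.jar at: ", weka_path)
} else {
  message("weka.jar not found via RWekajars; search the file system instead.")
}
```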

Then, in your R code, add lines similar to these at the top:

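# Path to weka.jar (adjust for your system)
wekajar <- "/usr/local/lib/R/site-library/RWekajars/java/weka.jar"
# Keep the previous value in case it needs to be restored later
oldcp <- Sys.getenv("CLASSPATH")
# Replace the classpath with weka.jar only
Sys.setenv(CLASSPATH = wekajar)

library("tm")
data("crude")
stemDocument(crude[[1]], language = "english")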

This sets the Java CLASSPATH for the R session to the weka.jar file from above. Your existing classpath is overwritten, though. If it had entries that you still need, you can add them back.
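One way to keep the old entries is to put weka.jar in front of whatever CLASSPATH already holds. This is a rough sketch using base R only; `prepend_classpath` is a hypothetical helper, not part of tm or rJava, and it presumably has to run before the Java VM is started in the session.

```r
# Hypothetical helper: put `jar` in front of an existing classpath string.
prepend_classpath <- function(jar, old = Sys.getenv("CLASSPATH")) {
  # Drop the old value when it is empty to avoid a trailing separator
  entries <- c(jar, if (nzchar(old)) old)
  # .Platform$path.sep is ":" on Unix-alikes and ";" on Windows
  paste(entries, collapse = .Platform$path.sep)
}

# Path taken from the answer above; adjust it for your system
wekajar <- "/usr/local/lib/R/site-library/RWekajars/java/weka.jar"
Sys.setenv(CLASSPATH = prepend_classpath(wekajar))
```

Using `.Platform$path.sep` instead of a hard-coded ":" should let the same lines work on Windows, where classpath entries are separated by ";".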
