Getting the base word instead of the root word when stemming in R

Posted 2024-11-20 00:41:28

Is there a way to get the base word instead of the root word when stemming text with NLP tools in R?

Code:

> #Loading libraries
> library(tm)
> library(slam)
> 
> #Vector
> Vec=c("happyness happies happys","sky skies")
> 
> #Creating Corpus
> Txt=Corpus(VectorSource(Vec))
> 
> #Stemming
> Txt=tm_map(Txt, stemDocument)
> 
> #Checking result
> inspect(Txt)
A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
happi happi happi

[[2]]
sky sky

>

Can I use R to get the base word "happy" instead of the root word "happi" for "happyness happies happys"?



若沐 2024-11-27 00:41:29

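When I needed to do something similar, I wrote the list of words to a text file, submitted it to the English Lexicon Project's web query tool, and then parsed the results back into R. It is a little clunky, but a lot of good data is available from the ELP.
For your use case, look at the ELP's MorphSP field. For happiness it gives {happy}>ness>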

http://elexicon.wustl.edu/query14/query14.asp
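The "parse the results back into R" step could look roughly like the sketch below. It assumes MorphSP values follow the pattern {base}>suffix>; the strings in `morph_sp` are made up for illustration and the real ELP export format may differ.

```r
# Hypothetical MorphSP strings, modeled on the pattern {base}>suffix>;
# the real ELP export format is an assumption here.
morph_sp <- c("{happy}>ness>", "{sky}")

# Keep only the text inside the first pair of braces.
base_words <- sub("^[^{]*\\{([^}]*)\\}.*$", "\\1", morph_sp)
print(base_words)
```

Base R's `sub` is enough for this, so no extra package is needed for the parsing itself.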


挽清梦 2024-11-27 00:41:28

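You are probably looking for a stemmer.
Here are some stemmers from the CRAN Task View: Natural Language Processing:

RWeka is an interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java. Especially useful in the context of natural language processing is its functionality for tokenization and stemming.
Snowball provides the Snowball stemmers, which include the Porter stemmer and several other stemmers for different languages. See the Snowball webpage for details.
Rstem is an alternative interface to a C version of Porter's word stemming algorithm.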
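As a minimal sketch of calling one of these stemmers directly, the snippet below uses `wordStem` from the SnowballC package (named in another answer on this page). It assumes SnowballC is installed and skips the call otherwise. Expect stems rather than dictionary words from it.

```r
words <- c("happyness", "happies", "happys", "sky", "skies")

# SnowballC is assumed to be installed: install.packages("SnowballC")
if (requireNamespace("SnowballC", quietly = TRUE)) {
  # wordStem() takes a character vector and returns one stem per word.
  stems <- SnowballC::wordStem(words, language = "english")
  print(data.frame(word = words, stem = stems))
}
```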


往日 2024-11-27 00:41:28

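Without a good knowledge of English morphology, you are better off using an existing library than writing your own stemmer.

English is full of morphological surprises that trip up both probabilistic and rule-based models. Some examples:

Removing the in- prefix together with the -able suffix, as in inhabitable.
A change of word category, as when the noun bicycle comes out as the stem of the verb bicycling (this can affect category-based rules).
Words with a negative meaning do not take negative prefixes (you can say unpretty, but not unugly).
Two words acting as one compound, as in truck driver (when you stem, you treat them as one word).

English also has the I-umlaut problem: words such as men, geese, feet, best, and many others (all with an "e"-like sound) cannot be stemmed easily. Stemming foreign borrowings, such as automaton, can also be a problem.

Superlative forms are a good example of exceptions:

best -> good

eldest -> old

A lemmatizer can account for such exceptions, but it will be slower. You can look at the Porter stemmer rules to see what you would need, or you can simply use the SnowballC R package.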
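To illustrate how exceptions like these can be handled, here is a rough base R sketch that checks a small lookup table before falling back to a stemmer. The table `irregular` and the helper `lemmatize` are hypothetical names made up for this example; a real lemmatizer uses a far larger dictionary.

```r
# Hypothetical exception table: irregular form -> base word.
irregular <- c(best = "good", eldest = "old", men = "man",
               geese = "goose", feet = "foot")

# Hypothetical helper: use the exception table when a word is listed,
# otherwise fall back to the supplied stemming function.
lemmatize <- function(words, stem_fun = identity) {
  known <- words %in% names(irregular)
  out <- stem_fun(words)
  out[known] <- unname(irregular[words[known]])
  out
}

lemmas <- lemmatize(c("best", "eldest", "geese", "sky"))
print(lemmas)
```

Any function that maps a character vector to a character vector can be passed as `stem_fun`; the default `identity` simply leaves unlisted words unchanged.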


我一向站在原地 2024-11-27 00:41:28

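stemCompletion can be used here. It is not the best option, but it is workable.

# Txtt is not defined in the question; it is assumed to be a copy of the corpus saved before stemming
Stemm = tm_map(Txt, stemCompletion, dictionary=Txtt)
inspect(Stemm)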

A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
happyness happies happies

[[2]]
sky sky
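The output above still contains "happies" rather than "happy". As far as I understand it, stem completion matches each stem as a prefix of the dictionary words, and "happy" does not start with "happi". The base R sketch below imitates that idea; `complete_stem` is a hypothetical helper written for this example, not the tm implementation, and picking the shortest match is just one possible rule.

```r
# Hypothetical helper: complete each stem to the shortest dictionary word
# that starts with it; stems with no match are returned unchanged.
complete_stem <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) s else hits[which.min(nchar(hits))]
  }, character(1), USE.NAMES = FALSE)
}

dictionary <- c("happyness", "happies", "happys", "sky", "skies")
completed <- complete_stem(c("happi", "sky"), dictionary)
print(completed)
```

Getting "happy" out of this approach requires "happy" itself to be in the dictionary and the stem to be a prefix of it, which is why a lemmatizer or a lookup table is usually the better fit for base words.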

