Getting the base word instead of the root word when stemming in R

Posted on 2024-11-20 00:41:28

Is there a way to get the base word instead of the root word when stemming with NLP tools in R?

Code:

> #Loading libraries
> library(tm)
> library(slam)
> 
> #Vector
> Vec=c("happyness happies happys","sky skies")
> 
> #Creating Corpus
> Txt=Corpus(VectorSource(Vec))
> 
> #Stemming
> Txt=tm_map(Txt, stemDocument)
> 
> #Checking result
> inspect(Txt)
A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
happi happi happi

[[2]]
sky sky

> 

Can I get the base word "happy" instead of the root word "happi" for "happyness happies happys" using R?

Comments (4)

若沐 2024-11-27 00:41:29

When I needed to do something similar, I wrote my list of words out to a text file, fed it to the English Lexicon Project's (ELP) web query tool, and then parsed the result back into R. It is a little clunky, but ELP has a lot of good data.
For your purposes, check out ELP's MorphSP. For happiness, it gives {happy}>ness>

http://elexicon.wustl.edu/query14/query14.asp
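
As a rough sketch of the "parse the result back into R" step, the helper below pulls the base out of a MorphSP string. The name `extract_base` is hypothetical, and the assumption that the base is always wrapped in curly braces rests only on the single {happy}>ness> example above, not on ELP's documentation.

```r
# Hypothetical helper, not part of any package.
# Assumes the base morpheme is the first {...} group in the MorphSP string.
extract_base <- function(morph_sp) {
  has_base <- grepl("\\{[^}]+\\}", morph_sp, perl = TRUE)
  base <- sub("^.*?\\{([^}]+)\\}.*$", "\\1", morph_sp, perl = TRUE)
  # Return NA where no braced group is present
  ifelse(has_base, base, NA_character_)
}

extract_base(c("{happy}>ness>", "sky"))
```

Strings without a braced group come back as NA, so unparsed words are easy to spot afterwards.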

挽清梦 2024-11-27 00:41:28

You're probably looking for a stemmer.
Here are some stemmers from the CRAN Task View: Natural Language Processing:

  • RWeka is an interface to Weka, which is a collection of machine learning algorithms for data mining tasks written in Java. Especially useful in the context of natural language processing is its functionality for tokenization and stemming.

  • Snowball provides the Snowball stemmers which contain the Porter stemmer and several other stemmers for different languages. See the Snowball webpage for details.

  • Rstem is an alternative interface to a C version of Porter's word stemming algorithm.

往日 2024-11-27 00:41:28

Without a good knowledge of English morphology, you would have to use an existing library rather than create your own stemmer.

English is full of unexpected morphological surprises that would affect both probabilistic and rule-based models. Some examples are:

  • Removing an in- prefix to remove an -able suffix, as in inhabitable.
  • Change of the word's category, as in the noun bicycle resulting from stemming the verb bicycling (can affect rules based on categories).
  • Words with negative meanings cannot take negative prefixes (you can have unpretty, but not unugly).
  • Two words as a compound, as in "truck driver" (you would treat them as one word when you stem).

English also has an issue with I-umlaut, where words like men, geese, feet, best, and a host of other words (all with an 'e'-like sound) cannot be easily stemmed. Stemming foreign, borrowed words, like automaton, may also be an issue.

Stemming the superlative form is a good example of exceptions:

best -> good

eldest -> old

A lemmatizer would account for such exceptions, but would be slower. You can look at the Porter stemmer rules to get an idea of what you need, or you can just use the Porter stemmer through the SnowballC R package.
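
One way to picture how a lemmatizer handles such exceptions is a lookup table that is consulted before any rule-based stemming. The sketch below is only an illustration in base R: `exceptions` and `lemma_lookup` are hypothetical names, the table holds just the two examples above, and the fallback is whatever stemming function the caller supplies (it defaults to leaving the word unchanged).

```r
# Irregular forms taken from the examples above; a real table would be far longer.
exceptions <- c(best = "good", eldest = "old")

# Hypothetical helper: check the exception table first, then fall back to
# a rule-based stemmer supplied by the caller (identity by default).
lemma_lookup <- function(words, table = exceptions, fallback = identity) {
  hit <- words %in% names(table)
  out <- fallback(words)
  out[hit] <- unname(table[words[hit]])
  out
}

lemma_lookup(c("best", "eldest", "sky"))
```

If SnowballC is installed, passing `fallback = SnowballC::wordStem` would apply its stemmer to every word that is not in the table.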

我一向站在原地 2024-11-27 00:41:28

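stemCompletion could be used here. It is not the best option, but it is workable.

# Txtt is not defined in this answer; it is presumably a copy of the corpus saved before stemming.
Stemm = tm_map(Txt, stemCompletion, dictionary=Txtt)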
inspect(Stemm)

A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
happyness happies happies

[[2]]
sky sky
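
The answer does not say where `Txtt` comes from. A minimal sketch, assuming it stands for the un-stemmed text, is to keep the original word forms and pass them as the dictionary. The variable names `words`, `stems` and `completed` are illustrative only, and the sketch works on a plain character vector rather than through tm_map.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

Vec <- c("happyness happies happys", "sky skies")

# Keep the original word forms; they act as the completion dictionary
# (assumed to be what `Txtt` holds in the answer above).
words <- unlist(strsplit(Vec, " ", fixed = TRUE))

# Stem each word, then map every stem back to a dictionary word
stems <- stemDocument(words)
completed <- stemCompletion(stems, dictionary = words, type = "first")
print(completed)
```

Completion can only return word forms that appear in the dictionary, so a base word such as "happy" cannot come back unless a matching entry is there to be found.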