How do I do word stemming or lemmatization?

Posted on 2024-07-16 23:06:49

I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

那小子欠揍 2024-07-23 23:06:49

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

物价感观 2024-07-23 23:06:49

I use Stanford NLP to perform lemmatization. I was stuck on a similar problem for the last few days. All thanks to Stack Overflow for helping me solve the issue.

import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class example
{
    public static void main(String[] args)
    {
        // Lemmatization needs tokenization, sentence splitting and POS tagging first.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
        String text = "cats running ran cactus cactuses cacti community communities"; // replace with your own text
        Annotation document = pipeline.process(text);

        for(CoreMap sentence: document.get(SentencesAnnotation.class))
        {
            for(CoreLabel token: sentence.get(TokensAnnotation.class))
            {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println(word + " -> lemmatized version: " + lemma);
            }
        }
    }
}

It also might be a good idea to use stopwords to minimize the output lemmas if they are used later in a classifier. Please take a look at the coreNlp extension written by John Conwell.
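As a rough illustration of the stop-word idea, the sketch below filters a list of lemmas before they are handed to a classifier. It is written in Python rather than Java for brevity; the `STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for this example, not part of CoreNLP or of the extension mentioned above.

```python
# Sketch: drop stop words from a list of lemmas before building classifier features.
# STOP_WORDS is a tiny hand-picked set for illustration, not a standard list.
STOP_WORDS = {"the", "a", "an", "to", "be", "of", "and", "in", "on", "for"}


def remove_stop_words(lemmas, stop_words=STOP_WORDS):
    """Return the lemmas that are not in the stop-word set (case-insensitive)."""
    return [lemma for lemma in lemmas if lemma.lower() not in stop_words]


lemmas = ["the", "cactus", "run", "to", "the", "community"]
print(remove_stop_words(lemmas))
```

In practice you would probably swap the hand-picked set for the stop-word list that ships with your NLP toolkit.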

弱骨蛰伏 2024-07-23 23:06:49

I tried your list of terms on this snowball demo site and the results look okay....

  • cats -> cat
  • running -> run
  • ran -> ran
  • cactus -> cactus
  • cactuses -> cactus
  • community -> communiti
  • communities -> communiti

A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.
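To make the point concrete, here is a minimal sketch of a rule-based suffix stripper. It is a toy with a handful of plural rules loosely modelled on the first step of the Porter algorithm, not the Snowball or Porter implementation; the name `toy_stem` and the rule table are assumptions made for illustration.

```python
# Toy suffix stripper: a few plural rules only, NOT the real Porter/Snowball algorithm.
# The name toy_stem and the rule table are hypothetical, chosen for illustration.
RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "i"),    # communities -> communiti
    ("ss", "ss"),    # glass -> glass (protects the double s)
    ("s", ""),       # cats -> cat
]


def toy_stem(word):
    """Return a crude stem by applying the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            word = word[: len(word) - len(suffix)] + replacement
            break
    # Turn a final y after a consonant into i so singular and plural share one key.
    if len(word) > 2 and word.endswith("y") and word[-2] not in "aeiou":
        word = word[:-1] + "i"
    return word


for w in ["cats", "community", "communities"]:
    print(w, "->", toy_stem(w))
```

Both "community" and "communities" end up under the key "communiti", which is useful for matching even though it is not a dictionary word.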

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.

陌若浮生 2024-07-23 23:06:49

The stemmer vs lemmatizer debate goes on. It's a matter of preferring precision over efficiency. You should lemmatize to achieve linguistically meaningful units, and stem to use minimal computing juice and still index a word and its variations under the same key.

See Stemmers vs Lemmatizers

Here's an example with python NLTK:

>>> sent = "cats running ran cactus cactuses cacti community communities"
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>>
>>> port = PorterStemmer()
>>> " ".join([port.stem(i) for i in sent.split()])
'cat run ran cactu cactus cacti commun commun'
>>>
>>> wnl = WordNetLemmatizer()
>>> " ".join([wnl.lemmatize(i) for i in sent.split()])
'cat running ran cactus cactus cactus community community'
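
The phrase "index a word and its variations under the same key" can be sketched as a small inverted index. The `normalize` function below is a hypothetical stand-in for whichever stemmer or lemmatizer you pick; here it only strips a plain plural "s", and `build_index` is likewise a name invented for this example.

```python
from collections import defaultdict

# Sketch: index documents under a normalized key so that a query for one
# form of a word also finds the other forms. normalize() is a placeholder
# for a real stemmer or lemmatizer.


def normalize(word):
    """Lower-case the word and strip a plain plural 's' (toy rule)."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index


docs = {1: "two cats", 2: "a cat sleeps", 3: "a dog barks"}
index = build_index(docs)
print(sorted(index[normalize("cats")]))
```

The query term is passed through the same `normalize` call as the indexed text, so "cats" and "cat" hit the same entry. Whether the key is a real word does not matter for retrieval, which is why a cheap stemmer is often enough here.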
荒岛晴空 2024-07-23 23:06:49

Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.

If you're really serious about good stemming, though, you're going to need to start with something like the Porter algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I worked on once ended up with some 800 exceptions to a modified Porter algorithm.
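A minimal sketch of that key/value idea follows, assuming an in-memory dictionary rather than a dbm file. `base_stem` is a deliberately crude placeholder for a real Porter-style stemmer, and the entries in `EXCEPTIONS` are examples picked for illustration, not a recommended list.

```python
# Sketch of an exception table layered over a rule-based stemmer.
# base_stem is a stand-in for a real Porter-style stemmer; it and the
# entries in EXCEPTIONS are assumptions made for illustration.
EXCEPTIONS = {
    "ran": "run",        # irregular verb form the rules cannot reach
    "cacti": "cactus",   # irregular plural
    "feet": "foot",
    "cactus": "cactus",  # protect a word the rules would otherwise mangle
}


def base_stem(word):
    """Placeholder stemmer: strips a plain plural 's' and nothing else."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stem(word, exceptions=EXCEPTIONS):
    """Look the word up in the exception table first, then fall back to the rules."""
    word = word.lower()
    if word in exceptions:
        return exceptions[word]
    return base_stem(word)


print([stem(w) for w in "cats ran cactus cacti feet".split()])
```

Checking the table before running the rules keeps the exceptions independent of how the rules change later. The same table can also be applied to the stemmer's output instead, if you prefer to correct stems rather than input words.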

耳钉梦 2024-07-23 23:06:49

Based on various answers on Stack Overflow and blogs I've come across, this is the method I'm using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (use whichever method you'd like), and then find the parts of speech (POS) for those words and use that to help stem and lemmatize the words.

Your sample above doesn't work too well, because the POS can't be determined. However, if we use a real sentence, things work much better.

import nltk
from nltk.corpus import wordnet

lmtzr = nltk.WordNetLemmatizer().lemmatize


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


def normalize_text(text):
    word_pos = nltk.pos_tag(nltk.word_tokenize(text))
    lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]

    return [x.lower() for x in lemm_words]

print(normalize_text('cats running ran cactus cactuses cacti community communities'))
# ['cat', 'run', 'ran', 'cactus', 'cactuses', 'cacti', 'community', 'community']

print(normalize_text('The cactus ran to the community to see the cats running around cacti between communities.'))
# ['the', 'cactus', 'run', 'to', 'the', 'community', 'to', 'see', 'the', 'cat', 'run', 'around', 'cactus', 'between', 'community', '.']
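
The tag mapping is the part that is easy to get wrong, so here it is in isolation as a dependency-free sketch. The letters 'a', 'v', 'n' and 'r' are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` in NLTK; the helper names `treebank_to_wordnet` and `naive_pos` are hypothetical.

```python
# Sketch: why Penn Treebank tags need an explicit mapping to WordNet POS letters.
# 'a', 'v', 'n', 'r' mirror wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV;
# the helper names are invented for this example.
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], "n")


def naive_pos(tag):
    """Tempting shortcut: lower-case the first letter of the tag."""
    return tag[:1].lower()


for tag in ["NNS", "VBD", "JJ", "RB", "DT"]:
    print(tag, treebank_to_wordnet(tag), naive_pos(tag))
```

The shortcut agrees with the mapping for nouns, verbs and adverbs, but it yields 'j' for adjectives and 'd' for determiners, neither of which is a WordNet part of speech. That is why the answer falls back to noun for everything it does not recognize.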
铁憨憨 2024-07-23 23:06:49

http://wordnet.princeton.edu/man/morph.3WN

For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive Porter stemming.

http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

锦上情书 2024-07-23 23:06:49

Look into WordNet, a large lexical database for the English language:

http://wordnet.princeton.edu/

There are APIs for accessing it in several languages.

深爱成瘾 2024-07-23 23:06:49

Take a look at LemmaGen - open source library written in C# 3.0.

Results for your test words (http://lemmatise.ijs.si/Services)

  • cats -> cat
  • running
  • ran -> run
  • cactus
  • cactuses -> cactus
  • cacti -> cactus
  • community
  • communities -> community
葬シ愛 2024-07-23 23:06:49

The top Python packages (in no specific order) for lemmatization are: spacy, nltk, gensim, pattern, CoreNLP and TextBlob. I prefer spaCy's and gensim's implementations (the latter based on pattern) because they identify the POS tag of the word and assign the appropriate lemma automatically. This gives more relevant lemmas, keeping the meaning intact.

If you plan to use nltk or TextBlob, you need to take care of finding the right POS tag manually and then find the right lemma.

Lemmatization Example with spaCy:

# Run the two commands below in a terminal once (spaCy 2.x; in spaCy 3.x the 'en'
# shortcut no longer exists, so download and load 'en_core_web_sm' instead).
#   pip install spacy
#   python -m spacy download en

import spacy

# Initialize spacy 'en' model
nlp = spacy.load('en', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

# Parse
doc = nlp(sentence)

# Extract the lemma
" ".join([token.lemma_ for token in doc])
#> 'the strip bat be hang on -PRON- foot for good'

Lemmatization Example With Gensim:

# Requires gensim < 4.0 and the pattern package; gensim.utils.lemmatize was removed in gensim 4.0.
from gensim.utils import lemmatize
sentence = "The striped bats were hanging on their feet and ate best fishes"
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
#> ['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']

The above examples were borrowed from this lemmatization page.

海未深 2024-07-23 23:06:49

If I may quote my answer to the question StompChicken mentioned:

The core issue here is that stemming algorithms operate on a phonetic basis with no actual understanding of the language they're working with.

As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".

If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.

初吻给了烟 2024-07-23 23:06:49

The most current version of the stemmer in NLTK is Snowball.

You can find examples on how to use it here:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball2-pysrc.html#demo

剑心龙吟 2024-07-23 23:06:49

You could use the Morpha stemmer. UW has uploaded the Morpha stemmer to Maven Central if you plan to use it from a Java application. There's a wrapper that makes it much easier to use. You just need to add it as a dependency and use the edu.washington.cs.knowitall.morpha.MorphaStemmer class. Instances are threadsafe (the original JFlex had class fields for local variables unnecessarily). Instantiate the class and call morpha with the word you want to stem.

new MorphaStemmer().morpha("climbed") // goes to "climb"
哽咽笑 2024-07-23 23:06:49

Do a search for Lucene. I'm not sure if there's a PHP port, but I do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally, it and its community extras might have something interesting to look at. At the very least you can learn how it's done in one language so you can translate the "idea" into PHP.

初熏 2024-07-23 23:06:49

Lucene.Net has a built-in Porter stemmer. You can try that. But note that Porter stemming does not consider word context when deriving the lemma. (Go through the algorithm and its implementation and you will see how it works.)

暮倦 2024-07-23 23:06:49

Martin Porter wrote Snowball (a language for stemming algorithms) and rewrote the "English Stemmer" in Snowball. There is an English Stemmer for C and Java.

He explicitly states that the Porter Stemmer has been reimplemented only for historical reasons, so testing stemming correctness against the Porter Stemmer will get you results that you (should) already know.

From http://tartarus.org/~martin/PorterStemmer/index.html (emphasis mine)

The Porter stemmer should be regarded as ‘frozen’, that is, strictly defined, and not amenable to further modification. As a stemmer, it is slightly inferior to the Snowball English or Porter2 stemmer, which derives from it, and which is subjected to occasional improvements. For practical work, therefore, the new Snowball stemmer is recommended. The Porter stemmer is appropriate to IR research work involving stemming where the experiments need to be exactly repeatable.

Dr. Porter suggests using the English or Porter2 stemmers instead of the Porter stemmer. The English stemmer is what's actually used in the demo site, as @StompChicken answered earlier.

街角迷惘 2024-07-23 23:06:49

In Java, I use tartarus-snowball to stem words.

Maven:

<dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-snowball</artifactId>
        <version>3.0.3</version>
        <scope>test</scope>
</dependency>

Sample code:

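import org.tartarus.snowball.SnowballProgram;
import org.tartarus.snowball.ext.EnglishStemmer;

SnowballProgram stemmer = new EnglishStemmer();
String[] words = new String[]{
    "testing",
    "skincare",
    "eyecare",
    "eye",
    "worked",
    "read"
};
for (String word : words) {
    // Load the word, run the stemmer in place, then read the result back.
    stemmer.setCurrent(word);
    stemmer.stem();
    // result: test, skincar, eyecar, eye, work, read
    System.out.println("Origin: " + word + " > " + stemmer.getCurrent());
}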
顾冷 2024-07-23 23:06:49

Try this one here: http://www.twinword.com/lemmatizer.php

I entered your query in the demo "cats running ran cactus cactuses cacti community communities" and got ["cat", "running", "run", "cactus", "cactus", "cactus", "community", "community"] with the optional flag ALL_TOKENS.

Sample Code

This is an API so you can connect to it from any environment. Here is what the PHP REST call may look like.

// These code snippets use an open-source library. http://unirest.io/php
$response = Unirest\Request::post([ENDPOINT],
  array(
    "X-Mashape-Key" => [API KEY],
    "Content-Type" => "application/x-www-form-urlencoded",
    "Accept" => "application/json"
  ),
  array(
    "text" => "cats running ran cactus cactuses cacti community communities"
  )
);
樱娆 2024-07-23 23:06:49

I highly recommend using Spacy (base text parsing & tagging) and Textacy (higher level text processing built on top of Spacy).

Lemmatized words are available by default in Spacy as a token's .lemma_ attribute, and text can be lemmatized while doing a lot of other text preprocessing with textacy, for example while creating a bag of terms or words, or generally just before performing some processing that requires it.

I'd encourage you to check out both before writing any code, as this may save you a lot of time!

倾听心声的旋律 2024-07-23 23:06:49
import re

import nltk
from nltk.corpus import stopwords
from pymorphy2 import MorphAnalyzer

# One-time download of the stop word lists.
nltk.download('stopwords')

stopwords_ru = stopwords.words("russian")
morph = MorphAnalyzer()


def to_lowercase(data):
    return data.lower()


def noise_remove(data, remove_numbers=True):
    # Replace URLs with a space.
    data = re.sub(r"(\w+:\/\/\S+)", " ", data)
    # Replace everything except digits and Latin/Cyrillic letters with a space.
    data = re.sub(r"([^0-9A-Za-zА-Яа-я])", " ", data)
    if remove_numbers:
        data = re.sub(r"\d+", " ", data)
    return data


def lemmatize(words):
    text = []
    for word in words:
        # Take the first (most probable) parse of the word.
        morph_word = morph.parse(word)[0]
        lemma = morph_word.normal_form
        # Keep nouns, full adjectives, infinitives and short participles that are not stop words.
        if morph_word.tag.POS in ['NOUN', 'ADJF', 'INFN', 'PRTS'] and lemma not in stopwords_ru:
            text.append(lemma)
    return text


def tokenize(text):
    # Drop tokens shorter than three characters, then lemmatize the rest.
    words = [elem for elem in text.split() if len(elem) >= 3]
    return ' '.join(lemmatize(words))
一梦浮鱼 2024-07-23 23:06:49
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

df_plots = pd.read_excel("Plot Summary.xlsx", index_col=0)

# Print the first sentence of the first row and the last sentence of the last row
print(nltk.sent_tokenize(df_plots.iloc[0].Plot)[0] + nltk.sent_tokenize(df_plots.iloc[-1].Plot)[-1])

# Calculate the length of every plot in words
df_plots["Length"] = df_plots.Plot.apply(lambda x: len(nltk.word_tokenize(x)))

print("Longest plot is for season", df_plots.Length.idxmax())
print("Shortest plot is for season", df_plots.Length.idxmin())


# What is this show about? (What are the top 3 words used, excluding the
# stop words, in all the seasons combined)

wnl = WordNetLemmatizer()


def wordnet_pos(tag):
    # Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[0], "n")


word_sample = ["struggled", "died"]
print([wnl.lemmatize(word, pos=wordnet_pos(tag)) for word, tag in nltk.pos_tag(word_sample)])

# Figure out the stop words
stop = set(stopwords.words('english'))

# Tokenize all the plots
df_plots["Tokenized"] = df_plots.Plot.apply(lambda x: nltk.word_tokenize(x.lower()))

# Remove the stop words
df_plots["Filtered"] = df_plots.Tokenized.apply(lambda x: [word for word in x if word not in stop])

# Lemmatize each word
df_plots["POS"] = df_plots.Filtered.apply(nltk.pos_tag)
df_plots["Lemmatized"] = df_plots.POS.apply(lambda x: [wnl.lemmatize(word, pos=wordnet_pos(tag)) for word, tag in x])


# Which season had the highest screenplay of "Jesse" compared to "Walt"?
# Screenplay of Jesse = (occurrences of "Jesse") / (occurrences of "Jesse" + occurrences of "Walt")

tokens_by_season = df_plots.groupby("Season").Tokenized.sum()
share = tokens_by_season.apply(lambda x: x.count("jesse") * 100 / (x.count("jesse") + x.count("walter") + x.count("walt")))

print("The highest times Jesse was mentioned compared to Walter/Walt was in season", share.idxmax())