How do I do word stemming or lemmatization?
I've tried PorterStemmer and Snowball, but neither works on all words; they miss some very common ones.
My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.
See also:
If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.
Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:
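For example, from a Python session:

    import nltk
    nltk.download('wordnet')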
You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:
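A minimal sketch (the output comments assume the standard WordNet corpus; verbs need an explicit POS hint, since the default is noun):

    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()
    print(lmtzr.lemmatize('cats'))      # cat
    print(lmtzr.lemmatize('cacti'))     # cactus
    print(lmtzr.lemmatize('ran', 'v'))  # run (the 'v' marks it as a verb)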
There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.
I use Stanford NLP to perform lemmatization. I was stuck with a similar problem in the last few days. All thanks to stackoverflow for helping me solve the issue.

It also might be a good idea to use stopwords to minimize the output lemmas if they're used later in a classifier. Please take a look at the coreNlp extension written by John Conwell.
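If you'd rather drive the Stanford pipeline from Python than from Java, here is a minimal sketch using Stanford NLP's stanza package (an assumption on my part; the answer above likely used CoreNLP from Java directly):

    import stanza

    # One-time model download: stanza.download('en')
    nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')
    doc = nlp('cats running ran cactus cactuses cacti community communities')
    for word in doc.sentences[0].words:
        print(word.text, '->', word.lemma)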
I tried your list of terms on this snowball demo site and the results look okay.
A stemmer is supposed to turn inflected forms of words down to some common root. It's not really a stemmer's job to make that root a 'proper' dictionary word. For that you need to look at morphological/orthographic analysers.
I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.
The stemmer vs lemmatizer debate goes on. It's a matter of preferring precision over efficiency. You should lemmatize to achieve linguistically meaningful units, and stem to use minimal computing juice while still indexing a word and its variations under the same key.
See Stemmers vs Lemmatizers
Here's an example with Python NLTK:
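A minimal sketch contrasting the two approaches on the question's test words (assumes the wordnet corpus has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    porter = PorterStemmer()
    wnl = WordNetLemmatizer()
    for w in 'cats running ran cactus cactuses cacti community communities'.split():
        print(w, '| stem:', porter.stem(w), '| lemma:', wnl.lemmatize(w))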
Martin Porter's official page contains a Porter Stemmer in PHP as well as other languages.
If you're really serious about good stemming though, you're going to need to start with something like the Porter Algorithm, refine it by adding rules to fix incorrect cases common to your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries) where the key is the word to look up and the value is the stemmed word to replace the original. A commercial search engine I worked on once ended up with some 800 exceptions to a modified Porter algorithm.
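A sketch of that key/value exception layer in Python; the exception entries here are made-up examples, not the ones from the commercial engine:

    from nltk.stem import PorterStemmer

    # Hypothetical correction table: word -> stem that overrides the algorithm.
    EXCEPTIONS = {'ran': 'run', 'cacti': 'cactus', 'communities': 'community'}

    porter = PorterStemmer()

    def stem(word):
        # Look the word up first; fall back to the Porter algorithm otherwise.
        w = word.lower()
        return EXCEPTIONS.get(w, porter.stem(w))

    print([stem(w) for w in 'cats ran cacti communities'.split()])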
Based on various answers on Stack Overflow and blogs I've come across, this is the method I'm using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (use whichever method you'd like), and then find the parts of speech (POS) for those words and use that to help stem and lemmatize the words.
Your sample above doesn't work too well, because the POS can't be determined. However, if we use a real sentence, things work much better.
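A hedged sketch of the POS-guided approach described above, using NLTK (assumes the punkt, tagger, and wordnet data packages are installed; the sentence is just an illustration):

    import nltk
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    def wordnet_pos(treebank_tag):
        # Map Penn Treebank tags onto the tag set WordNet's lemmatizer expects.
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        if treebank_tag.startswith('V'):
            return wordnet.VERB
        if treebank_tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN

    wnl = WordNetLemmatizer()
    tokens = nltk.word_tokenize('The cats were running while the cacti grew.')
    for word, tag in nltk.pos_tag(tokens):
        print(word, '->', wnl.lemmatize(word, wordnet_pos(tag)))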
http://wordnet.princeton.edu/man/morph.3WN
For a lot of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive Porter stemming.
http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.
Look into WordNet, a large lexical database for the English language:
http://wordnet.princeton.edu/
There are APIs for accessing it in several languages.
This looks interesting:
MIT Java WordnetStemmer:
http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html
Take a look at LemmaGen, an open-source library written in C# 3.0.

Results for your test words can be checked at the demo service (http://lemmatise.ijs.si/Services).
The top Python packages (in no specific order) for lemmatization are: spacy, nltk, gensim, pattern, CoreNLP and TextBlob. I prefer spaCy's and gensim's implementations (gensim's is based on pattern) because they identify the POS tag of the word and assign the appropriate lemma automatically. This gives more relevant lemmas, keeping the meaning intact. If you plan to use nltk or TextBlob, you need to take care of finding the right POS tag manually and then find the right lemma.
Lemmatization Example with spaCy:
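A minimal sketch (assumes the small English model has been installed with python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp('cats running ran cactus cactuses cacti community communities')
    print([token.lemma_ for token in doc])
    # Roughly: ['cat', 'run', 'run', 'cactus', 'cactus', 'cactus', 'community', 'community']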
Lemmatization Example with Gensim:
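A minimal sketch; note that gensim.utils.lemmatize was removed in gensim 4.x, so this assumes gensim < 4.0 with the pattern package installed:

    from gensim.utils import lemmatize

    print(lemmatize('cats running ran cactus cactuses cacti community communities'))
    # Tokens come back as b'lemma/POS' byte strings, e.g. b'cat/NN', b'run/VB'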
The above examples were borrowed from this lemmatization page.
If I may quote my answer to the question StompChicken mentioned:
As they have no understanding of the language and do not run from a dictionary of terms, they have no way of recognizing and responding appropriately to irregular cases, such as "run"/"ran".
If you need to handle irregular cases, you'll need to either choose a different approach or augment your stemming with your own custom dictionary of corrections to run after the stemmer has done its thing.
The most current version of the stemmer in NLTK is Snowball.
You can find examples on how to use it here:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.snowball2-pysrc.html#demo
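A minimal usage sketch as well:

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer('english')
    for w in 'cats running ran cactus cactuses cacti community communities'.split():
        print(w, '->', stemmer.stem(w))  # stems need not be dictionary words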
You could use the Morpha stemmer. UW has uploaded the Morpha stemmer to Maven central if you plan to use it from a Java application. There's a wrapper that makes it much easier to use. You just need to add it as a dependency and use the edu.washington.cs.knowitall.morpha.MorphaStemmer class. Instances are thread-safe (the original JFlex had class fields for local variables unnecessarily). Instantiate a class and run morpha with the word you want to stem.
Do a search for Lucene; I'm not sure if there's a PHP port, but I do know Lucene is available for many platforms. Lucene is an OSS (from Apache) indexing and search library. Naturally it and community extras might have something interesting to look at. At the very least you can learn how it's done in one language so you can translate the "idea" into PHP.
.Net Lucene has an inbuilt Porter stemmer. You can try that. But note that Porter stemming does not consider word context when deriving the lemma. (Go through the algorithm and its implementation and you will see how it works.)
Martin Porter wrote Snowball (a language for stemming algorithms) and rewrote the "English Stemmer" in Snowball. There is an English Stemmer for C and Java.
He explicitly states that the Porter Stemmer has been reimplemented only for historical reasons, so testing stemming correctness against the Porter Stemmer will get you results that you (should) already know.
Dr. Porter suggests using the English or Porter2 stemmer instead of the Porter stemmer. The English stemmer is what's actually used in the demo site from @StompChicken's earlier answer.
In Java, I use tartarus-snowball to stem words.
Maven:
Sample code:
Try this one here: http://www.twinword.com/lemmatizer.php
I entered your query "cats running ran cactus cactuses cacti community communities" in the demo and got ["cat", "running", "run", "cactus", "cactus", "cactus", "community", "community"] with the optional flag ALL_TOKENS.

Sample Code
This is an API so you can connect to it from any environment. Here is what the PHP REST call may look like.
I highly recommend using Spacy (base text parsing & tagging) and Textacy (higher level text processing built on top of Spacy).
Lemmatized words are available by default in Spacy as a token's .lemma_ attribute, and text can be lemmatized while doing a lot of other text preprocessing with textacy, for example while creating a bag of terms or words, or generally just before performing some processing that requires it.

I'd encourage you to check out both before writing any code, as this may save you a lot of time!