Faster Python lemmatization
I have been testing different lemmatization methods since the chosen one will be used on a very large corpus. Below are my methods and results. Does anyone have any tips to speed any of these methods up? Spacy was the fastest with part-of-speech tags included (preferred), followed by lemminflect. Am I going about this the wrong way? These functions are applied with pandas .apply() on a DataFrame containing the text.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def prepareString_nltk_current(x):
    lemmatizer = WordNetLemmatizer()
    x = re.sub(r"[^0-9a-z]", " ", x)  # keep only digits and lowercase letters
    if len(x) == 0:
        return ''
    tokens = word_tokenize(x)
    tokens = [lemmatizer.lemmatize(word).strip() for word in tokens if word not in stop_words]
    if len(tokens) == 0:
        return ''
    return ' '.join(tokens)
from pattern.en import lemma

def prepareString_pattern(x):
    error = 'Error'
    x = re.sub(r"[^0-9a-z.,;]", " ", x)
    if len(x) == 0:
        return ''
    try:
        return " ".join([lemma(wd) if wd not in ['this', 'his'] else wd for wd in x.split()])
    except StopIteration:  # pattern can raise StopIteration on its first call under newer Python versions
        return error
import spacy
nlp = spacy.load('en_core_web_sm')

def prepareString_spacy_pretrained(x):
    if len(x) == 0:
        return ''
    doc = nlp(x)
    # token.lemma_ is the lemma string; token.lemma is only the integer hash
    return re.sub(r"[^0-9a-zA-Z]", " ", " ".join(token.lemma_ for token in doc)).lower()
import nltk

def get_wordnet_pos(word):
    """Map the POS tag to the first character lemmatize() accepts, then lemmatize the word."""
    lemmatizer = WordNetLemmatizer()
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": 'a',
                "N": 'n',
                "V": 'v',
                "R": 'r'}
    return lemmatizer.lemmatize(word, tag_dict.get(tag, 'n'))
def prepareString_nltk_pos(x):
    if len(x) == 0:
        return ''
    tokens = word_tokenize(x)
    return " ".join(get_wordnet_pos(w) for w in tokens)
from textblob import TextBlob

def prepareString_textblob(x):
    sent = TextBlob(x)
    tag_dict = {"J": 'a',
                "N": 'n',
                "V": 'v',
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]
    return " ".join([wd.lemmatize(tag) for wd, tag in words_and_tags])
from gensim.utils import lemmatize  # requires the pattern package; removed in gensim 4.x

def prepareString_gensim(x):
    return " ".join([wd.decode('utf-8').split('/')[0] for wd in lemmatize(x)])
import lemminflect  # registers the ._.lemma() extension on spaCy tokens

def prepareString_lemminflect(x):
    doc = nlp(x)
    return " ".join([token._.lemma() for token in doc])
from pattern.en import parsetree

def prepareString_pattern_pos(x):
    s = parsetree(x, tags=True, lemmata=True)
    lemmas = [word.lemma for sentence in s for word in sentence.words]
    return re.sub(r"[^0-9a-zA-Z]", " ", " ".join(lemmas)).lower()
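For reference, the call pattern described above looks roughly like this; the DataFrame and column names are placeholders, not part of the original code:

import pandas as pd

# hypothetical DataFrame with a text column, as described in the question
df = pd.DataFrame({"text": ["The striped bats were hanging on their feet."]})
df["lemmas"] = df["text"].apply(prepareString_spacy_pretrained)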
1 Answer
I think it's the Spacy parsing (creating the POS tags, etc.) that takes the time, not the actual lemmatization. From Lemminflect's README, that library takes on average 42 µs per lemma (not including parsing). It looks like you're spending more like 42 ms (i.e. 1044 s / 26536 lemmas). This means you really need to speed up Spacy's parsing.
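One way to attack that (a sketch, not code from this answer) is to disable the spaCy pipeline components you don't need for lemmatization and batch the texts with nlp.pipe() instead of calling nlp() once per row; the model name, column name, and batch size below are assumptions:

import spacy

# Lemmatization only needs the POS tagger, so parser and NER can usually be disabled
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize_batch(texts):
    # nlp.pipe streams documents in batches, far cheaper than one nlp(text) call per row
    return [" ".join(token.lemma_ for token in doc)
            for doc in nlp.pipe(texts, batch_size=1000)]

df["lemmas"] = lemmatize_batch(df["text"].tolist())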
You can also speed up Lemminflect a bit by calling getLemma() with the param lemmatize_oov=False. This only does a dictionary lemma look-up, which is very fast. It will not lemmatize out-of-vocab words (i.e. misspellings, rare words, ...), which is much slower. Note that you'll have to parse the sentences to get the upos. In Spacy I think this is token.pos_. See lemminflect's Part-Of-Speech Tags documentation for what it expects, and Spacy's docs to verify that this is the .pos_ attribute. However, I think your big issue is the parsing, and small changes in lemmatization speed aren't going to impact you much.
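A minimal sketch of that dictionary-only path, assuming lemminflect's getLemma() signature and spaCy's .pos_ attribute, and falling back to the surface form when the lookup comes back empty:

import spacy
from lemminflect import getLemma

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def fast_lemmas(text):
    doc = nlp(text)
    out = []
    for token in doc:
        # Dictionary-only lookup; out-of-vocab words are left unchanged rather than guessed
        lemmas = getLemma(token.text, upos=token.pos_, lemmatize_oov=False)
        out.append(lemmas[0] if lemmas else token.text)
    return " ".join(out)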
I should also point out that parsing only works if the word appears in a sentence. From your code it looks like you're doing this correctly, but I can't tell for sure. Be sure you are, since the parser can't select the correct POS if you only give it a single word or a small fragment of text.