spaCy vs NLTK word_tokenize benchmarking

Posted on 2025-01-23 16:19:14


[My code][1]

import nltk
import spacy

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner', 'tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer'])
nlp.max_length = 4532554

BIGDATA = open("/Users/harikaranharithas/Downloads/data/en/en.txt",'r')
BIGDATA_R = BIGDATA.read()

Nw = %timeit -o nltk.tokenize.word_tokenize(BIGDATA_R[0:1000000])
OUT - 1.35 s ± 139 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sw = %timeit -o nlp(BIGDATA_R[0:1000000])
OUT - 125 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
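For reference, %timeit -o returns an IPython TimeitResult object, so the two runs can be compared numerically. A minimal sketch, assuming the Nw and Sw variables captured above:

# Compare the two IPython TimeitResult objects captured with %timeit -o
print(f"NLTK  mean: {Nw.average:.3f} s")
print(f"spaCy mean: {Sw.average:.3f} s")
print(f"ratio (NLTK / spaCy): {Nw.average / Sw.average:.1f}x")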

BIGDATA is a text file consisting of several Wikipedia articles (about 2B words).

My PC specs are:
MacBook Pro (16-inch, 2019)
2.6 GHz 6-Core Intel Core i7
16 GB 2667 MHz DDR4
Intel UHD Graphics 630 1536 MB (+4 GB AMD Radeon Pro 5500M)

Isn't spaCy supposed to be faster than NLTK? What am I doing wrong? I have read in papers that spaCy is roughly 8 times faster at word tokenization. How do I benchmark spaCy and NLTK correctly?
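One thing worth checking is whether the disable= list really leaves an empty pipeline, so that calling nlp(text) does little more than tokenization. A small sketch, assuming spaCy v3 and the nlp object loaded above:

# Components still active after spacy.load(..., disable=[...])
print(nlp.pipe_names)        # expected: [] if everything listed was disabled
# All components the package ships with, including the disabled ones
print(nlp.component_names)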


Comments (1)

沙沙粒小 2025-01-30 16:19:14


Passing the documents to nlp might be doing more than just tokenizing.

Can you try to explicitly use only the tokenizer?

from spacy.lang.en import English
nlp = English()

# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

tokenizer(BIGDATA_R[0:100000])
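As a rough follow-up sketch, the same slice could then be timed against the bare tokenizer on both sides (assuming the BIGDATA_R string from the question and an IPython session for %timeit):

import nltk
from spacy.lang.en import English

nlp = English()              # blank English pipeline: tokenizer only, no trained components
tokenizer = nlp.tokenizer

text = BIGDATA_R[0:1000000]  # same 1M-character slice as in the question

%timeit nltk.tokenize.word_tokenize(text)   # NLTK word tokenization only
%timeit tokenizer(text)                     # spaCy tokenization only

# Sanity check that both produce a comparable number of tokens
print(len(nltk.tokenize.word_tokenize(text)), len(tokenizer(text)))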