Spacy vs NLTK word tokenize benchmarking
[My code][1]
import nltk
import spacy

nlp = spacy.load("en_core_web_sm",
                 disable=['parser', 'ner', 'tok2vec', 'tagger',
                          'attribute_ruler', 'lemmatizer'])
nlp.max_length = 4532554
BIGDATA = open("/Users/harikaranharithas/Downloads/data/en/en.txt", 'r')
BIGDATA_R = BIGDATA.read()
Nw = %timeit -o nltk.tokenize.word_tokenize(BIGDATA_R[0:1000000])
OUT - 1.35 s ± 139 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sw = %timeit -o nlp(BIGDATA_R[0:1000000])
OUT - 125 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
BIGDATA is a text file consisting of several Wikipedia articles (2B words).
My PC specs are:
MacBook Pro (16-inch, 2019)
2.6 GHz 6-Core Intel Core i7
16 GB 2667 MHz DDR4
Intel UHD Graphics 630 1536 MB (+ 4 GB Radeon Pro 5500M)
Isn't spaCy faster than NLTK? What am I doing wrong? I have read in papers that spaCy is about 8 times faster at word tokenization. How do I benchmark spaCy and NLTK correctly?
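For reference, the same comparison can be run as a plain script with the standard `timeit` module instead of IPython's `%timeit`; a minimal sketch, assuming the same corpus path and 1,000,000-character slice as above:

import timeit

import nltk
import spacy

# Assumes NLTK's 'punkt' tokenizer models are installed: nltk.download('punkt')
nlp = spacy.load("en_core_web_sm",
                 disable=['parser', 'ner', 'tok2vec', 'tagger',
                          'attribute_ruler', 'lemmatizer'])
nlp.max_length = 4532554

with open("/Users/harikaranharithas/Downloads/data/en/en.txt", "r") as f:
    text = f.read()[0:1000000]

# timeit.timeit returns the total time in seconds for `number` calls.
nltk_total = timeit.timeit(lambda: nltk.tokenize.word_tokenize(text), number=7)
spacy_total = timeit.timeit(lambda: nlp(text), number=7)

print(f"NLTK : {nltk_total / 7:.3f} s per call")
print(f"spaCy: {spacy_total / 7:.3f} s per call")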
Passing the documents to `nlp` might be doing more than just tokenizing. Can you try to explicitly use only the tokenizer?
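A minimal sketch of that suggestion, reusing the `nlp` object and `BIGDATA_R` string from the question: calling `nlp.tokenizer` directly runs tokenization alone, so timing it isolates the tokenizer from whatever pipeline components remain enabled.

# nlp.tokenizer skips all pipeline components and returns a Doc
# containing just the tokens.
doc = nlp.tokenizer(BIGDATA_R[0:1000000])
print(len(doc))  # token count, to sanity-check the output

# Then benchmark it the same way as before (in IPython):
# Tw = %timeit -o nlp.tokenizer(BIGDATA_R[0:1000000])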