Real-time text processing using Python
Real-time text processing using Python. For example, consider this sentence:
I am going to schol today
I want to do the following (in real time):
1) tokenize
2) check spellings
3) stem (nltk.PorterStemmer())
4) lemmatize (nltk.WordNetLemmatizer())
Currently I am using the NLTK library to do these operations, but it's not real-time (it takes a few seconds to complete them). I am processing one sentence at a time. Is it possible to make this more efficient?
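For reference, here is a minimal sketch of such a pipeline (my own illustration, not the asker's actual code), with the heavyweight NLTK objects built once at startup rather than once per sentence:

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Build the heavyweight objects once, at startup, not per sentence.
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    def process(sentence):
        tokens = nltk.word_tokenize(sentence)               # 1) tokenize
        # 2) spell checking would go here (e.g. a per-token dictionary lookup)
        stems = [stemmer.stem(t) for t in tokens]           # 3) stem
        lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # 4) lemmatize
        return tokens, stems, lemmas

    print(process("I am going to schol today"))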
Update:
Profiling:
    Fri Jul  8 17:59:32 2011    srj.profile

             105503 function calls (101919 primitive calls) in 1.743 CPU seconds

       Ordered by: internal time
       List reduced from 1797 to 10 due to restriction

        ncalls      tottime  percall  cumtime  percall  filename:lineno(function)
          7450        0.136    0.000    0.208    0.000  sre_parse.py:182(__next)
       602/179        0.130    0.000    0.583    0.003  sre_parse.py:379(_parse)
       23467/22658    0.122    0.000    0.130    0.000  {len}
       1158/142       0.092    0.000    0.313    0.002  sre_compile.py:32(_compile)
       16152          0.081    0.000    0.081    0.000  {method 'append' of 'list' objects}
       6365           0.070    0.000    0.249    0.000  sre_parse.py:201(get)
       4947           0.058    0.000    0.086    0.000  sre_parse.py:130(__getitem__)
       1641/639       0.039    0.000    0.055    0.000  sre_parse.py:140(getwidth)
       457            0.035    0.000    0.103    0.000  sre_compile.py:207(_optimize_charset)
       6512           0.034    0.000    0.034    0.000  {isinstance}
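Nearly all of that time is inside sre_parse / sre_compile, i.e. regular-expression compilation, which suggests patterns are being recompiled on every call. A sketch of the usual fix, compiling once outside the hot path (the pattern itself is hypothetical, for illustration only):

    import re

    # Compiled once at module load; recompiling per call is exactly the
    # sre_parse/sre_compile work that dominates the profile above.
    TOKEN_RE = re.compile(r"\w+")  # hypothetical pattern

    def fast_tokenize(sentence):
        return TOKEN_RE.findall(sentence)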
timeit:
    t = timeit.Timer(main)
    print t.timeit(1000)
    => 3.7256231308
3 Answers
NLTK's WordNetLemmatizer uses a lazily-loaded WordNetCorpusReader (via a LazyCorpusLoader). The first call to lemmatize() may take significantly longer than later calls if it triggers the corpus loading. You could place a dummy call to lemmatize() to trigger the loading when your application starts up.
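Such a warm-up might look like this (the word passed is arbitrary; any call that reaches WordNet will do):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # Dummy call at startup: forces the LazyCorpusLoader to read the WordNet
    # corpus now, so the first real request doesn't pay the loading cost.
    lemmatizer.lemmatize("dogs")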
I know NLTK is slow, but I can hardly believe it's that slow. In any case, first stemming, then lemmatizing is a bad idea, since these operations serve the same purpose and feeding the output from a stemmer to a lemmatizer is bound to give worse results than just lemmatizing. So skip the stemmer for an increase in both performance and accuracy.
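To illustrate the point, here is a sketch of the standard Porter/WordNet behaviour (expected outputs shown as comments):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    stemmer.stem("ponies")                        # 'poni' -- not a real word
    lemmatizer.lemmatize(stemmer.stem("ponies"))  # 'poni' -- WordNet cannot recover it
    lemmatizer.lemmatize("ponies")                # 'pony' -- lemmatizing alone is correct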
No way is it that slow. I bet what's happening is that the tools and data for stemming etc. are being loaded. As said earlier, run a few tests: 1 sentence, 10 sentences, 100 sentences.
Alternatively, the Stanford parser can do the same things and, being Java-based, might be a bit quicker (as might LingPipe), but NLTK is far more user-friendly.
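A quick harness for the suggested test might look like this (process() is a stand-in for the question's per-sentence pipeline):

    import timeit

    sentences = ["I am going to schol today"] * 100

    def process(sentence):
        # Stand-in for the real tokenize/spell-check/stem/lemmatize pipeline.
        return sentence.split()

    def run(n):
        for s in sentences[:n]:
            process(s)

    # If time grows linearly with n, the cost is per sentence; if the first
    # call dominates, it is one-off loading of tools and data.
    for n in (1, 10, 100):
        print(n, timeit.timeit(lambda: run(n), number=10))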