Real-time text processing using Python
Real-time text processing using Python. For example, consider this sentence:
I am going to schol today
I want to do the following (in real time):
1) tokenize
2) check spellings
3) stem (nltk.PorterStemmer())
4) lemmatize (nltk.WordNetLemmatizer())
Currently I am using the NLTK library to do these operations, but it's not real-time (it takes a few seconds to complete them). I am processing one sentence at a time. Is it possible to make this more efficient?
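For reference, here is a minimal sketch of such a pipeline (my own illustration, not the asker's actual code), with the heavyweight NLTK objects built once at startup rather than once per sentence:

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Build the heavyweight objects once, at startup, not per sentence.
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    def process(sentence):
        tokens = nltk.word_tokenize(sentence)               # 1) tokenize
        # 2) spell checking would go here (e.g. a per-token dictionary lookup)
        stems = [stemmer.stem(t) for t in tokens]           # 3) stem
        lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # 4) lemmatize
        return tokens, stems, lemmas

    print(process("I am going to schol today"))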
Update:
Profiling:
    Fri Jul  8 17:59:32 2011    srj.profile

             105503 function calls (101919 primitive calls) in 1.743 CPU seconds

       Ordered by: internal time
       List reduced from 1797 to 10 due to restriction

        ncalls      tottime  percall  cumtime  percall  filename:lineno(function)
          7450        0.136    0.000    0.208    0.000  sre_parse.py:182(__next)
       602/179        0.130    0.000    0.583    0.003  sre_parse.py:379(_parse)
       23467/22658    0.122    0.000    0.130    0.000  {len}
       1158/142       0.092    0.000    0.313    0.002  sre_compile.py:32(_compile)
       16152          0.081    0.000    0.081    0.000  {method 'append' of 'list' objects}
       6365           0.070    0.000    0.249    0.000  sre_parse.py:201(get)
       4947           0.058    0.000    0.086    0.000  sre_parse.py:130(__getitem__)
       1641/639       0.039    0.000    0.055    0.000  sre_parse.py:140(getwidth)
       457            0.035    0.000    0.103    0.000  sre_compile.py:207(_optimize_charset)
       6512           0.034    0.000    0.034    0.000  {isinstance}
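Nearly all of that time is inside sre_parse / sre_compile, i.e. regular-expression compilation, which suggests patterns are being recompiled on every call. A sketch of the usual fix, compiling once outside the hot path (the pattern itself is hypothetical, for illustration only):

    import re

    # Compiled once at module load; recompiling per call is exactly the
    # sre_parse/sre_compile work that dominates the profile above.
    TOKEN_RE = re.compile(r"\w+")  # hypothetical pattern

    def fast_tokenize(sentence):
        return TOKEN_RE.findall(sentence)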
timeit:
    t = timeit.Timer(main)
    print t.timeit(1000)
    => 3.7256231308
3 Answers
NLTK's WordNetLemmatizer uses a lazily-loaded WordNetCorpusReader (via a LazyCorpusLoader). The first call to lemmatize() may take significantly longer than later calls if it triggers the corpus loading. You could place a dummy call to lemmatize() to trigger the loading when your application starts up.
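Such a warm-up might look like this (the word passed is arbitrary; any call that reaches WordNet will do):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # Dummy call at startup: forces the LazyCorpusLoader to read the WordNet
    # corpus now, so the first real request doesn't pay the loading cost.
    lemmatizer.lemmatize("dogs")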
I know NLTK is slow, but I can hardly believe it's that slow. In any case, first stemming, then lemmatizing is a bad idea, since these operations serve the same purpose and feeding the output from a stemmer to a lemmatizer is bound to give worse results than just lemmatizing. So skip the stemmer for an increase in both performance and accuracy.
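To illustrate the point, here is a sketch of the standard Porter/WordNet behaviour (expected outputs shown as comments):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    stemmer.stem("ponies")                        # 'poni' -- not a real word
    lemmatizer.lemmatize(stemmer.stem("ponies"))  # 'poni' -- WordNet cannot recover it
    lemmatizer.lemmatize("ponies")                # 'pony' -- lemmatizing alone is correct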
No way is it that slow. I bet what's happening is that the tools and data for stemming etc. are being loaded. As said earlier, run a few tests: 1 sentence, 10 sentences, 100 sentences.
Alternatively, the Stanford parser can do the same things and, being Java-based, might be a bit quicker (as might LingPipe), but NLTK is far more user-friendly.
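A quick harness for the suggested test might look like this (process() is a stand-in for the question's per-sentence pipeline):

    import timeit

    sentences = ["I am going to schol today"] * 100

    def process(sentence):
        # Stand-in for the real tokenize/spell-check/stem/lemmatize pipeline.
        return sentence.split()

    def run(n):
        for s in sentences[:n]:
            process(s)

    # If time grows linearly with n, the cost is per sentence; if the first
    # call dominates, it is one-off loading of tools and data.
    for n in (1, 10, 100):
        print(n, timeit.timeit(lambda: run(n), number=10))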