Python and the tf-idf algorithm: how to make it faster?
I am implementing the tf-idf algorithm in a web application using Python, however it runs extremely slow. What I basically do is:
1) Create 2 dictionaries:
- First dictionary: key (document id), value (list of all found words (incl. repeated) in doc)
- Second dictionary: key (document id), value (set containing the unique words of the doc)
Now, a user requests the tf-idf results for document d. What I do is (a code sketch of these steps follows below):
2) Loop over the unique words of the second dictionary for the document d, and for each unique word w get:
2.1) tf score (how many times w appears in d: loop over the list of words in the first dictionary for that document)
2.2) df score (how many docs contain w: loop over the word sets of all documents (second dictionary) and check whether each contains w). I am using sets since checking whether a set contains a word seems faster than doing so with a list.
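A minimal sketch of the structures and loops described above (the names and the exact weighting formula are illustrative, not the actual application code):

```python
import math

# First dictionary: document id -> list of all words found in the doc (incl. repeats)
doc_words = {
    "d1": ["apple", "banana", "apple"],
    "d2": ["banana", "cherry"],
}
# Second dictionary: document id -> set of the doc's unique words
doc_vocab = {doc_id: set(words) for doc_id, words in doc_words.items()}

def tfidf_for_doc(d):
    n_docs = len(doc_words)
    scores = {}
    for w in doc_vocab[d]:                                  # step 2: each unique word of d
        tf = doc_words[d].count(w)                          # step 2.1: scan d's word list
        df = sum(1 for s in doc_vocab.values() if w in s)   # step 2.2: scan every doc's set
        scores[w] = tf * math.log(n_docs / df)              # illustrative tf-idf weighting
    return scores
```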
Step 2.2 is terribly slow. For example, with 1000 documents, a document with 2313 unique words takes around 5 minutes to output its results.
Is there any other way to make step 2.2 faster? Are dictionaries that slow for iterating?
2 Answers
Well, you have to rethink and redesign the way you hold your data; in other words, implement an "orthodox" version of an "inverted index".
Your bottleneck is the "on-the-fly" calculation of the document frequency (DF) of the terms. It is a clever idea to keep this dynamic: every time you update your corpus (your collection of documents), do some processing and update the DF of every term in the affected documents (and, of course, save the results in a persistent way, e.g. in a database).
The only structure you need is a nested dictionary, something like the sketch below, properly updated every time you "feed" your corpus.
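A minimal sketch, assuming a plain in-memory dict that maps each term to the documents containing it (the names are illustrative):

```python
from collections import Counter, defaultdict

# term -> {doc_id: occurrences of the term in that doc}
# index[w][d]   gives the TF of w in d,
# len(index[w]) gives the DF of w across the corpus.
index = defaultdict(dict)
corpus_size = 0  # corpus cardinality, needed for the idf factor

def feed(doc_id, words):
    """Update the index whenever a document is added to the corpus."""
    global corpus_size
    corpus_size += 1
    for w, count in Counter(words).items():
        index[w][doc_id] = count
```

With an index like this, both TF and DF become constant-time lookups at query time instead of scans over the whole corpus.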
And, of course, keep your corpus cardinality (the total number of documents) stored somewhere as well...
As a hobby and as part of my work, I am implementing a small Python/Redis-backed search engine. You might get some other ideas from it as well. Take a look here.
Is this an academic endeavor or are you doing it for production? If you're implementing for production, why not use something already available (e.g. http://code.google.com/p/tfidf/)? On the other hand, if you're doing it as an academic exercise, I might still take a gander at an existing implementation to see what it does differently (if anything at all).
I'd also suggest using cProfile to profile your code to see where the expense is.
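For example, a quick way to run cProfile on the scoring call (tfidf_for_doc here is just a placeholder for whatever function computes the scores):

```python
import cProfile
import pstats

# Profile a single tf-idf request and write the stats to a file.
cProfile.run("tfidf_for_doc('d1')", "tfidf.prof")

# Show the 10 most expensive calls, sorted by cumulative time.
pstats.Stats("tfidf.prof").sort_stats("cumulative").print_stats(10)
```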