Python and the tf-idf algorithm: how to make it faster?
I am implementing the tf-idf algorithm in a web application using Python, however it runs extremely slow. What I basically do is:
1) Create 2 dictionaries:
- First dictionary: key (document id), value (list of all found words (incl. repeated) in doc)
- Second dictionary: key (document id), value (set containing the unique words of the doc)
Now, a user requests the tf-idf results for document d. What I do is (a code sketch of these steps follows below):
2) Loop over the unique words of the second dictionary for the document d, and for each unique word w get:
2.1) tf score (how many times w appears in d: loop over the list of words in the first dictionary for that document)
2.2) df score (how many docs contain w: loop over the word sets of all documents (second dictionary) and check whether each contains w). I am using sets since checking whether a set contains a word seems faster than doing so with a list.
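A minimal sketch of the structures and loops described above (the names and the exact weighting formula are illustrative, not the actual application code):

```python
import math

# First dictionary: document id -> list of all words found in the doc (incl. repeats)
doc_words = {
    "d1": ["apple", "banana", "apple"],
    "d2": ["banana", "cherry"],
}
# Second dictionary: document id -> set of the doc's unique words
doc_vocab = {doc_id: set(words) for doc_id, words in doc_words.items()}

def tfidf_for_doc(d):
    n_docs = len(doc_words)
    scores = {}
    for w in doc_vocab[d]:                                  # step 2: each unique word of d
        tf = doc_words[d].count(w)                          # step 2.1: scan d's word list
        df = sum(1 for s in doc_vocab.values() if w in s)   # step 2.2: scan every doc's set
        scores[w] = tf * math.log(n_docs / df)              # illustrative tf-idf weighting
    return scores
```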
Step 2.2 is terribly slow. For example, with 1000 documents, a document with 2313 unique words takes around 5 minutes to output its results.
Is there any other way to make step 2.2 faster? Are dictionaries that slow for iterating?
2 Answers
Well, you have to rethink and redesign the way you hold your data; in other words, implement an "orthodox" version of an "inverted index".
Your bottleneck is the "on-the-fly" calculation of the document frequency (DF) of the terms. It is a clever idea to keep this dynamic: every time you update your corpus (your collection of documents), do some processing and update the DF of every term in the affected documents (and, of course, save the results in a persistent way, e.g. in a database).
The only structure you need is a nested dictionary, something like the sketch below, properly updated every time you "feed" your corpus.
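A minimal sketch, assuming a plain in-memory dict that maps each term to the documents containing it (the names are illustrative):

```python
from collections import Counter, defaultdict

# term -> {doc_id: occurrences of the term in that doc}
# index[w][d]   gives the TF of w in d,
# len(index[w]) gives the DF of w across the corpus.
index = defaultdict(dict)
corpus_size = 0  # corpus cardinality, needed for the idf factor

def feed(doc_id, words):
    """Update the index whenever a document is added to the corpus."""
    global corpus_size
    corpus_size += 1
    for w, count in Counter(words).items():
        index[w][doc_id] = count
```

With an index like this, both TF and DF become constant-time lookups at query time instead of scans over the whole corpus.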
And, of course, keep your corpus cardinality (the total number of documents) stored somewhere as well...
As a hobby and as part of my work, I am implementing a small Python/Redis-backed search engine. You might get some other ideas from it as well. Take a look here.
Is this an academic endeavor or are you doing it for production? If you're implementing for production, why not use something already available (e.g. http://code.google.com/p/tfidf/)? On the other hand, if you're doing it as an academic exercise, I might still take a gander at an existing implementation to see what it does differently (if anything at all).
I'd also suggest using cProfile to profile your code to see where the expense is.
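For example, a quick way to run cProfile on the scoring call (tfidf_for_doc here is just a placeholder for whatever function computes the scores):

```python
import cProfile
import pstats

# Profile a single tf-idf request and write the stats to a file.
cProfile.run("tfidf_for_doc('d1')", "tfidf.prof")

# Show the 10 most expensive calls, sorted by cumulative time.
pstats.Stats("tfidf.prof").sort_stats("cumulative").print_stats(10)
```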