Adding documents to a scored TF-IDF collection?
I have a large collection of documents that already have their TF-IDF computed. I'm getting ready to add some more documents to the collection, and I am wondering if there is a way to add TF-IDF scores to the new documents without re-processing the entire database?
Basically there are two options:
Compute your tf-idf scores only when you need them. Adding a new document is then trivial: all you have to do is update the total number of documents, update the number of documents in which each token occurs, and store the token occurrence vector for the new document (see the first sketch below).
Periodically recalculate your tf-idf vectors, perhaps after adding 100K documents or so. In between, just work with the old global values (the total number of documents and the number of documents each token occurs in); a sketch of this batching approach follows the next paragraph.
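For the first option, here is a minimal sketch in Python, assuming a plain in-memory index; the class and method names (TfIdfIndex, add_document, tfidf) are made up for illustration and are not part of the answer. Adding a document only touches the raw counts, and the tf-idf vector is derived from them on demand:

```python
import math
from collections import Counter, defaultdict

class TfIdfIndex:
    def __init__(self):
        self.num_docs = 0                    # total number of documents
        self.doc_freq = defaultdict(int)     # token -> number of documents containing it
        self.term_counts = {}                # doc_id -> Counter of token occurrences

    def add_document(self, doc_id, tokens):
        """Adding a document only updates counts; nothing global is recomputed."""
        counts = Counter(tokens)
        self.term_counts[doc_id] = counts
        self.num_docs += 1
        for token in counts:                 # each distinct token in the new document
            self.doc_freq[token] += 1

    def tfidf(self, doc_id):
        """Compute the tf-idf vector for one document on demand."""
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        return {
            token: (count / total) * math.log(self.num_docs / self.doc_freq[token])
            for token, count in counts.items()
        }

index = TfIdfIndex()
index.add_document("doc1", ["the", "cat", "sat"])
index.add_document("doc2", ["the", "dog", "barked"])
print(index.tfidf("doc2"))                   # idf of "the" is 0 because it occurs in both documents
```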
If your collection is really large, you'll probably want to take the second approach, because new documents won't change the global distribution of words much anyway. That said, it's better to test both methods and settle for the one that fits your problem best.
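For the second option, a similar sketch could score everything against a frozen snapshot of the global statistics and refresh that snapshot only after a batch of additions. The REBUILD_EVERY threshold, the df = 1 fallback for tokens unseen at snapshot time, and all names here are assumptions made for illustration:

```python
import math
from collections import Counter, defaultdict

REBUILD_EVERY = 100_000                     # illustrative rebuild threshold

class BatchedTfIdfIndex:
    def __init__(self):
        self.num_docs = 0                   # live count of all documents
        self.doc_freq = defaultdict(int)    # live token -> document frequency
        self.term_counts = {}               # doc_id -> Counter of token occurrences
        self.frozen_num_docs = 0            # snapshot used for scoring between rebuilds
        self.frozen_doc_freq = {}
        self.pending = 0                    # documents added since the last rebuild

    def add_document(self, doc_id, tokens):
        counts = Counter(tokens)
        self.term_counts[doc_id] = counts
        self.num_docs += 1
        for token in counts:
            self.doc_freq[token] += 1
        self.pending += 1
        if self.pending >= REBUILD_EVERY:
            self.rebuild()

    def rebuild(self):
        """Refresh the frozen snapshot; cached tf-idf vectors would be rescored here."""
        self.frozen_num_docs = self.num_docs
        self.frozen_doc_freq = dict(self.doc_freq)
        self.pending = 0

    def tfidf(self, doc_id):
        """Score against the frozen stats; tokens unseen at snapshot time fall back to df = 1."""
        counts = self.term_counts[doc_id]
        total = sum(counts.values())
        n = max(self.frozen_num_docs, 1)
        return {
            token: (count / total) * math.log(n / self.frozen_doc_freq.get(token, 1))
            for token, count in counts.items()
        }

index = BatchedTfIdfIndex()
index.add_document("doc1", ["the", "cat", "sat"])
index.add_document("doc2", ["the", "cat", "slept"])
index.rebuild()                             # force a snapshot for the demo
index.add_document("doc3", ["the", "dog", "barked"])
print(index.tfidf("doc3"))                  # scored with the (slightly stale) snapshot
```

The df = 1 fallback for tokens that were not in the snapshot is just one smoothing choice; the point of this approach is that scores are deliberately allowed to be slightly stale between rebuilds.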