Very fast document similarity

Posted 2024-09-01 10:41:32

I am trying to determine document similarity between a single document and each of a large number of documents (n ~= 1 million) as quickly as possible. More specifically, the documents I'm comparing are e-mails; they are grouped (i.e., there are folders or tags) and I'd like to determine which group is most appropriate for a new e-mail. Fast performance is critical.

My a priori assumption is that the cosine similarity between term vectors is appropriate for this application; please comment on whether this is a good measure to use or not!
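For concreteness, a minimal sketch of cosine similarity over sparse term-frequency vectors, assuming whitespace tokenization and raw term counts (no IDF weighting; both choices are simplifications, not part of the question):

```python
import math
from collections import Counter

def term_vector(text):
    """Sparse term-frequency vector: {term: count}."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors."""
    if len(a) > len(b):          # iterate over the smaller vector
        a, b = b, a
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```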

I have already considered the following ways to speed things up:

  1. Pre-normalize all the term vectors

  2. Calculate a term vector for each group (n ~= 10,000) rather than each e-mail (n ~= 1,000,000); this would probably be acceptable for my application, but if you can think of a reason not to do it, let me know! (A sketch combining both ideas follows this list.)
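A minimal sketch combining both ideas, assuming the same sparse dict vectors as above: pre-normalizing means scoring reduces to a sparse dot product, and each group is represented by the normalized sum of its e-mails' term counts (the helper names are illustrative, not from the question):

```python
import math
from collections import Counter

def normalize(vec):
    """Scale a sparse vector to unit length so cosine reduces to a dot product."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def group_vector(emails):
    """Represent a group by the normalized sum of its e-mails' term counts."""
    total = Counter()
    for text in emails:
        total.update(text.lower().split())
    return normalize(total)

def best_group(query_text, group_vectors):
    """Score a new e-mail against every group's pre-normalized vector."""
    q = normalize(Counter(query_text.lower().split()))
    def dot(vec):
        return sum(w * vec.get(t, 0.0) for t, w in q.items())
    return max(group_vectors, key=lambda name: dot(group_vectors[name]))
```

Here `group_vectors` would be `{group_name: group_vector(emails_in_group)}` built once up front; with ~10,000 groups that is 10,000 sparse dot products per query instead of a million.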

I have a few questions:

  1. If a new e-mail has a new term never before seen in any of the previous e-mails, does that mean I need to re-compute all of my term vectors? This seems expensive.

  2. Is there some clever way to only consider vectors which are likely to be close to the query document? (A candidate-pruning sketch follows this list.)

  3. Is there some way to be more frugal about the amount of memory I'm using for all these vectors?
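Not something the question specifies, but to make question 2 concrete: because cosine over sparse term vectors is nonzero only when two documents share at least one term, an inverted index from terms to document ids prunes the candidate set cheaply. A minimal sketch under that assumption (all names here are illustrative):

```python
from collections import defaultdict

def build_index(doc_vectors):
    """Invert {doc_id: sparse_vector} into {term: set of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, vec in doc_vectors.items():
        for term in vec:
            index[term].add(doc_id)
    return index

def candidates(query_vec, index):
    """Only documents sharing a term with the query can have nonzero cosine."""
    ids = set()
    for term in query_vec:
        ids |= index.get(term, set())
    return ids
```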

Thanks!

Comments (1)

清风疏影 2024-09-08 10:41:32

Use Bayesian filtering. The link provided refers to spam filtering, but you can adapt the algorithm pretty easily to multiple categories/tags.

There are lots of good SO questions about Bayesian filtering, too.
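A minimal sketch of that adaptation, assuming a multinomial naive Bayes classifier with add-one (Laplace) smoothing over the folders; the training-data format and function names are illustrative, not from the answer:

```python
import math
from collections import Counter, defaultdict

def train(labeled_emails):
    """labeled_emails: iterable of (folder, text) pairs. Returns the counts naive Bayes needs."""
    folder_docs = Counter()              # number of e-mails per folder
    folder_terms = defaultdict(Counter)  # term counts per folder
    vocab = set()
    for folder, text in labeled_emails:
        tokens = text.lower().split()
        folder_docs[folder] += 1
        folder_terms[folder].update(tokens)
        vocab.update(tokens)
    return folder_docs, folder_terms, vocab

def classify(text, folder_docs, folder_terms, vocab):
    """Return the folder maximizing log P(folder) + sum over tokens of log P(token | folder)."""
    total = sum(folder_docs.values())
    tokens = text.lower().split()
    best, best_score = None, float("-inf")
    for folder, doc_count in folder_docs.items():
        terms = folder_terms[folder]
        denom = sum(terms.values()) + len(vocab)   # add-one smoothing denominator
        score = math.log(doc_count / total)
        score += sum(math.log((terms[t] + 1) / denom) for t in tokens)
        if score > best_score:
            best, best_score = folder, score
    return best
```

Terms never seen in training simply fall through to the smoothing term, so a new e-mail with new vocabulary does not force any retraining.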
