How is Gmail's "consider including" feature implemented?

Posted 2025-01-05 08:52:17 · 735 words · 2 views · 0 comments

I would like to do something similar to Gmail's "consider including" suggestions on my blog, but with tags.

I was thinking of storing tag sets using three tables.

And I thought of the following algorithm:

// a blog post is published
// it has the tags "A", "B" & "C":
if the tag set "A,B,C" doesn't exist
   create it with "number of times used" = 1
else
   add 1 to "number of times used"

And, to suggest tags:

// a blog post is being written
// the author includes the tags "A" and "C"
// which tags should I suggest?
find all the tag sets that contain "A" and "C"
  among them, find the one with the highest "number of times used"
    suggest the tags of that set that are not already picked (i.e. everything except A & C)
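
For illustration, the two pseudocode blocks above could be sketched in Python roughly as follows. This is a minimal in-memory sketch, not a database implementation: the names `tag_set_counts`, `record_post` and `suggest` are made up here, and only sets with at least one tag beyond the picked ones are considered (an assumption, so that there is always something to suggest).

```python
from collections import Counter

# Hypothetical in-memory store: tag set -> "number of times used".
tag_set_counts = Counter()

def record_post(tags):
    """A post is published: create its tag set, or add 1 to its count."""
    tag_set_counts[frozenset(tags)] += 1

def suggest(picked):
    """Return the unpicked tags of the most-used set containing all picked tags."""
    picked = frozenset(picked)
    # Assumption: require a strict superset so the suggestion is never empty.
    candidates = [(count, tag_set)
                  for tag_set, count in tag_set_counts.items()
                  if picked < tag_set]
    if not candidates:
        return set()
    _, best = max(candidates, key=lambda c: c[0])
    return set(best - picked)

record_post(["A", "B", "C"])
record_post(["A", "B", "C"])
record_post(["A", "C", "D"])
print(suggest(["A", "C"]))  # {'B'}
```

Note that this scans every stored set on each suggestion, which is exactly the part the question below asks about speeding up.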

Is there a better/smarter way of accomplishing this task? What about the database model? Can I optimize it so that searches like "sets that contain A & C" won't be too slow?
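
To make the database question concrete, here is one possible three-table layout, sketched with Python's built-in `sqlite3` module. The table and column names (`tags`, `tag_sets`, `tag_set_members`, `times_used`) are assumptions for illustration and not necessarily the layout meant above. The "sets that contain A & C" search becomes a `GROUP BY ... HAVING COUNT(*)` query, and the membership table is keyed with `tag_id` first so that lookups by tag can use the primary-key index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE tag_sets (id INTEGER PRIMARY KEY,
                           times_used INTEGER NOT NULL DEFAULT 1);
    CREATE TABLE tag_set_members (
        set_id INTEGER NOT NULL REFERENCES tag_sets(id),
        tag_id INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (tag_id, set_id)  -- tag_id first, for lookups by tag
    );
""")

def add_set(names, times_used):
    """Insert one tag set with a given usage count (helper for the demo)."""
    set_id = conn.execute("INSERT INTO tag_sets (times_used) VALUES (?)",
                          (times_used,)).lastrowid
    for name in names:
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        tag_id = conn.execute("SELECT id FROM tags WHERE name = ?",
                              (name,)).fetchone()[0]
        conn.execute("INSERT INTO tag_set_members VALUES (?, ?)",
                     (set_id, tag_id))

def sets_containing(names):
    """Ids of sets containing every given (distinct) tag, most used first."""
    marks = ",".join("?" * len(names))
    rows = conn.execute(f"""
        SELECT s.id FROM tag_sets s
        JOIN tag_set_members m ON m.set_id = s.id
        JOIN tags t ON t.id = m.tag_id
        WHERE t.name IN ({marks})
        GROUP BY s.id
        HAVING COUNT(*) = ?
        ORDER BY s.times_used DESC, s.id
    """, (*names, len(names)))
    return [r[0] for r in rows]

add_set(["A", "B", "C"], 5)
add_set(["A", "C", "D"], 2)
add_set(["A", "B"], 9)
print(sets_containing(["A", "C"]))  # [1, 2]
```

Whether this is fast enough depends on the data size and the database engine; it is a starting point to measure, not a guarantee.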


Comments (2)

我的鱼塘能养鲲 2025-01-12 08:52:17

Search model issues:

Your model seems a bit too simplified to me: very frequent tags are likely to always be the suggested ones, even when other tags are more closely related to the pair A,C.

You should probably consider the tf-idf model, which gives a boost to rare terms if they are also connected to the "query" [here the query is A and C]: if a rare term is commonly used with A and C, it is probably closely related to them.

The idea is simple: if a tag is frequently used with A and C, give it a boost. [tf]

Also, if a tag is rare [low total number of uses of this tag], give it a boost. [idf]

The "score" for each tag will be the combined tf-idf score.
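
As a rough sketch of that scoring, assuming posts are available as plain lists of tags: here tf is taken as the number of posts where a tag appears together with all query tags, and idf as `log(N / total uses)`. This is just one of several common tf-idf formulations, and `suggest_tfidf` is a made-up name.

```python
import math
from collections import Counter

def suggest_tfidf(posts, query, top_n=3):
    """Rank candidate tags by (co-occurrence with query) * log(N / total uses)."""
    query = set(query)
    n_posts = len(posts)
    # Total number of posts using each tag (drives the idf part).
    total_uses = Counter(tag for tags in posts for tag in set(tags))
    # Number of posts where the tag appears together with all query tags (tf part).
    co_uses = Counter(tag for tags in posts if query <= set(tags)
                      for tag in set(tags) - query)
    scores = {tag: co * math.log(n_posts / total_uses[tag])
              for tag, co in co_uses.items()}
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

posts = [
    ["A", "C", "python"], ["A", "C", "python"],
    ["A", "C", "misc"], ["A", "C", "misc"], ["A", "C", "misc"],
    ["B", "misc"], ["D", "misc"], ["E", "misc"], ["misc"],
]
print(suggest_tfidf(posts, ["A", "C"]))  # ['python', 'misc']
```

In this toy data "misc" co-occurs with A and C more often than "python" does, but it is used almost everywhere, so the rarer "python" ranks first.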

Performance issues:

You might also consider creating an inverted index for this task, to speed up searches.

If you are using Java, Apache Lucene is a mature library that can help you with it.
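
For readers not on Java, the idea of an inverted index can be sketched in a few lines of plain Python. This is only an illustration of the data structure (tag -> ids of posts using it), not how Lucene implements it, and the function names are made up.

```python
from collections import defaultdict

def build_index(posts):
    """Map each tag to the set of post ids that use it."""
    index = defaultdict(set)
    for post_id, tags in posts.items():
        for tag in tags:
            index[tag].add(post_id)
    return index

def posts_with_all(index, tags):
    """Ids of posts containing every given tag, via set intersection."""
    postings = [index.get(tag, set()) for tag in tags]
    if not postings:
        return set()
    return set.intersection(*postings)

posts = {1: ["A", "B", "C"], 2: ["A", "C", "D"], 3: ["A", "B"]}
index = build_index(posts)
print(posts_with_all(index, ["A", "C"]))  # {1, 2}
```

A search for "posts containing A and C" then only touches the posts listed under A and under C, instead of scanning everything.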

梦魇绽荼蘼 2025-01-12 08:52:17

I think this is a typical association mining and recommendation problem. You can try googling the Apriori algorithm for data mining, and use it to make a top-N recommendation.

Your solution will work, but it is not comprehensive in my opinion. For example, the sets "A,B" and "A,B,C" are not independent sets.
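
A compact sketch of what that could look like, assuming each post's tags form one transaction. `apriori` below is a simplified level-wise version of the algorithm (absolute support counts, no optimizations), and `recommend` ranks candidate tags by confidence, i.e. support(picked + tag) / support(picked). Both names and the example data are made up for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, built level by level."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent, k = {}, 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, k - 1))
                 and support(c) >= min_support}
    return frequent

def recommend(frequent, picked, top_n=3):
    """Top-N tags t, ranked by support(picked + t) / support(picked)."""
    picked = frozenset(picked)
    base = frequent.get(picked)
    if not base:
        return []
    scored = [(count / base, next(iter(s - picked)))
              for s, count in frequent.items()
              if picked < s and len(s) == len(picked) + 1]
    scored.sort(key=lambda x: (-x[0], x[1]))
    return [tag for _, tag in scored[:top_n]]

posts = [["A", "B", "C"], ["A", "B", "C"], ["A", "C", "D"], ["A", "B"], ["B", "D"]]
frequent = apriori(posts, min_support=2)
print(recommend(frequent, ["A", "C"]))  # ['B']
```

Because the support of "A,B" here counts every post containing A and B, including those tagged "A,B,C", overlapping sets are no longer treated as unrelated.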
