Spatial mapping of keywords/tags

Posted on 2024-07-13 07:07:07

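I am trying to understand strategies or ideas for building a spatial map of related/common keywords or tags. To use SO as an example: if you go to https://stackoverflow.com/tags and type in "python", you get all the tags that contain that word, but none of the tags that might be closely related (WSGI, Google's App Engine, flying, etc.).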

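So, to my question: how does one build a spatial map that can be queried to find the tags/keywords closely related to a search term, ordered by their weight? And how does one store the weight of tag foo against a potentially large number of other tags while still keeping the system responsive?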

I have already watched David Weinberger's Google Tech Talk, an excellent talk that got me thinking about this. http://video.google.com/videoplay?docid=2159021324062223592&ei=qseASZvgI6e4qAP91a2PDg&q=google+tech+talk


梦途 2024-07-20 07:07:07

Have a look at the clustering concepts in O'Reilly's "Programming Collective Intelligence".
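To make the clustering idea a bit more concrete, here is a minimal sketch. It is not code from the book: it assumes a simple threshold-based single-link grouping over Jaccard similarity, and the function names, the `threshold` value, and the sample data are all invented for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of items carrying either tag that carry both tags."""
    return len(a & b) / len(a | b)


def cluster_tags(tagged_items, threshold=0.5):
    """Group tags whose item sets overlap strongly (single-link, assumed threshold)."""
    # Map each tag to the set of item indices that carry it.
    items_by_tag = {}
    for index, tags in enumerate(tagged_items):
        for tag in tags:
            items_by_tag.setdefault(tag, set()).add(index)

    # Start with one cluster per tag, then merge clusters linked by a similar pair.
    cluster_of = {tag: {tag} for tag in items_by_tag}
    for a, b in combinations(sorted(items_by_tag), 2):
        similar = jaccard(items_by_tag[a], items_by_tag[b]) >= threshold
        if similar and cluster_of[a] is not cluster_of[b]:
            merged = cluster_of[a] | cluster_of[b]
            for tag in merged:
                cluster_of[tag] = merged

    # Several tags point at the same cluster, so deduplicate before returning.
    unique = {frozenset(cluster) for cluster in cluster_of.values()}
    return sorted(sorted(cluster) for cluster in unique)


# Hypothetical sample data: each set holds the tags of one question.
items = [
    {"python", "wsgi"},
    {"python", "wsgi", "google-app-engine"},
    {"python", "google-app-engine"},
    {"c++", "stl"},
    {"c++", "stl"},
    {"c++"},
]
clusters = cluster_tags(items)
print(clusters)
```

With this sample data the tags fall into two groups, one around "c++" and one around "python". A real tag set would probably need a tuned threshold or a proper hierarchical method.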


仙气飘飘 2024-07-20 07:07:07

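It seems the most likely way to build data about such relationships is to pick out the tags that show up together most often while showing up with the fewest other tags.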

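That is, "c++" and "stl" show up together a lot, and "stl" rarely(?) shows up without "c++", so they are related (at least in one direction). "c++" and "algorithm" also show up together a lot, but they show up apart even more often, so they are not as related.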
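One way to read the "at least in one direction" part is as a conditional frequency: of all the items tagged a, what share is also tagged b? The sketch below works under that assumption. The function name and the sample questions are made up for illustration.

```python
from collections import Counter
from itertools import permutations


def directional_relatedness(tagged_items):
    """Return {(a, b): share of items tagged a that are also tagged b}."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in tagged_items:
        unique = set(tags)
        tag_counts.update(unique)
        # Ordered pairs, so (a, b) and (b, a) are counted separately.
        pair_counts.update(permutations(unique, 2))
    return {(a, b): n / tag_counts[a] for (a, b), n in pair_counts.items()}


# Hypothetical sample data: each set holds the tags of one question.
questions = [
    {"c++", "stl"},
    {"c++", "stl", "algorithm"},
    {"c++", "templates"},
    {"c++", "algorithm"},
    {"python", "algorithm"},
    {"java", "algorithm"},
    {"c++"},
]
scores = directional_relatedness(questions)
print(scores[("stl", "c++")])        # share of "stl" items that also carry "c++"
print(scores[("c++", "stl")])        # share of "c++" items that also carry "stl"
print(scores[("algorithm", "c++")])  # share of "algorithm" items that also carry "c++"
```

In this sample, "stl" never appears without "c++", so the score from "stl" to "c++" is as high as it can be, while the score in the other direction is much lower. "algorithm" is spread across several languages, so its tie to "c++" is weaker.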


一曲爱恨情仇 2024-07-20 07:07:07

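In thinking about how the data could be structured, one idea I had is a four-table system. One table holds the source data (e.g. with SO there has to be some sort of question table); it is joined to a tag table through a linking table, and a tag weight table then joins back to the tag table.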

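    -- SQL sketch of the four tables (originally pseudo code)
    CREATE TABLE source (
        id          INTEGER PRIMARY KEY,
        source_data TEXT
    );

    -- links each source row to its tags
    CREATE TABLE source_tag (
        source_id INTEGER,  -- joins to source.id
        tag_id    INTEGER   -- joins to tag.id
    );

    CREATE TABLE tag (
        id  INTEGER PRIMARY KEY,
        tag VARCHAR(30)
    );

    -- weight of one tag relative to another
    CREATE TABLE tag_weight (
        base_tag_id   INTEGER,  -- joins to tag.id
        weight        REAL,     -- float (0-10 or 0-100), or an int count of mutual occurrences
        source_tag_id INTEGER   -- joins back to tag.id
    );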

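I have no idea how efficient this structure is, but I suppose it's something to work from. To keep it working, new additions to the source data could fire an after-update trigger, or a worker process in the background could rebalance the weights at preset times.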
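As a rough sketch of the background-worker variant, the following uses Python's built-in sqlite3 module with an in-memory database. It assumes the weight is a plain count of mutual occurrences. The helper names (`add_source`, `rebuild_weights`, `related_tags`), the composite primary key on tag_weight, and the sample rows are my own additions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE source     (id INTEGER PRIMARY KEY, source_data TEXT);
CREATE TABLE tag        (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
CREATE TABLE source_tag (source_id INTEGER, tag_id INTEGER);
CREATE TABLE tag_weight (base_tag_id INTEGER, source_tag_id INTEGER, weight INTEGER,
                         PRIMARY KEY (base_tag_id, source_tag_id));
"""


def add_source(conn, text, tags):
    """Insert one source row and link it to its tags, creating tags as needed."""
    with conn:
        cursor = conn.execute("INSERT INTO source (source_data) VALUES (?)", (text,))
        source_id = cursor.lastrowid
        for name in set(tags):
            conn.execute("INSERT OR IGNORE INTO tag (tag) VALUES (?)", (name,))
            tag_id = conn.execute("SELECT id FROM tag WHERE tag = ?", (name,)).fetchone()[0]
            conn.execute("INSERT INTO source_tag VALUES (?, ?)", (source_id, tag_id))


def rebuild_weights(conn):
    """What a periodic worker might run: recount every co-occurring tag pair."""
    with conn:
        conn.execute("DELETE FROM tag_weight")
        conn.execute("""
            INSERT INTO tag_weight (base_tag_id, source_tag_id, weight)
            SELECT a.tag_id, b.tag_id, COUNT(*)
            FROM source_tag AS a
            JOIN source_tag AS b
              ON a.source_id = b.source_id AND a.tag_id <> b.tag_id
            GROUP BY a.tag_id, b.tag_id
        """)


def related_tags(conn, name, limit=10):
    """Read-only lookup: tags related to `name`, heaviest first."""
    return conn.execute("""
        SELECT t.tag, w.weight
        FROM tag_weight AS w
        JOIN tag AS base ON base.id = w.base_tag_id
        JOIN tag AS t    ON t.id = w.source_tag_id
        WHERE base.tag = ?
        ORDER BY w.weight DESC, t.tag
        LIMIT ?
    """, (name, limit)).fetchall()


# Hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_source(conn, "q1", ["python", "wsgi"])
add_source(conn, "q2", ["python", "wsgi", "google-app-engine"])
add_source(conn, "q3", ["python", "google-app-engine"])
add_source(conn, "q4", ["python", "wsgi"])
rebuild_weights(conn)
print(related_tags(conn, "python"))
```

The point of precomputing tag_weight is that a search only reads one keyed range of that table, so query time should not depend on how many source rows exist. The cost moves to the rebuild step, which a real system would likely want to make incremental.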
