Spatial mapping of keywords/tags

Posted on 2024-07-13 07:07:07

I'm trying to understand the strategies or ideas for building spatial maps of related/common keywords or tags. Using SO as an example: if you go to https://stackoverflow.com/tags and type in "python", you get every tag that contains that word, but none of the tags that might be closely related (WSGI, Google's App Engine, flying, etc.).

In line with my question, how could you build a spatial map that can be queried to find tags/keywords closely related to the search term, ordered by their weight? And how would you store, say, tag foo's weights against a potentially large number of other tags while still keeping the system responsive?

I've already watched this Google Tech Talk by David Weinberger, which is a great talk that got me thinking:
http://video.google.com/videoplay?docid=2159021324062223592&ei=qseASZvgI6e4qAP91a2PDg&q=google+tech+talk

Comments (4)

梦途 2024-07-20 07:07:07

Check out the clustering concepts in O'Reilly's "Programming Collective Intelligence".

仙气飘飘 2024-07-20 07:07:07

The most likely way to build data about such relationships seems to be to catalog which tags appear together most often while appearing together with the fewest other tags.

That is, "c++" and "stl" appear together a lot, and "stl" rarely(?) appears without "c++", so they are related (in at least one direction). "c++" and "algorithm" also appear together a lot, but they appear apart even more often, so they are not related.
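One way to make the "related in at least one direction" idea concrete is a directional score: the share of items carrying tag A that also carry tag B. The sketch below is only an illustration under that assumption; the function names, the sample questions, and the 0.6 threshold are all made up for the example.

```python
from collections import Counter
from itertools import combinations


def directional_scores(tagged_items):
    """Map (base, other) to the fraction of base's items that also carry other."""
    tag_count = Counter()
    pair_count = Counter()
    for tags in tagged_items:
        unique = sorted(set(tags))
        tag_count.update(unique)
        pair_count.update(combinations(unique, 2))
    scores = {}
    for (a, b), together in pair_count.items():
        scores[(a, b)] = together / tag_count[a]  # how much a implies b
        scores[(b, a)] = together / tag_count[b]  # how much b implies a
    return scores


def related(scores, tag, threshold=0.6):
    """Tags that `tag` implies, strongest first (threshold is an arbitrary choice)."""
    hits = [(other, score) for (base, other), score in scores.items()
            if base == tag and score >= threshold]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical sample data: one list of tags per question.
questions = [
    ["c++", "stl"],
    ["c++", "stl", "algorithm"],
    ["c++", "algorithm"],
    ["c++", "templates"],
    ["python", "algorithm"],
    ["java", "algorithm"],
    ["c++"],
]
scores = directional_scores(questions)
print(related(scores, "stl"))  # "stl" strongly implies "c++"
print(related(scores, "c++"))  # "c++" implies nothing above the threshold
```

With this data "stl" points at "c++" but not the other way round, and "c++" and "algorithm" stay below the threshold in both directions because each appears without the other more often than with it.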

一曲爱恨情仇 2024-07-20 07:07:07

In thinking about how the data could be structured, one idea I had is a four-table system. One table holds the source data (e.g. with SO there has to be some sort of question table); it is joined through a source-tag table to a tag table, and a tag weight table joins back to the tag table.

    # pseudo code
    source table {
        id: int
        source_data: text
    }

    source_tag table {
        source_id: int
        tag_id: int
    }

    tag table {
        id: int
        tag: String(30)
    }

    tag_weight table {
        base_tag_id: int
        weight: float (0-10 or 100) or int (count of mutual occurrence)
        source_tag_id: int
    }

I have no idea how efficient this structure is, but I suppose it's something to work on. To keep the weights current, new entries in the source data could fire off an after-update trigger, or a background worker process could rebalance the weights at preset times.
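As a rough illustration of the four-table idea, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the integer variant of weight (count of mutual occurrence) and the "worker rebalances periodically" approach; the helper names add_source, rebalance and related are invented for the example, and the composite primary keys are an added assumption that is not part of the pseudocode.

```python
import sqlite3

SCHEMA = """
CREATE TABLE source     (id INTEGER PRIMARY KEY, source_data TEXT);
CREATE TABLE tag        (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
CREATE TABLE source_tag (source_id INTEGER REFERENCES source(id),
                         tag_id    INTEGER REFERENCES tag(id),
                         PRIMARY KEY (source_id, tag_id));
CREATE TABLE tag_weight (base_tag_id   INTEGER REFERENCES tag(id),
                         source_tag_id INTEGER REFERENCES tag(id),
                         weight        INTEGER NOT NULL,
                         PRIMARY KEY (base_tag_id, source_tag_id));
"""


def add_source(conn, text, tags):
    """Store one source row and link it to its tags, creating tags as needed."""
    with conn:
        source_id = conn.execute(
            "INSERT INTO source (source_data) VALUES (?)", (text,)).lastrowid
        for name in set(tags):
            conn.execute("INSERT OR IGNORE INTO tag (tag) VALUES (?)", (name,))
            tag_id = conn.execute(
                "SELECT id FROM tag WHERE tag = ?", (name,)).fetchone()[0]
            conn.execute("INSERT INTO source_tag VALUES (?, ?)", (source_id, tag_id))
    return source_id


def rebalance(conn):
    """Recompute each pair's weight as the number of sources carrying both tags."""
    with conn:
        conn.execute("DELETE FROM tag_weight")
        conn.execute("""
            INSERT INTO tag_weight (base_tag_id, source_tag_id, weight)
            SELECT a.tag_id, b.tag_id, COUNT(*)
            FROM source_tag AS a
            JOIN source_tag AS b
              ON a.source_id = b.source_id AND a.tag_id <> b.tag_id
            GROUP BY a.tag_id, b.tag_id
        """)


def related(conn, name, limit=10):
    """Tags that co-occur with `name`, heaviest first."""
    return conn.execute("""
        SELECT other.tag, w.weight
        FROM tag_weight AS w
        JOIN tag AS base  ON base.id  = w.base_tag_id
        JOIN tag AS other ON other.id = w.source_tag_id
        WHERE base.tag = ?
        ORDER BY w.weight DESC, other.tag
        LIMIT ?
    """, (name, limit)).fetchall()
```

A usage pass would be: open a connection with sqlite3.connect(":memory:"), run conn.executescript(SCHEMA), call add_source for each question, call rebalance once, then query related(conn, "python"). Because the lookup only reads precomputed rows keyed by base_tag_id, query cost depends on how many neighbours a tag has rather than on the size of the source table.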

_畞蕅 2024-07-20 07:07:07

You need a good search engine. ;)

Do it yourself: implement one of the similarity algorithms, for example Levenshtein distance or Dice's coefficient.

Or use something ready to use like Lucene.
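For reference, the two do-it-yourself measures named above can be sketched in a few lines of plain Python. This is a minimal illustration, not a tuned implementation; the Dice variant shown here works on sets of character bigrams, which is one common choice and an assumption on my part. Note that both compare spelling, so they find tags that look alike rather than tags that are used together.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute if different
            ))
        previous = current
    return previous[-1]


def dice(a, b):
    """Dice's coefficient over the sets of character bigrams of a and b."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        # Strings shorter than two characters have no bigrams to compare.
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(levenshtein("kitten", "sitting"))
print(dice("night", "nacht"))
```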
