Tag hierarchies and how to handle them
This is a real issue that applies to tagging items in general (yes, it applies to StackOverflow too, and no, this is not a question about StackOverflow).
Tagging helps cluster similar items, whatever they may be (jokes, blog posts, SO questions, etc.). However, there is usually (though not strictly) a hierarchy of tags, meaning that some tags imply others. To use a familiar example, the "c#" SO tag also implies ".net"; as another example, in a jokes database a "blondes" tag implies the "derisive" tag, as do "irish", "belge", "canadian", etc., depending on the joke's country of origin.
How have you handled this in your projects, if at all? I will supply an answer describing two methods I have used in two separate cases (actually the same mechanism, implemented in two different environments), but I am interested not only in similar mechanisms but also in your opinion on the hierarchy issue.
Comments (3)
This is a tough question. The two extremes are an ontology (everything is hierarchical) and a folksonomy (tags have no hierarchy). I have answered this on WikiAnswers, with a reference to Clay Shirky's "Ontology is Overrated" article, which argues that you should not impose any hierarchy.
Actually, I would say that it is not so much a hierarchical system as a semantic net with perceived distances between the meanings of tags. What do I mean? Mathematics is closer to experimental physics than to gardening.
One way to build such a net: form pairs of tags and let people judge the perceived distance (using a scale like 1-10, meaning something like [synonyms, alike, ..., antonyms]), and when searching, search for all tags within a certain distance.
Does the distance have to be the same when coming from the opposite direction ([a,b] close -> [b,a] close)? And is proximity transitive, so that [a,b] close and [b,c] close -> [a,c] close?
Maybe the first word by default triggers a different semantic field. If you start at "social worker", "analyst" is near. If you start at "programmer", "analyst" is near as well. But starting at either of these points, you probably would not count the other as near ("social worker" is by no means close to "programmer").
You would therefore rely only on pairs that were judged directly, with each pair judged in both directions (in random order).
Example for selection of similar tags:
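One way this selection might look, as a minimal sketch: the distance values and the names `DISTANCES` and `similar_tags` below are made up for illustration. Distances are stored per ordered pair, so the two directions may differ, and only directly judged pairs are used (no transitive inference).

```python
# Hypothetical judged distances on a 1-10 scale (1 = near-synonym, 10 = opposite).
# Keys are ordered pairs, so (a, b) and (b, a) may carry different values.
DISTANCES = {
    ("social worker", "analyst"): 3,
    ("analyst", "social worker"): 4,
    ("programmer", "analyst"): 2,
    ("analyst", "programmer"): 3,
    ("social worker", "programmer"): 8,
    ("programmer", "social worker"): 8,
    ("mathematics", "experimental physics"): 3,
    ("mathematics", "gardening"): 9,
}


def similar_tags(start, max_distance, distances=DISTANCES):
    """Return tags judged to be within max_distance of start, nearest first.

    Only directly judged pairs are consulted; closeness is not chained
    through intermediate tags.
    """
    hits = [
        (distance, other)
        for (tag, other), distance in distances.items()
        if tag == start and distance <= max_distance
    ]
    return [other for _, other in sorted(hits)]
```

With these made-up numbers, searching from "social worker" or from "programmer" with a maximum distance of 4 finds "analyst" in both cases, but neither search finds the other starting tag.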
The mechanism I implemented does not use the given tags directly, but goes through an indirect lookup table (not in the strict DBMS sense) which links each tag to its implied tags (obviously, a tag must also be linked to itself for this to work).
In a Python project, the lookup table is a dictionary keyed on tags, whose values are sets of tags (tags being plain strings).
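A minimal sketch of such a dictionary, assuming the example tags from the question; the names `IMPLIED`, `expand` and `matches` are hypothetical, not taken from the original project.

```python
# Lookup table: each tag maps to the set of tags it implies, itself included.
IMPLIED = {
    "c#": {"c#", ".net"},
    ".net": {".net"},
    "blondes": {"blondes", "derisive"},
    "irish": {"irish", "derisive"},
    "derisive": {"derisive"},
}


def expand(tags, implied=IMPLIED):
    """Return the given tags together with every tag they imply."""
    expanded = set()
    for tag in tags:
        # Assumption: a tag missing from the table implies only itself.
        expanded |= implied.get(tag, {tag})
    return expanded


def matches(item_tags, query_tag, implied=IMPLIED):
    """True if an item carrying item_tags should be found under query_tag."""
    return query_tag in expand(item_tags, implied)
```

For example, `matches(["blondes"], "derisive")` is true even though the item was never tagged "derisive" explicitly.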
In a database project (it does not matter which RDBMS engine it was), there were the following tables:
where trlValue was a value in the interval (0, 1], used to give a weight ("gravity") to each linked tag; a tag's relation to itself always carries 1.0 in trlValue, while the rest are calculated algorithmically (exactly how is not important). Think of the example jokes database I gave: a ['blonde', 'derisive', 0.5] record would correlate with a ['pondian', 'derisive', 0.5] record, and therefore all derisive jokes could be suggested given any one of them.
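A rough sketch of how such a weighted relation table might be queried, using Python's built-in `sqlite3` module. Only the column name trlValue comes from the description above; the table names `tag_relation` and `joke_tag`, the other column names, and the scoring (product of the two weights) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag_relation (          -- hypothetical table name
    trlTag     TEXT NOT NULL,        -- hypothetical column: the given tag
    trlImplied TEXT NOT NULL,        -- hypothetical column: the implied tag
    trlValue   REAL NOT NULL CHECK (trlValue > 0 AND trlValue <= 1),
    PRIMARY KEY (trlTag, trlImplied)
);
CREATE TABLE joke_tag (              -- hypothetical table linking jokes to tags
    jokeId INTEGER NOT NULL,
    tag    TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO tag_relation VALUES (?, ?, ?)", [
    ("blonde", "blonde", 1.0),       # self relation always carries 1.0
    ("blonde", "derisive", 0.5),
    ("pondian", "pondian", 1.0),
    ("pondian", "derisive", 0.5),
])
conn.executemany("INSERT INTO joke_tag VALUES (?, ?)",
                 [(1, "blonde"), (2, "pondian"), (3, "pun")])


def related_jokes(joke_id):
    """Return (jokeId, score) for other jokes sharing an implied tag, best first."""
    return conn.execute("""
        SELECT other.jokeId,
               SUM(mine_rel.trlValue * other_rel.trlValue) AS score
        FROM joke_tag AS mine
        JOIN tag_relation AS mine_rel  ON mine_rel.trlTag = mine.tag
        JOIN tag_relation AS other_rel ON other_rel.trlImplied = mine_rel.trlImplied
        JOIN joke_tag AS other         ON other.tag = other_rel.trlTag
        WHERE mine.jokeId = ? AND other.jokeId <> ?
        GROUP BY other.jokeId
        ORDER BY score DESC
    """, (joke_id, joke_id)).fetchall()
```

Here the 'blonde' joke and the 'pondian' joke are linked only through the shared 'derisive' tag, so each is suggested for the other, while the 'pun' joke has no relation rows and is never suggested.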