Tag hierarchies and handling

Posted on 2024-07-06 02:58:51

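This is a real issue that applies to tagging items in general (and yes, it applies to StackOverflow too, and no, it is not a question about StackOverflow).

Tagging helps cluster similar items, whatever those items are (jokes, blog posts, questions, etc.). However, there is typically (though not strictly) a hierarchy of tags, meaning that some tags imply other tags. To use a familiar example, the "c#" SO tag also implies ".net". As another example, in a jokes database a "blonde" tag implies the "derisive" tag, as do "irish", "belgian", "canadian" and so on, depending on the joke's country of origin.

How have you handled this in your projects, if at all? I will supply an answer describing two different methods I have used in two separate cases (actually the same mechanism, implemented in two different environments). I am interested not only in similar mechanisms, but also in your opinion on the hierarchy issue.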

兮子 2024-07-13 02:58:51

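This is a tough question. The two extremes are an ontology (everything is hierarchical) and a folksonomy (tags have no hierarchy). I answered this on WikiAnswers, with a reference to Clay Shirky's article "Ontology is Overrated", which claims that you should not set up any hierarchy.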

叹梦 2024-07-13 02:58:51

Actually, I would say that it is not so much a hierarchical system as a semantic net with perceived distances between the meanings of tags. What I mean is: mathematics is closer to experimental physics than to gardening.

One way to build such a net: form pairs of tags and let people judge the perceived distance (using a scale like 1-10, meaning something like [synonyms, alike, ..., antonyms]), and when searching, search for all tags within a certain distance.

Does a measure have to give the same distance when coming from the opposite direction ([a,b] close -> [b,a] close)? Or is proximity transitive ([a,b] close and [b,c] close -> [a,c] close)?

Maybe the first word will by default trigger a different semantic field? If you start at "social worker", "analyst" is near. If you start at "programmer", "analyst" is near as well. But starting at either of these points, you probably would not count the other as near ("social worker" is by no means close to "programmer").

You would therefore only have judged pairs, with each pair judged in both directions (in random order).

[TagRelations]
tagId integer
closeTagId integer
proximity integer

Example for selection of similar tags:

select closeTagId from TagRelations where tagId = :tagID and proximity < 3
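
A minimal Python sketch of the table and query above, using the standard-library `sqlite3` module. The sample rows, the tag ids and the helper names `build_db` and `close_tags` are assumptions made up for illustration; only the table layout and the query shape come from the answer. Each direction of a pair is stored as its own row, so the relation need not be symmetric or transitive.

```python
import sqlite3

# Hypothetical judgments: (tagId, closeTagId, proximity), lower means closer.
# Assumed ids: 1 = social worker, 2 = analyst, 3 = programmer.
# Pairs are directional, so (a, b) and (b, a) are stored and judged separately.
SAMPLE_RELATIONS = [
    (1, 2, 2),  # social worker -> analyst
    (3, 2, 2),  # programmer -> analyst
    (2, 1, 4),  # analyst -> social worker
    (2, 3, 2),  # analyst -> programmer
    (1, 3, 9),  # social worker -> programmer
    (3, 1, 9),  # programmer -> social worker
]


def build_db():
    """Create an in-memory TagRelations table filled with the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE TagRelations ("
        " tagId INTEGER NOT NULL,"
        " closeTagId INTEGER NOT NULL,"
        " proximity INTEGER NOT NULL,"
        " PRIMARY KEY (tagId, closeTagId))"
    )
    conn.executemany("INSERT INTO TagRelations VALUES (?, ?, ?)", SAMPLE_RELATIONS)
    return conn


def close_tags(conn, tag_id, max_proximity=3):
    """Return ids of tags judged closer than max_proximity, starting from tag_id."""
    rows = conn.execute(
        "SELECT closeTagId FROM TagRelations"
        " WHERE tagId = :tagID AND proximity < :maxProximity"
        " ORDER BY closeTagId",
        {"tagID": tag_id, "maxProximity": max_proximity},
    )
    return [row[0] for row in rows]
```

With this sample data, "analyst" is close to both "social worker" and "programmer", while those two are not close to each other, which is the situation described above.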

你的背包 2024-07-13 02:58:51

The mechanism I implemented does not use the given tags directly, but goes through an indirect lookup table (not strictly in the DBMS sense) that links a tag to many implied tags (obviously, a tag is linked with itself for this to work).

In a Python project, the lookup table is a dictionary keyed on tags, whose values are sets of tags (where tags are plain strings).
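
A rough sketch of what such a dictionary might look like. The entries and the helper names `expand_tags` and `share_implied_tag` are hypothetical, chosen to match the examples from the question; they are not the original project's code.

```python
# Hypothetical lookup table: each tag maps to the set of tags it implies,
# and every tag is linked with itself.
IMPLIED_TAGS = {
    "c#": {"c#", ".net"},
    ".net": {".net"},
    "blonde": {"blonde", "derisive"},
    "irish": {"irish", "derisive"},
    "derisive": {"derisive"},
}


def expand_tags(tags):
    """Return the given tags together with every tag they imply."""
    expanded = set()
    for tag in tags:
        # Assumption: a tag missing from the table implies only itself.
        expanded |= IMPLIED_TAGS.get(tag, {tag})
    return expanded


def share_implied_tag(tags_a, tags_b):
    """True if two items overlap once their tags have been expanded."""
    return bool(expand_tags(tags_a) & expand_tags(tags_b))
```

For example, `share_implied_tag(["blonde"], ["irish"])` is true because both expand to include "derisive", even though the items share no literal tag.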

In a database project (it does not matter which RDBMS engine it was), there were the following tables:

[Tags]
tagID integer primary key
tagName text

[TagRelations]
tagID integer # first part of two-field key
tagID_parent integer # second part of key
trlValue float

where trlValue is a value in the interval (0, 1], used to give a weight to each linked tag. A self-to-self tag relation always carries 1.0 in trlValue, while the rest are calculated algorithmically (exactly how is not important). Think of the jokes database example I gave: a ['blonde', 'derisive', 0.5] record would correlate with a ['pondian', 'derisive', 0.5] record, so all derisive jokes can be suggested given any one of them.
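
To make the correlation concrete, here is a small sketch using Python's standard-library `sqlite3` module with the two tables above. The sample rows, the helper names and, in particular, the scoring rule (multiplying the two trlValue weights along a shared parent tag) are assumptions for illustration; the answer does not say how the weights were combined.

```python
import sqlite3


def build_db():
    """Create in-memory Tags and TagRelations tables with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Tags (tagID INTEGER PRIMARY KEY, tagName TEXT);
        CREATE TABLE TagRelations (
            tagID INTEGER, tagID_parent INTEGER, trlValue REAL,
            PRIMARY KEY (tagID, tagID_parent));
    """)
    conn.executemany(
        "INSERT INTO Tags VALUES (?, ?)",
        [(1, "blonde"), (2, "pondian"), (3, "derisive")],
    )
    conn.executemany(
        "INSERT INTO TagRelations VALUES (?, ?, ?)",
        [
            (1, 1, 1.0), (2, 2, 1.0), (3, 3, 1.0),  # self relations carry 1.0
            (1, 3, 0.5),  # blonde implies derisive
            (2, 3, 0.5),  # pondian implies derisive
        ],
    )
    return conn


def related_tags(conn, tag_name):
    """Return (tagName, score) for tags that share a parent with tag_name.

    Assumed scoring: the product of the two trlValue weights along the
    shared parent, keeping the best score per related tag.
    """
    rows = conn.execute("""
        SELECT other.tagName, MAX(mine.trlValue * theirs.trlValue) AS score
        FROM Tags AS me
        JOIN TagRelations AS mine ON mine.tagID = me.tagID
        JOIN TagRelations AS theirs ON theirs.tagID_parent = mine.tagID_parent
        JOIN Tags AS other ON other.tagID = theirs.tagID
        WHERE me.tagName = ? AND other.tagID <> me.tagID
        GROUP BY other.tagID, other.tagName
        ORDER BY score DESC, other.tagName
    """, (tag_name,))
    return rows.fetchall()
```

Starting from "blonde", this sketch reaches "derisive" directly and "pondian" through the shared "derisive" parent, with a lower score for the indirect link.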
