Efficiently choosing a title (cluster center) for a cluster of strings

Posted 2024-12-05 06:56:18

I have (imperfectly) clustered string data, where the items in one cluster might look like this:

[ 
  Yellow ripe banana very tasty,
  Yellow ripe banana with little dots,
  Green apple with little dots,
  Green ripe banana - from the market, 
  Yellow ripe banana,
  Nice yellow ripe banana,
  Cool yellow ripe banana - my favourite,
  Yellow ripe,
  Yellow ripe
],

where the optimal title would be 'Yellow ripe banana'.

Currently, I am using a simple heuristic, implemented with SQL GROUP BY: choose the most common name, or the shortest name in case of a tie. My data contains a large number of such clusters, and they change frequently. Every time a fruit is added to or removed from a cluster, the title of that cluster has to be recalculated.

I would like to improve two things:

(1) Efficiency - e.g., compare the new fruit name to the title of the cluster only, and avoid grouping / phrase clustering of all fruit titles each time.

(2) Precision - instead of looking for the most common complete name, I would like to extract the most common phrase. The current algorithm would choose 'Yellow ripe', which occurs 2 times and is the most common complete name; however, as a phrase, 'Yellow ripe banana' is the most common in the given set.
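
For illustration, here is a minimal Python sketch of the current heuristic (most common complete name, shortest on a tie) applied to the cluster above. The function name `pick_title_by_full_name` is made up for this example, and the real implementation is an SQL GROUP BY rather than Python.

```python
from collections import Counter

CLUSTER = [
    "Yellow ripe banana very tasty",
    "Yellow ripe banana with little dots",
    "Green apple with little dots",
    "Green ripe banana - from the market",
    "Yellow ripe banana",
    "Nice yellow ripe banana",
    "Cool yellow ripe banana - my favourite",
    "Yellow ripe",
    "Yellow ripe",
]


def pick_title_by_full_name(names):
    """Most common complete name; the shortest name wins a tie."""
    counts = Counter(names)
    # Sort key: highest count first, then fewest characters.
    return min(counts, key=lambda name: (-counts[name], len(name)))


print(pick_title_by_full_name(CLUSTER))  # Yellow ripe
```

Only 'Yellow ripe' occurs more than once as a complete name, so it wins even though 'Yellow ripe banana' is the better title.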

I am thinking of using Solr + Carrot2 (I have no experience with the latter). At this point, I do not need to cluster the documents, since they are already clustered based on other parameters. I only need to choose the central phrase as the center/title of each cluster.

Any input is much appreciated, thanks!

Answer by 寄意, 2024-12-12 06:56:18

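Solr provides an analysis component called ShingleFilter that creates tokens from groups of adjacent words. If you put it in your analysis chain (i.e., apply it to incoming documents at index time) and then compute facets on the resulting field with a query restricted to the "cluster of fruits", you get a list of all distinct shingles along with their occurrence frequencies. I think you can even retrieve them sorted by frequency, and from there it should be fairly easy to derive the title you want. When you add a new fruit, its shingles are automatically included the next time the facets are computed.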
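
To see what shingling plus faceting would compute, here is a small pure-Python sketch that does not use Solr at all. It assumes lowercasing, tokenizing on word characters, and shingles of up to 3 words; those choices, and the names `shingles` and `ShingleCounts`, are assumptions made for this example. Each name contributes a shingle at most once, which mirrors the fact that Solr field facets count documents rather than occurrences.

```python
import re
from collections import Counter


def shingles(text, max_size=3):
    """Return the set of word n-grams (1..max_size words) found in text."""
    words = re.findall(r"\w+", text.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_size + 1)
        for i in range(len(words) - n + 1)
    }


class ShingleCounts:
    """Per-cluster shingle frequencies that can be updated incrementally."""

    def __init__(self, max_size=3):
        self.max_size = max_size
        self.counts = Counter()

    def add(self, name):
        self.counts.update(shingles(name, self.max_size))

    def remove(self, name):
        self.counts.subtract(shingles(name, self.max_size))
        self.counts += Counter()  # in-place add drops zero and negative counts


NAMES = [
    "Yellow ripe banana very tasty",
    "Yellow ripe banana with little dots",
    "Green apple with little dots",
    "Green ripe banana - from the market",
    "Yellow ripe banana",
    "Nice yellow ripe banana",
    "Cool yellow ripe banana - my favourite",
    "Yellow ripe",
    "Yellow ripe",
]

cluster = ShingleCounts()
for name in NAMES:
    cluster.add(name)

print(cluster.counts["yellow ripe"])         # 7
print(cluster.counts["yellow ripe banana"])  # 5
```

Adding or removing one fruit only touches that fruit's own shingles, which is the same property the Solr index gives you for free.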

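A more concrete version of this proposal:

Create two fields: fruit_shingle and cluster_id.

Configure fruit_shingle with the ShingleFilter and any other processing you might want (for example, tokenizing at word boundaries with StandardTokenizer before the ShingleFilter).

Configure cluster_id as an ID that is unique per cluster (not per document, so it should not be the schema's uniqueKey), using whatever data you use to identify the clusters.

For each new fruit, store its text in fruit_shingle and the ID of its cluster in cluster_id.

Then retrieve facets for the query "cluster_id:", and you will get a list of words, word pairs, word triplets, etc. (the shingles). I believe you can configure ShingleFilter with a maximum shingle length. Sort the facets by whatever combination of length and/or frequency you deem appropriate, and use the top one as the "title" of the fruit cluster.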
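
The retrieval and sorting step could be sketched in Python as follows. The base URL, the core name `fruits`, the cluster ID `42`, and the scoring rule (document count multiplied by number of words) are all assumptions made for this example. The request parameters are standard Solr faceting parameters, and the parser expects Solr's default flat JSON layout, found under `facet_counts.facet_fields.fruit_shingle` in the response. The sample list is hand-written from the question's cluster, not real Solr output.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, cluster_id, limit=100):
    """Build a facet request for one cluster's shingles (no documents returned)."""
    params = {
        "q": "*:*",
        "fq": f'cluster_id:"{cluster_id}"',  # restrict to a single cluster
        "rows": 0,
        "facet": "true",
        "facet.field": "fruit_shingle",
        "facet.sort": "count",
        "facet.mincount": 1,
        "facet.limit": limit,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def pick_title(flat_facets):
    """Pick a title from a flat facet list: [term, count, term, count, ...]."""
    pairs = zip(flat_facets[0::2], flat_facets[1::2])

    def score(pair):
        term, count = pair
        n_words = len(term.split())
        # Assumed ranking: count * length, with longer shingles winning ties.
        return (count * n_words, n_words)

    return max(pairs, key=score)[0]


# Hypothetical host, core name, and cluster ID.
print(facet_query_url("http://localhost:8983/solr/fruits", 42))

# Hand-written sample in the flat facet layout.
sample = ["ripe", 8, "yellow", 7, "yellow ripe", 7,
          "banana", 6, "ripe banana", 6, "yellow ripe banana", 5]
print(pick_title(sample))  # yellow ripe banana
```

With this scoring, 'yellow ripe banana' (5 x 3 = 15) narrowly beats 'yellow ripe' (7 x 2 = 14); a different weighting would be equally legitimate, and the returned title is lowercased only if the analysis chain lowercases tokens.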
