Efficiently choosing a title (cluster center) for a cluster of strings
I have (imperfectly) clustered string data, where the items in one cluster might look like this:
[
Yellow ripe banana very tasty,
Yellow ripe banana with little dots,
Green apple with little dots,
Green ripe banana - from the market,
Yellow ripe banana,
Nice yellow ripe banana,
Cool yellow ripe banana - my favourite,
Yellow ripe,
Yellow ripe
],
where the optimal title would be 'Yellow ripe banana'.
Currently, I use a simple heuristic with the help of SQL GROUP BY: choose the most common name, or the shortest one in case of a tie. My data contains a large number of such clusters, they change frequently, and every time a fruit is added to or removed from a cluster, the title of that cluster has to be recalculated.
I would like to improve two things:
(1) Efficiency - e.g., compare the new fruit name only to the title of the cluster, and avoid grouping / phrase clustering of all fruit names each time.
(2) Precision - instead of looking for the most common complete name, I would like to extract the most common phrase. The current algorithm would choose 'Yellow ripe', which occurs 2 times and is the most common complete name; as a phrase, however, 'Yellow ripe banana' is the most common in the given set.
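To make the gap in (2) concrete, here is a minimal Python stand-in for the current heuristic. It is only a sketch: the real heuristic runs as SQL GROUP BY, and the function name `baseline_title` is made up for this illustration.

```python
from collections import Counter

names = [
    "Yellow ripe banana very tasty",
    "Yellow ripe banana with little dots",
    "Green apple with little dots",
    "Green ripe banana - from the market",
    "Yellow ripe banana",
    "Nice yellow ripe banana",
    "Cool yellow ripe banana - my favourite",
    "Yellow ripe",
    "Yellow ripe",
]

def baseline_title(names):
    """Most common complete name; on a tie, the shortest name wins."""
    counts = Counter(names)
    return min(counts, key=lambda name: (-counts[name], len(name)))

print(baseline_title(names))  # Yellow ripe
```

Only complete names are counted, so 'Yellow ripe' (2 occurrences) wins even though 'Yellow ripe banana' appears inside five of the nine names.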
I am thinking of using Solr + Carrot2 (I have no experience with the latter). At this point, I do not need to cluster the documents - they are already clustered based on other parameters - I only need to choose the central phrase as the center/title of the cluster.
Any input is much appreciated, thanks!
Comments (1)
Solr provides an analysis component called ShingleFilter that creates tokens from groups of adjacent words. If you put it in your analysis chain (i.e. apply it to incoming documents when you index them) and then compute facets on the resulting field with a query restricted to the "fruit cluster", you get a list of all distinct shingles along with their occurrence counts. Facets can be returned sorted by count, so deriving the title you want from them should be straightforward. When you add a new fruit, its shingles are automatically included the next time the facets are computed.
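To show what the shingle counts look like for the sample cluster, here is a rough Python sketch that mimics the idea outside Solr. It is not Solr's implementation: the regex tokenizer, the lowercasing, and the helper names `shingles` and `count_shingles` are assumptions made for illustration.

```python
import re
from collections import Counter

def shingles(text, min_size=1, max_size=3):
    """Return the set of lowercased word n-grams (shingles) found in text."""
    words = re.findall(r"\w+", text.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(min_size, max_size + 1)
        for i in range(len(words) - n + 1)
    }

def count_shingles(names, **kwargs):
    """For each shingle, count how many names contain it."""
    counts = Counter()
    for name in names:
        counts.update(shingles(name, **kwargs))
    return counts

cluster = [
    "Yellow ripe banana very tasty",
    "Yellow ripe banana with little dots",
    "Green apple with little dots",
    "Green ripe banana - from the market",
    "Yellow ripe banana",
    "Nice yellow ripe banana",
    "Cool yellow ripe banana - my favourite",
    "Yellow ripe",
    "Yellow ripe",
]

counts = count_shingles(cluster)
print(counts["yellow ripe banana"], counts["yellow ripe"], counts["ripe"])  # 5 7 8
```

Note that the single word "ripe" has the highest raw count, so frequency alone does not pick the desired title; some weighting by shingle length is needed as well.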
A slightly more concrete version of this proposal:
Create two fields: fruit_shingle and cluster_id.
Configure fruit_shingle with the ShingleFilter and any other processing you might want (for example, tokenizing at word boundaries with StandardTokenizer before the ShingleFilter).
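Configure cluster_id as an ID field holding whatever value you use to identify the clusters. (It identifies the cluster, not the document, so many fruits share one value and it should not be the schema's uniqueKey.)
For each new fruit, store its text in fruit_shingle and the id of its cluster in cluster_id.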
Then retrieve facets for a query: "cluster_id:", and you will get a list of words, word pairs, word triplets, etc. (shingles) with their counts. The ShingleFilter can be configured with a maximum shingle length (maxShingleSize). Rank the facets by whatever combination of length and frequency you deem appropriate, and use the top one as the "title" of the fruit cluster.
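The last step could be sketched roughly as follows. The facet parameters (`facet`, `facet.field`, `facet.sort`, `facet.mincount`, `facet.limit`) are standard Solr request parameters, and the flat `[term, count, term, count, ...]` list is the default JSON layout of a facet field. Everything else is an assumption for illustration: the cluster id value 42, the helper names `facet_query` and `pick_title`, the hand-written sample response, and the count × length score, which is just one possible "combination of length and frequency".

```python
from urllib.parse import urlencode

def facet_query(cluster_id, field="fruit_shingle", limit=100):
    """Build query parameters that request only facet counts for one cluster."""
    return urlencode({
        "q": f"cluster_id:{cluster_id}",
        "rows": 0,               # no documents needed, only facet counts
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.sort": "count",   # most frequent shingles first
        "facet.mincount": 1,
        "facet.limit": limit,
    })

def pick_title(flat_facets):
    """Pick the shingle with the highest count * word-length score.

    flat_facets alternates terms and counts: [term, count, term, count, ...].
    Ties on the score go to the shingle with the higher count.
    """
    pairs = zip(flat_facets[0::2], flat_facets[1::2])
    best = max(pairs, key=lambda p: (p[1] * len(p[0].split()), p[1]))
    return best[0]

# Hypothetical facet counts for the sample cluster (hand-written, not Solr output).
sample_facets = [
    "ripe", 8, "yellow", 7, "yellow ripe", 7, "banana", 6,
    "ripe banana", 6, "yellow ripe banana", 5, "with little dots", 2,
]

print(facet_query(42))
print(pick_title(sample_facets))  # yellow ripe banana
```

With this score, "yellow ripe banana" (5 × 3 = 15) narrowly beats "yellow ripe" (7 × 2 = 14), so the weighting may need tuning on real clusters.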