聚类(尤其是字符串聚类)如何工作?
我听说过聚类来对相似的数据进行分组。我想知道它在 String 的特定情况下如何工作。
我有一个包含超过 100,000 个不同单词的表。
我想识别具有一些差异的相同单词(例如:house、house!!、hooouse、HoUse、@house、“house”等...
)。
需要什么来识别相似性并将每个单词分组到一个簇中?对此更推荐什么算法?
I heard about clustering to group similar data. I want to know how it works in the specific case for String.
I have a table with more than different 100,000 words.
I want to identify the same word with some differences (eg.: house, house!!, hooouse, HoUse, @house, "house", etc...
).
What is needed to identify the similarity and group each word in a cluster? What algorithm is more recommended for this?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
要理解什么是聚类,请想象一张地理地图。您可以看到许多不同的物体(例如房屋)。它们有的距离很近,有的距离很远。在此基础上,您可以将所有对象分成组(例如城市)。聚类算法正是做到这一点 - 它们允许您将数据分成组,而无需事先指定组边界。
所有聚类算法都基于两个对象之间的距离(或可能性)。在地理地图上是两栋房子之间的正常距离,在多维空间中可能是欧几里德距离(事实上,地图上两栋房子之间的距离也是欧几里德距离)。对于字符串比较,您必须使用不同的东西。这里有两个不错的选择:Hamming 和 编辑距离。在您的特定情况下,编辑距离(汉明距离仅适用于相同大小的字符串)。
现在您可以使用现有的一种聚类算法。它们有很多,但并非所有都能满足您的需求。例如,这里已经提到的纯 k 均值对您几乎没有帮助,因为它需要查找初始组数,并且对于大型字符串字典,它可能是 100、200、500、10000 - 您只是不知道数字。所以其他算法可能更合适。
其中之一是期望最大化算法。它的优点是可以自动找到簇的数量。然而,在实践中,它给出的结果往往不如其他算法精确,因此通常在 EM 之上使用 k-means,即首先用 EM 找到簇的数量及其中心,然后再使用使用k-means调整结果。
可能适合您的任务的另一个可能的算法分支是分层聚类。在这种情况下,聚类分析的结果不是一组独立的组,而是树(层次结构),其中几个较小的聚类被分组为一个较大的聚类,并且所有聚类最终都是一个大聚类的一部分。就您而言,这意味着所有单词在某种程度上都彼此相似。
To understand what clustering is imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far. Based on this, you can split all objects into groups (such as cities). Clustering algorithms make exactly this thing - they allow you to split your data into groups without previous specifying groups borders.
All clustering algorithms are based on the distance (or likelihood) between 2 objects. On geographical map it is normal distance between 2 houses, in multidimensional space it may be Euclidean distance (in fact, distance between 2 houses on the map also is Euclidean distance). For string comparison you have to use something different. 2 good choices here are Hamming and Levenshtein distance. In your particular case Levenshtein distance if more preferable (Hamming distance works only with the strings of same size).
Now you can use one of existing clustering algorithms. There's plenty of them, but not all can fit your needs. For example, pure k-means, already mentioned here will hardly help you since it requires initial number of groups to find, and with large dictionary of strings it may be 100, 200, 500, 10000 - you just don't know the number. So other algorithms may be more appropriate.
One of them is expectation maximization algorithm. Its advantage is that it can find number of clusters automatically. However, in practice often it gives less precise results than other algorithms, so it is normal to use k-means on top of EM, that is, first find number of clusters and their centers with EM and then use k-means to adjust the result.
Another possible branch of algorithms, that may be suitable for your task, is hierarchical clustering. The result of cluster analysis in this case in not a set of independent groups, but rather tree (hierarchy), where several smaller clusters are grouped into one bigger, and all clusters are finally part of one big cluster. In your case it means that all words are similar to each other up to some degree.
有一个名为 stringdist 的包,允许使用多个字符串进行比较不同的方法。从该页面复制粘贴:
这会给你距离。您可能不需要执行聚类分析,也许按字符串距离本身排序就足够了。我创建了一个脚本来提供基本功能此处..请随意根据需要进行改进。
There is a package called stringdist that allows for string comparison using several different methods. Copypasting from that page:
That will give you the distance. You might not need to perform a cluster analysis, perhaps sorting by the string distance itself is sufficient. I have created a script to provide the basic functionality here... feel free to improve it as needed.
您可以使用 Levenshtein 距离 等算法进行距离计算,并使用
k-means
用于聚类。做一些测试并找到每个单词的相似性阈值来决定您的组。
You can use an algorithm like the Levenshtein distance for the distance calculation and
k-means
for clustering.Do some testing and find a similarity threshold per word that will decide your groups.
您可以使用称为“亲和力传播”的聚类算法。该算法接受一个称为相似度矩阵的输入,如果您使用的是 Python,则可以通过取 Levenstein 距离的负值或 fuzzywuzzy 库中的partial_ratio 和 token_set_ratio 的调和平均值来生成该矩阵。
You can use a clustering algorithm called "Affinity Propagation". This algorithm takes in an input called similarity matrix which you can generate by taking negative of the either Levenstein distance or an harmonic mean of partial_ratio and token_set_ratio from fuzzywuzzy library if you are using Python.