Incremental clustering
Please suggest an efficient approach to incremental clustering. I am trying to group similar strings together, and comparing every string with every other one is not efficient. My idea is to check each incoming string against a cluster representative, that is, one representative pattern per cluster, so that a new string only has to be compared with that pattern. So I am looking for a starting point for representing the nearly similar strings in a cluster by one common pattern, with the highest possible accuracy. New inputs would then be compared only with the cluster representatives and added to a cluster if found similar. The number of clusters and the number of inputs are not fixed: the strings arrive as a stream and may be of any length.
I hope that was clear. I just need some terms to get me started.
Comments (1)
It sounds like the part of the problem that is giving you difficulty is finding a representative pattern to use for each cluster.
A common way to cluster strings is to treat them as vectors and use cosine similarity as the similarity measure: http://en.wikipedia.org/wiki/Cosine_distance
When the strings in a cluster are represented as vectors, I think the center of the cluster is just the sum of the normalized vectors. Because cosine similarity ignores vector length, the sum points in the same direction as the mean, so there is no need to divide by the cluster size. Use this sum as the representative to compare each new string against.
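To make that concrete, here is a rough sketch of the idea in Python. Several things in it are my own assumptions rather than part of the suggestion above: strings are turned into vectors of character trigram counts, a new cluster is opened whenever no representative reaches a similarity threshold (0.5 here, chosen arbitrarily), and the names `ngram_vector`, `cosine` and `IncrementalClusterer` are made up for illustration. Treat it as a starting point, not a tuned implementation.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Return a unit-length sparse vector (dict) of character n-gram counts."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Assigns each incoming string to the most similar cluster representative."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cut-off; tune it for your data
        self.centroids = []         # per cluster: sum of normalized member vectors
        self.members = []           # per cluster: the strings assigned to it

    def add(self, text):
        vec = ngram_vector(text)
        best, best_sim = None, self.threshold
        for idx, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            # Nothing is similar enough: the string starts a new cluster.
            self.centroids.append(dict(vec))
            self.members.append([text])
            return len(self.centroids) - 1
        # Fold the new vector into the representative (running sum).
        centroid = self.centroids[best]
        for gram, weight in vec.items():
            centroid[gram] = centroid.get(gram, 0.0) + weight
        self.members[best].append(text)
        return best
```

Feeding it "error: disk full on node 1", "error: disk full on node 2" and "user logged in" in that order puts the first two strings in one cluster and the third in a new one. Each new string is compared only against the representatives, so the cost per string grows with the number of clusters rather than with the number of strings seen so far.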