Python KMeans with the Orange framework

Posted 2024-08-20 17:26:29 · 614 characters · 7 views · 0 comments

I am planning to use orange for kmeans clustering. I have gone through the tutorials, but I still have a couple of questions which I would like to ask:

I am dealing with clustering on vectors of high dimension.
1) Is there a cosine distance implemented?
2) I do not want to give zeros to empty values. I tried not having any zeros in empty fields and am getting the error:

SystemError: 'orange.TabDelimExampleGenerator': the number of attribute types does not match the number of attributes

How do I indicate an empty value?
3) Is there a way to incorporate an "ID" into the example table? I want to label my data by an ID (NOT a classification) for easier reference. I do not want the ID column to be an official part of my data.

4) Is there a way to output differently for kmeans clustering?
I would much prefer something in this format:

cluster1: [ <id1>, <id2>, ...]
cluster2: [ <id3>, ... ]
rather than just [1, 2, 3, 1, 2, ...]

Thanks!

Comments (2)

无人问我粥可暖 2024-08-27 17:26:29

Four questions in one question is extremely awkward -- why not make a question one question? It's not as if it would cost you;-). Anyway, wrt "How do I indicate an empty value?", see the docs regarding attribute value of instances of Orange.Value:

If value is continuous or unknown, no
descriptor is needed. For the latter,
the result is a string '?', '~' or '.'
for don't know, don't care and other,
respectively.

I'm not sure if by empty you mean "don't know" or "don't care", but anyway you can indicate either. Take care about distances, however -- from this other page in the docs:

Unknown values are treated correctly
only by Euclidean and Relief distance.
For other measure of distance, a
distance between unknown and known or
between two unknown values is always
0.5.
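
To make the quoted rule concrete, here is a minimal pure-Python sketch of a Manhattan-style distance in which any pair involving an unknown contributes a fixed 0.5. This is not Orange's code: the function name, the use of `None` as the unknown marker, and the `unknown_penalty` parameter are all assumptions made for illustration, and the sketch ignores any value normalization the library may apply.

```python
UNKNOWN = None  # hypothetical marker for a missing value (not an Orange constant)


def manhattan_with_unknowns(a, b, unknown_penalty=0.5):
    """Sum of absolute differences; a pair with an unknown adds unknown_penalty."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for x, y in zip(a, b):
        if x is UNKNOWN or y is UNKNOWN:
            # Unknown vs known, or unknown vs unknown: fixed contribution.
            total += unknown_penalty
        else:
            total += abs(x - y)
    return total
```

The point to notice is that two rows with many unknowns end up at a distance driven mostly by the fixed penalty, so sparse data can look artificially similar or dissimilar.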

The distances listed in this latter page are Hamming, Maximal, Manhattan, Euclidean and Relief (the latter is like Manhattan but with correct treatment of unknown values) -- no Cosine distance provided: you'll have to code it yourself.
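
Coding it yourself could start from something like the following sketch, which computes cosine distance as one minus cosine similarity on plain Python sequences. The function name is hypothetical, and how to plug it into Orange's k-means is not shown here, since that depends on the library version.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Orthogonal vectors get distance 1 and parallel vectors get distance 0 (up to floating-point rounding), regardless of their lengths.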

For (4), with just a little Python code you can format results in any way you want. The .clusters attribute of a KMeans object is a list, exactly as long as the number of data instances, holding the cluster index of each instance. If what you want is a list of lists of data instances, for example:

import orange

def loldikm(data, **k):
  # Extra keyword arguments are passed through to KMeans unchanged.
  km = orange.KMeans(data, **k)
  results = [[] for _ in km.centroids]  # one empty list per cluster
  for i, d in zip(km.clusters, data):
    results[i].append(d)
  return results

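
For the exact "cluster: [ids]" layout asked for in the question, the same grouping idea can be applied to IDs instead of data instances. The sketch below is library-independent: it assumes you already have a list of cluster labels (such as the .clusters list described above) and a parallel list of IDs kept alongside the data. The function name and the sample values are made up for illustration.

```python
from collections import defaultdict


def group_ids_by_cluster(labels, ids):
    """Map each cluster label to the list of IDs assigned to it."""
    if len(labels) != len(ids):
        raise ValueError("labels and ids must have the same length")
    groups = defaultdict(list)
    for label, item_id in zip(labels, ids):
        groups[label].append(item_id)
    return dict(groups)


# Hypothetical sample data: five items in three clusters.
labels = [0, 1, 0, 2, 1]
ids = ["a", "b", "c", "d", "e"]
for label, members in sorted(group_ids_by_cluster(labels, ids).items()):
    print("cluster%d: %s" % (label + 1, members))
```

Keeping the IDs in a separate parallel list is one simple way to label rows without making the ID part of the clustered data.
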
打小就很酷 2024-08-27 17:26:29

I think the original KMeans is not suitable for cosine distance. Because cosine distance is not a Euclidean distance, you would need to define what a centroid means under it, and convergence is not guaranteed. But if your feature vectors are all positive, you can try. More information: Add API for user defined distance function in k-means
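
One common way to define such a centroid is to normalize every vector to unit length and use the re-normalized sum of a cluster's members as its centroid, a variant usually called spherical k-means. The sketch below is a plain-Python illustration of that variant, not Orange's KMeans and not necessarily what the comment above had in mind. All names are hypothetical, initial centroids must be supplied by the caller, and the loop simply stops when assignments no longer change or an iteration cap is reached.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def spherical_kmeans(vectors, initial_centroids, iterations=20):
    """Cluster by cosine similarity; centroids are re-normalized sums."""
    points = [normalize(v) for v in vectors]
    centroids = [normalize(c) for c in initial_centroids]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assign each point to the centroid with the highest cosine similarity
        # (a plain dot product, since everything has unit length).
        new_labels = [
            max(range(len(centroids)),
                key=lambda j: sum(x * y for x, y in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each non-empty cluster's centroid.
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                summed = [sum(col) for col in zip(*members)]
                if any(summed):
                    centroids[j] = normalize(summed)
    return labels, centroids
```

With all-positive feature vectors the summed vector of a cluster can never be zero, which is one reason the positive case is the easier one to try.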
