Best Python clustering library for product data analysis

Posted on 2024-10-17 08:38:15


绿光 2024-10-24 08:38:15

What you have is a bipartite graph. As an initial stab, it sounds like you are going to treat neighbour lists as zero-one vectors and define some kind of similarity or correlation between them. This could be a normalised Hamming distance, for example. Depending on which way you do that, you will obtain a graph on a single domain: either product codes or owners. It will shortly become clear why I've cast everything in the language of graphs, so bear with me.

Now, why do you insist on a Python implementation? Clustering large-scale data is time- and memory-consuming. To let the cat out of the bag, I have written and still maintain a graph clustering algorithm that is used quite widely in bioinformatics. It is threaded, accepts weighted graphs, and has been used for graphs with millions of nodes and close to a billion edges. Refer to http://micans.org/mcl/ for more information.

Of course, if you trawl Stack Overflow and Stack Exchange, there are quite a few threads that may be of interest to you. I would recommend the Louvain method as well, except that I am not sure whether it accepts weighted networks, which you will probably produce.
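To make the first step concrete, here is a minimal sketch of turning a bipartite owner/product relation into a weighted graph on owners using a normalised Hamming distance. The toy data, the function names (`normalised_hamming`, `project`) and the `max_distance` cut-off are all assumptions made for illustration; they are not taken from the question or from any library.

```python
from itertools import combinations

# Hypothetical toy data: owner -> set of product codes (the bipartite edges).
owns = {
    "alice": {"p1", "p2", "p3"},
    "bob": {"p1", "p2"},
    "carol": {"p4", "p5"},
    "dave": {"p3", "p4", "p5"},
}


def normalised_hamming(a, b, universe_size):
    """Fraction of positions in which two zero-one vectors differ.

    The vectors are stored as sets of 'on' positions, so the number of
    differing positions is the size of the symmetric difference.
    """
    return len(a ^ b) / universe_size


def project(neighbours, max_distance):
    """Build a weighted graph on one domain (here: owners).

    Edge weight is 1 - distance; pairs further apart than max_distance
    (an illustrative cut-off) get no edge.
    """
    universe = set().union(*neighbours.values())
    edges = {}
    for u, v in combinations(sorted(neighbours), 2):
        d = normalised_hamming(neighbours[u], neighbours[v], len(universe))
        if d <= max_distance:
            edges[(u, v)] = 1.0 - d
    return edges


edges = project(owns, max_distance=0.4)
for (u, v), w in sorted(edges.items()):
    print(u, v, round(w, 3))
```

Swapping the roles of owners and products in the input gives the graph on product codes instead. The resulting weighted edge list is the kind of input a weighted graph clustering tool would consume.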

温馨耳语 2024-10-24 08:38:15

The R language has many packages for finding groups in data, and there are Python bindings to R called RPy. R provides several of the algorithms already mentioned here and is also known for good performance on large datasets.

━╋う一瞬間旳綻放 2024-10-24 08:38:15

I don't know much about your problem domain, but PyCluster is a pretty decent clustering package that works well on large datasets:
http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm

Hope it helps.

谁对谁错谁最难过 2024-10-24 08:38:15

You can try clustering with the k-means algorithm, using the implementation available as scikits.learn.cluster.KMeans (the package has since been renamed scikit-learn, where the class is sklearn.cluster.KMeans).
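For readers who want to see what k-means does with ownership data, here is a dependency-free sketch of Lloyd's algorithm, the iteration behind k-means, applied to zero-one owner vectors. It is not the library implementation: the function names, the toy matrix and the farthest-point initialisation are assumptions chosen to keep the example small and deterministic. In practice the library class named above would replace the `kmeans` function.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Farthest-point initialisation: an assumption made for determinism,
    # not what any particular library does by default.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical owners x products matrix (1 = owner has the product).
rows = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0],
]

labels, centroids = kmeans(rows, k=2)
print(labels)
```

Note that k-means needs the number of clusters up front and works on dense vectors, which may be a poor fit for very wide, sparse ownership matrices.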

澜川若宁 2024-10-24 08:38:15

I don't know of an off-the-shelf library, sorry. There are big libraries for full-text search and similarity,
but for bit sets you'll have to roll your own (as far as I know).
A couple of suggestions anyway:

  • Bitset approach: first get, say, 10k owners x 100k products, or 100k x 10k, in memory to play with.
    You could use bitarray to make a big array of 10k x 100k bits.
    But then, what do you want to do with it?
    To find similar pairs among N objects (either owners or products),
    you have to look at all N*(N-1)/2 pairs, which is a lot;
    or there must be some structure in the data that allows early pruning or hierarchical similarity;
    or google "greedy clustering" Python (I don't see an off-the-shelf library).

  • How do you define "similarity" of owners or of products? There are lots of possibilities: number in common, ratio in common, tf-idf ...
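The two suggestions above can be sketched together as follows. This is an illustration under stated assumptions: plain Python integers stand in for the bit rows (the answer suggests bitarray, which is not used here so the example runs without third-party packages). "Ratio in common" is read as the Jaccard ratio, and the helper names, toy data and threshold are hypothetical.

```python
from itertools import combinations


def to_bitset(product_indices):
    """Pack a list of product indices into one integer used as a bit row."""
    bits = 0
    for i in product_indices:
        bits |= 1 << i
    return bits


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def common(a, b):
    """'Number in common': products owned by both."""
    return popcount(a & b)


def jaccard(a, b):
    """'Ratio in common', read here as |A and B| / |A or B|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def similar_pairs(rows, threshold):
    """Brute-force scan over all N*(N-1)/2 pairs; no pruning."""
    out = []
    for (u, a), (v, b) in combinations(sorted(rows.items()), 2):
        s = jaccard(a, b)
        if s >= threshold:
            out.append((u, v, s))
    return out


# Hypothetical toy data: owner id -> bit row over 5 products.
owners = {
    "o1": to_bitset([0, 1, 2]),
    "o2": to_bitset([0, 1]),
    "o3": to_bitset([3, 4]),
    "o4": to_bitset([2, 3, 4]),
}

pairs = similar_pairs(owners, threshold=0.5)
for u, v, s in pairs:
    print(u, v, round(s, 3))
```

With 10k owners the same loop visits about 50 million pairs, which is why the answer asks about structure that would allow pruning before committing to the all-pairs scan.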

(Added): Have you looked at Mahout's recommendation system API? Is that about what you're looking for?
This SO question
says there's no Python equivalent, which leaves two choices:
a) ask whether anyone has used Mahout from Jython,
or b) if you can't lick 'em, join 'em.
