What you have is a bipartite graph. As an initial stab, it sounds like you are going to treat neighbour lists as zero-one vectors between which you define some kind of similarity/correlation. This could be a normalised Hamming distance, for example. Depending on which way you do that, you will obtain a graph on a single domain -- either product codes or owners. It will shortly become clear why I've cast everything in the language of graphs; bear with me. Now, why do you insist on a Python implementation? Clustering large-scale data is time- and memory-consuming. To pull the cat out of the bag, I have written and still maintain a graph clustering algorithm that is used quite widely in bioinformatics. It is threaded, accepts weighted graphs, and has been used for graphs with millions of nodes and towards a billion edges. Refer to http://micans.org/mcl/ for more information. Of course, if you trawl stackoverflow and stackexchange there are quite a few threads that may be of interest to you. I would recommend the Louvain method as well, except that I am not sure whether it accepts weighted networks, which you will probably produce.
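As a rough illustration of the first step (a sketch of my own, not the answerer's code), the snippet below turns a 0/1 owner x product matrix into a weighted owner-owner graph using 1 minus the normalised Hamming distance, and writes it in the plain label ("ABC") edge-list format that mcl can read. The toy sizes, the similarity threshold, and the file name are all made up for the example.

    # Sketch only: toy sizes and an arbitrary similarity threshold.
    import numpy as np

    rng = np.random.default_rng(0)
    ownership = rng.integers(0, 2, size=(50, 200))   # 50 owners x 200 products, 0/1 flags

    threshold = 0.6                                  # keep only fairly similar pairs (assumption)
    n_owners = ownership.shape[0]

    with open("owners.abc", "w") as f:               # label format: "node1<TAB>node2<TAB>weight"
        for i in range(n_owners):
            for j in range(i + 1, n_owners):
                # similarity = 1 - normalised Hamming distance between the two 0/1 vectors
                sim = 1.0 - np.mean(ownership[i] != ownership[j])
                if sim >= threshold:
                    f.write(f"owner{i}\towner{j}\t{sim:.3f}\n")

The resulting file could then be clustered with something like "mcl owners.abc --abc -o clusters.out"; check the MCL documentation for the exact options and for tuning the inflation parameter.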
The R language has many packages for finding groups in data, and there are Python bindings to R, called RPy. R provides several of the algorithms already mentioned here and is also known for good performance on large datasets.
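For instance, a minimal sketch of calling an R clustering routine from Python. Note that it uses the newer rpy2 bindings rather than the original RPy mentioned above, and the toy data and the choice of R's built-in kmeans are my own assumptions:

    import rpy2.robjects as ro

    # Toy data: 6 points in 2-D, flattened row by row.
    values = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1, 9.0, 9.1, 8.8, 9.2]
    m = ro.r.matrix(ro.FloatVector(values), nrow=6, byrow=True)

    kmeans = ro.r['kmeans']            # call R's built-in kmeans from Python
    result = kmeans(m, centers=3)
    print(list(result.rx2('cluster'))) # cluster assignment for each row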
I think you can use Pycluster, and also change the algorithm for your problem.
I also think you should have a look at this: http://www.dennogumi.org/2007/11/data-clustering-with-python
I don't know much about your problem domain. But PyCluster is a pretty decent clustering package which works well on large datasets:
http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
Hope it helps.
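For example, a minimal sketch with made-up toy data; kcluster is PyCluster's partitioning (k-means / k-medians) routine:

    import numpy as np
    import Pycluster

    data = np.random.rand(1000, 20)          # toy data: 1000 items, 20 features
    clusterid, error, nfound = Pycluster.kcluster(
        data,
        nclusters=5,   # number of clusters
        npass=10,      # number of random restarts
        method='a',    # 'a' = arithmetic mean (k-means)
        dist='e',      # 'e' = Euclidean distance
    )
    print(clusterid[:10])                    # cluster index of the first 10 items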
You can try to do the clustering using the k-means clustering algorithm and the implementation available in scikits.learn.cluster.KMeans.
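scikits.learn has since been renamed scikit-learn; a minimal sketch with the current package name and made-up data:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 20)                         # toy data: 1000 samples, 20 features
    km = KMeans(n_clusters=5, n_init=10, random_state=0)
    labels = km.fit_predict(X)                           # cluster index for each sample
    print(labels[:10])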
I don't know of an off-the-shelf lib, sorry. There are big libs for full-text search and similarity,
but for bit sets you'll have to roll your own (as far as I know).
A couple of suggestions anyway:
bitset approach: first get say 10k owners x 100k products, or 100k x 10k, in memory, to play with.
You could use bitarray to make a big array of 10k x 100k bits.
But then, what do you want to do with it?
To find similar pairs among N objects (either owners or products),
you have to look at all N*(N-1)/2 pairs, which is a lot;
or, there must be some structure in the data that allows early pruning / hierarchical similarity;
or, google "greedy clustering" Python; I don't see an off-the-shelf lib.
How do you define "similarity" of owners / of products? There are lots of possibilities: number in common, ratio in common, tf-idf ... (a small sketch follows below).
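A minimal sketch of the bitset idea at toy scale; the sizes, the random data, and the choice of Jaccard "ratio in common" as the similarity are all assumptions for illustration:

    import random
    from bitarray import bitarray

    n_owners, n_products = 100, 1000
    random.seed(0)

    # One bitarray of product flags per owner.
    owners = []
    for _ in range(n_owners):
        bits = bitarray(n_products)
        bits.setall(False)
        for p in random.sample(range(n_products), 20):   # each toy owner gets 20 products
            bits[p] = True
        owners.append(bits)

    def jaccard(a, b):
        """Ratio of products in common ("ratio in common" similarity)."""
        inter = (a & b).count()
        union = (a | b).count()
        return inter / union if union else 0.0

    # Brute force over all N*(N-1)/2 pairs -- fine here, expensive at 10k+ owners.
    best = max(((i, j, jaccard(owners[i], owners[j]))
                for i in range(n_owners)
                for j in range(i + 1, n_owners)),
               key=lambda t: t[2])
    print("most similar owner pair:", best)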
(Added): have you looked at Mahout's recommendation system API --
is that about what you're looking for?
This SO question
says there's no Python equivalent, which leaves two choices:
a) ask if anyone has used Mahout from Jython,
or b) if you can't lick 'em, join 'em.