Association rule mining on a social network FOAF dataset
I am working on a project called "Association rule discovery from social network data: Introducing Data Mining to the Semantic Web". Can anyone suggest a good source for an algorithm, and code for it, to find association rules in a social network database? I have heard that this can be implemented in Perl and also with R packages.
A snapshot of the database is available at the following link: https://docs.google.com/uc?id=0B0mXGRdRowo1MDZlY2Q0NDYtYjlhMi00MmNjLWFiMWEtOGQ0MjA3NjUyZTE5&export=download&hl=en_US
The dataset is available at the following link: http://ebiquity.umbc.edu/get/a/resource/82.zip
I have searched a lot for material related to this project but unfortunately have not found anything useful yet. The following link is the only one I found that is somewhat related:
Criminal data: http://www.computer.org/portal/web/csdl/doi/10.1109/CSE.2009.435
Your help will be highly appreciated.
Thank you.
Comments (3)
Well, the most widely used implementations of the original Association Rules algorithm (originally developed at IBM Almaden Research Center) are Apriori and Eclat, in particular the C implementations by Christian Borgelt.
(Brief summary for anyone not familiar with Association Rules, aka "Frequent Itemsets" or "Market Basket Analysis": the prototype application for Association Rules is analyzing consumer transactions, e.g., supermarket data. Among shoppers who buy Polish sausage, what percentage also purchase black bread?)
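To make the sausage-and-bread question concrete, here is a minimal Python sketch of the two quantities that association rule miners report: support (how often an itemset occurs) and confidence (the percentage asked about above). The baskets and the names `BASKETS`, `support`, and `confidence` are made up for illustration; they do not come from any of the packages mentioned in this answer.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: each frozenset is one shopper's basket.
BASKETS: List[FrozenSet[str]] = [
    frozenset({"polish sausage", "black bread", "mustard"}),
    frozenset({"polish sausage", "black bread"}),
    frozenset({"polish sausage", "beer"}),
    frozenset({"black bread", "butter"}),
]


def support(itemset: Iterable[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of baskets that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               baskets: List[FrozenSet[str]]) -> float:
    """Among baskets containing `antecedent`, the fraction that also
    contain `consequent`."""
    antecedent = frozenset(antecedent)
    base = support(antecedent, baskets)
    if base == 0:
        raise ValueError("antecedent never occurs in the data")
    return support(antecedent | frozenset(consequent), baskets) / base
```

With this toy data, `confidence({"polish sausage"}, {"black bread"}, BASKETS)` is the share of sausage buyers who also bought black bread, i.e. the confidence of the rule "polish sausage => black bread".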
I would recommend the statistical platform R. It is free and open source, and its package repository contains (at least) four libraries directed solely to Association Rules, all with excellent documentation. Three of the four Packages include a Manual and a separate Vignette (an informal prose document with code examples). Both the Manuals and the Vignettes contain numerous examples in R code.
I have used three of the four Packages below and I can recommend those three highly. Among them are bindings for Eclat and Apriori. These libraries are distributed as R 'Packages', which are available on CRAN, R's primary Package repository. Basic installation and setup of R is trivial: there are binaries for Mac, Linux, and Windows, available from CRAN. Likewise, Package installation/integration is as simple as you would expect from an integrated platform (though not every one of the four Packages listed below has binaries for every OS).
So on CRAN, you will find these Packages, all directed solely to Association Rules:
arules
arulesNBMiner
arulesSequences
arulesViz
This set of four R Packages comprises R bindings for four different Association Rules implementations, as well as a visualization library.
The first package, arules, includes R bindings for Eclat and Apriori. The second, arulesNBMiner, provides bindings for Michael Hahsler's NB-frequent itemsets algorithm. The third, arulesSequences, provides bindings for Mohammed Zaki's cSPADE.
The last of these, arulesViz, is particularly useful because it is a visualization library for plotting the output from any of the previous three packages. For your social network study, I suspect you will find the graph visualization most useful, i.e., explicit visualization of the nodes (users in the data set) and edges (connections between them).
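For intuition about what an Apriori implementation computes, here is a minimal Python sketch of the level-wise frequent-itemset search. It is a teaching sketch only, not Borgelt's C code and not the arules API. It assumes, hypothetically, that each user in the FOAF data has been turned into one transaction (for example, the set of interests listed in that user's profile); the function name `apriori_frequent_itemsets` is invented for this example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori_frequent_itemsets(
    transactions: Iterable[Iterable[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return {itemset: support} for every itemset whose support
    (fraction of transactions containing it) is >= min_support."""
    tx: List[FrozenSet[str]] = [frozenset(t) for t in transactions]
    n = len(tx)

    def frequent_among(candidates: Set[FrozenSet[str]]) -> Dict[FrozenSet[str], float]:
        # Count how many transactions contain each candidate itemset.
        counts = {c: 0 for c in candidates}
        for t in tx:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    current = frequent_among({frozenset([item]) for t in tx for item in t})
    frequent = dict(current)
    size = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (size-1)-itemsets into size-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, size - 1))
        }
        current = frequent_among(candidates)
        frequent.update(current)
        size += 1
    return frequent
```

For example, calling it on a handful of hypothetical users such as `[{"music", "semantic web", "rdf"}, {"music", "semantic web"}, {"photography"}]` with `min_support=0.5` returns the interests, and combinations of interests, shared by at least half of those users. Rules are then derived from the frequent itemsets by computing confidence for each way of splitting an itemset into antecedent and consequent.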
This is a bit broader than http://en.wikipedia.org/wiki/Association_rule_learning but hopefully useful.
Some earlier FOAF work that might be interesting (SVD/PCA etc):
http://stderr.org/~elw/foaf/
http://www.scribd.com/doc/353326/The-Social-Semantics-of-LiveJournal-FOAF-Structure-and-Change-from-2004-to-2005
http://datamining.sztaki.hu/files/snakdd.pdf
Also, Chapter 4 of http://www.amazon.com/Understanding-Complex-Datasets-Decompositions-Knowledge/dp/1584888326 is devoted to applying matrix decomposition techniques to graph data structures; strongly recommended.
Finally, Apache Mahout is the natural choice for large-scale data mining, machine learning, etc.: https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
If you want some Java code, you can check my website for the SPMF software. It provides source code for more than 45 algorithms for frequent itemset mining, association rule mining, sequential pattern mining, etc.
Moreover, it provides more than just the most popular algorithms. It also offers many variations, such as mining rare itemsets, high-utility itemsets, uncertain itemsets, non-redundant association rules, closed association rules, indirect association rules, top-k association rules, and much more.