Comparing R and Matlab for Data Mining

Posted on 2024-10-14 11:27:45

Comments (8)

看透却不说透 2024-10-21 11:27:45

For the past three years or so, I have used R daily, and the largest portion of that daily use is spent on Machine Learning/Data Mining problems.

I was an exclusive Matlab user while at university; at the time I thought it was an excellent set of tools/platform. I am sure it is today as well.

The Neural Network Toolbox, the Optimization Toolbox, the Statistics Toolbox, and the Curve Fitting Toolbox are each highly desirable (if not essential) for someone using MATLAB for ML/Data Mining work, yet they are all separate from the base MATLAB environment; in other words, they have to be purchased separately.

My Top 5 list for Learning ML/Data Mining in R:

This refers to a couple of things. First, a group of R packages whose names all begin with arules (available from CRAN); you can find the complete list (arules, arulesViz, etc.) on the project homepage. Second, all of these packages are based on a data-mining technique known as Market-Basket Analysis or, alternatively, Association Rules. In many respects, this family of algorithms is the essence of data mining: exhaustively traverse large transaction databases and find above-average associations or correlations among the fields (variables or features) in those databases. In practice, you connect them to a data source and let them run overnight. The central R package in the set mentioned above is called arules; on the CRAN package page for arules, you will find links to a couple of excellent secondary sources (vignettes, in R's lexicon) on the arules package and on the Association Rules technique in general.
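To make "above-average associations" concrete, here is a minimal sketch of the three numbers association-rule miners usually report (support, confidence, lift), written in plain Python rather than R. The toy transactions, the function names `support` and `pair_rules`, and the `min_support` threshold are all made up for illustration; this is not the arules API, and it only scores single-item rules instead of running a real Apriori search.

```python
from itertools import combinations

# Toy transaction database (made-up data for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_rules(transactions, min_support=0.4):
    """Score every single-item rule {lhs} -> {rhs} whose pair is frequent enough."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s_ab = support({a, b}, transactions)
        if s_ab < min_support:
            continue  # pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            confidence = s_ab / support({lhs}, transactions)
            # Lift > 1 means rhs shows up with lhs more often than its base rate.
            lift = confidence / support({rhs}, transactions)
            rules.append((lhs, rhs, s_ab, confidence, lift))
    return sorted(rules, key=lambda r: r[4], reverse=True)


for lhs, rhs, s, c, l in pair_rules(transactions):
    print(f"{{{lhs}}} -> {{{rhs}}}  support={s:.2f}  confidence={c:.2f}  lift={l:.2f}")
```

A rule with lift above 1 is one of the "above-average" associations described above. A real miner also has to search larger itemsets, which is where the overnight runtimes come from.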

The most current edition of this book is available in digital form for free. Likewise, all data sets used in ESL (The Elements of Statistical Learning) are available for free download at the book's website (linked to just above). (As an aside, I have the free digital version; I also purchased the hardback version from BN.com; all of the color plots in the digital version are reproduced in the hardbound version.) ESL contains thorough introductions to at least one exemplar from most of the major ML rubrics, e.g., neural networks, SVM, KNN; unsupervised techniques (LDA, PCA, MDS, SOM, clustering); numerous flavors of regression; CART; Bayesian techniques; as well as model aggregation techniques (Boosting, Bagging) and model tuning (regularization). Finally, get the R package that accompanies the book from CRAN (which will save you the trouble of having to download and enter the data sets).

  • CRAN Task View: Machine Learning

The 3,500+ packages available for R are divided by domain into about 30 package families, or 'Task Views'. Machine Learning is one of these families. The Machine Learning Task View contains about 50 packages. Some of these packages are part of the core distribution, including e1071 (a sprawling ML package that includes working code for quite a few of the usual ML categories).

With particular focus on the posts tagged with Predictive Analytics

A thorough study of the code would, by itself, be an excellent introduction to ML in R.

And one final resource that I think is excellent, but that didn't make it into the top 5:

posted at the blog A Beautiful WWW

长伴 2024-10-21 11:27:45

Please look at the CRAN Task Views and in particular at the CRAN Task View on Machine Learning and Statistical Learning which summarises this nicely.

巷子口的你 2024-10-21 11:27:45

Both Matlab and R are good if you are doing matrix-heavy operations, because they can use highly optimized low-level code (BLAS libraries and such) for this.

However, there is more to data mining than just crunching matrices. A lot of people totally neglect the whole data organization aspect of data mining (as opposed to, say, plain machine learning).

And once you get to data organization, R and Matlab are a pain. Try implementing an R*-tree in R or Matlab to take an O(n^2) algorithm down to O(n log n) runtime. First, it goes completely against the way R and Matlab are designed (use bulk math operations wherever possible); second, it will kill your performance. Interpreted R code, for example, seems to run at around 50% of the speed of the C code (try R's built-in k-means vs. flexclust k-means), and the BLAS libraries are optimized to an insane level, exploiting cache sizes, data alignment, and advanced CPU features. If you are adventurous, try implementing a manual matrix multiplication in R or Matlab, and benchmark it against the native one.

Don't get me wrong. There is a lot of stuff for which R and Matlab are just elegant and excellent for prototyping. You can solve a lot of things in just 10 lines of code and get decent performance out of it. Writing the same thing by hand would take hundreds of lines and would probably be 10x slower. But sometimes you can optimize by a level of complexity, which for large data sets does beat the optimized matrix operations of R and Matlab.
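As a rough illustration of "optimizing by a level of complexity", here is a small Python sketch that counts pairs of points lying within `eps` of each other, first by comparing every pair and then with a sorted array plus binary search. Note the assumptions: this is one-dimensional integer data, and the sorted array is only a stand-in for a real spatial index such as an R*-tree; the function names are hypothetical.

```python
import bisect
import random


def count_close_pairs_naive(values, eps):
    """Compare every pair of values: O(n^2) distance checks."""
    n = len(values)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(values[i] - values[j]) <= eps
    )


def count_close_pairs_indexed(values, eps):
    """Sort once, then binary-search each neighborhood: O(n log n) overall."""
    ordered = sorted(values)
    total = 0
    for i, v in enumerate(ordered):
        # Index just past the last value that is <= v + eps.
        hi = bisect.bisect_right(ordered, v + eps)
        # Count only neighbors to the right of i, so each pair is counted once.
        total += hi - i - 1
    return total


rng = random.Random(0)
points = [rng.randrange(1_000_000) for _ in range(1000)]
print(count_close_pairs_naive(points, 50) == count_close_pairs_indexed(points, 50))
```

Both functions return the same count; only the amount of work differs, and that gap widens quickly as the data grows.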

If you want to scale up to "Hadoop size" in the long run, you will have to think about data layout and organization, too, unless all you need is a linear scan over the data. But then, you could just be sampling, too!
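On the sampling remark: one common way to draw a fixed-size uniform sample in a single linear pass, without knowing the data size in advance, is reservoir sampling. The sketch below is in Python and is only one possible approach, not something the answer itself prescribes; the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable, in one pass."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Example: sample 5 records from a stream that is never held in memory at once.
print(reservoir_sample(range(1_000_000), 5, random.Random(1)))
```

Memory use stays at `k` items however long the stream is, which is why this fits the "linear scan" case mentioned above.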

翻身的咸鱼 2024-10-21 11:27:45

Yesterday I found two new books about data mining. This series of books, titled ‘Data Mining’, addresses the need for a comprehensive text by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to covering each section in depth, the two books present useful hints and strategies for solving the problems in the chapters that follow. The progress of data mining technology and its broad public popularity establish a need for a comprehensive text on the subject. The books are “New Fundamental Technologies in Data Mining”, here: http://www.intechopen.com/books/show/title/new-fundamental-technologies-in-data-mining and “Knowledge-Oriented Applications in Data Mining”, here: http://www.intechopen.com/books/show/title/knowledge-oriented-applications-in-data-mining These are open access books, so you can download them for free or just read them on the online reading platform, like I do. Cheers!

故事与诗 2024-10-21 11:27:45

We should not forget the origins of these two pieces of software: scientific computation and signal processing led to Matlab, while statistics led to R.

I used Matlab a lot at university, since we had it installed on Unix and open to all students. However, the price of Matlab is too high, especially compared to the free R. If your major focus is not matrix computation and signal processing, R should work well for your needs.

用心笑 2024-10-21 11:27:45

I think it also depends on which field of study you are in. I know of people in coastal research who use Matlab a lot. Using R in this group would make your life more difficult. If a colleague has solved a problem, you can't use the solution because he solved it in Matlab.

献世佛 2024-10-21 11:27:45

I would also look at the capabilities of each when you are dealing with large amounts of data. I know that R can have problems with this, and might be restrictive if you are used to an iterative data mining process, for example, looking at multiple models concurrently. I don't know whether MATLAB has a data limitation.

扛起拖把扫天下 2024-10-21 11:27:45

I admit to favoring MATLAB for data mining problems, and I give some of my reasoning here:

Why MATLAB for Data Mining?

I will admit to only a passing familiarity with R/S-Plus, but I'll make the following observations:

  1. R definitely has more of a statistical focus than MATLAB. I prefer building my own tools in MATLAB, so that I know exactly what they're doing, and I can customize them, but this is more of a necessity in MATLAB than it would be in R.

  2. Code for new statistical techniques (spatial statistics, robust statistics, etc.) often appears early in S-Plus (I assume that this carries over to R, at least to some extent).

  3. Some years ago, I found S-Plus, the commercial version of R, to have an extremely limited capacity for data. I cannot say what the state of R/S-Plus is today, but you may want to check whether your data will fit into such tools comfortably.
