Large-Scale Machine Learning

Posted 2024-09-08 21:53:41 · 253 characters · 7 views

I need to run various machine learning techniques on a big dataset (10-100 billion records).
The problems are mostly around text mining/information extraction and include various kernel techniques, but are not restricted to them (we use some Bayesian methods, bootstrapping, gradient boosting, regression trees -- many different problems and ways to solve them).

What would be the best implementation? I'm experienced in ML but do not have much experience with huge datasets.
Is there any extensible and customizable machine learning library that utilizes MapReduce infrastructure?
Strong preference for C++, but Java and Python are OK.
Amazon, Azure, or our own datacenter (we can afford it)?

Comments (8)

站稳脚跟 2024-09-15 21:53:41

Unless the classification state space you are attempting to learn is extremely large, I would expect that there is significant redundancy in a text-mining-focused dataset with 10-100 billion records or training samples. As a rough guess, I would doubt that one would need much more than a 1-2% random sample subset to learn reliable classifiers that would hold up well under cross-validation testing.
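
One quick way to check that guess empirically is a subsample learning curve: train on increasing random fractions of the data and see where the cross-validated score flattens. The sketch below is only an illustration; it assumes scikit-learn and a feature matrix X with labels y already extracted from a manageable slice of the corpus (all of these names are placeholders, not anything from the question).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def cv_score_at_fraction(X, y, fraction, folds=5):
    # Cross-validated accuracy of a linear classifier trained on a random subsample.
    n = X.shape[0]
    idx = rng.choice(n, size=max(folds, int(fraction * n)), replace=False)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[idx], y[idx], cv=folds).mean()

# If the curve flattens around 1-2%, the redundancy argument above holds for that task:
# for f in (0.001, 0.005, 0.01, 0.02, 0.05):
#     print(f, cv_score_at_fraction(X, y, f))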

A quick literature search came up with the following relevant papers. The Tsang paper claims O(n) time complexity for n training samples, and there is software related to it available as the LibCVM toolkit. The Wolfe paper describes a distributed EM approach based on MapReduce.
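
To make the Wolfe et al. idea concrete: the E-step is embarrassingly parallel (each record's expected sufficient statistics can be computed independently, i.e. a "map"), and the M-step only needs the summed statistics (a "reduce"). The toy sketch below uses a two-component 1-D Gaussian mixture purely to show that decomposition; it is not the model or code from the paper.

import math

def e_step_map(x, params):
    # Map: responsibilities and weighted sufficient statistics for one record.
    # params is a list of (weight, mean, variance) tuples.
    dens = [w * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
            for (w, mu, var) in params]
    z = sum(dens)
    return [(d / z, d / z * x, d / z * x * x) for d in dens]  # (count, sum, sum of squares)

def m_step_reduce(all_stats, n):
    # Reduce: sum the per-record statistics and re-estimate (weight, mean, variance).
    k = len(all_stats[0])
    totals = [(sum(s[j][0] for s in all_stats),
               sum(s[j][1] for s in all_stats),
               sum(s[j][2] for s in all_stats)) for j in range(k)]
    return [(c / n, s / c, max(ss / c - (s / c) ** 2, 1e-6)) for (c, s, ss) in totals]

# One EM iteration over a toy dataset:
data = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
params = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
params = m_step_reduce([e_step_map(x, params) for x in data], len(data))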

Lastly, there was a Large-Scale Machine Learning workshop at the NIPS 2009 conference that looks to have had lots of interesting and relevant presentations.

References

Ivor W. Tsang, James T. Kwok, Pak-Ming Cheung (2005). "Core Vector Machines: Fast SVM Training on Very Large Data Sets", Journal of Machine Learning Research, vol. 6, pp. 363–392.

J. Wolfe, A. Haghighi, D. Klein (2008). "Fully Distributed EM for Very Large Datasets", Proceedings of the 25th International Conference on Machine Learning, pp. 1184–1191.

Olivier Camp, Joaquim B. L. Filipe, Slimane Hammoudi and Mario Piattini (2005). "Mining Very Large Datasets with Support Vector Machine Algorithms", Enterprise Information Systems V, Springer Netherlands, pp. 177–184.

眉目亦如画i 2024-09-15 21:53:41

Apache Mahout is what you are looking for.

装纯掩盖桑 2024-09-15 21:53:41

I'm not aware of any ML library that uses map/reduce. Maybe you have the capability to use an ML library and a Map/Reduce library together? You might want to look into Hadoop's Map/Reduce:
http://hadoop.apache.org/mapreduce/

You would have to implement the reduce and the map methods. The fact that you use so many techniques might complicate this.
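
With Hadoop Streaming, the map and reduce steps can be plain scripts that read stdin and write tab-separated key/value pairs to stdout, which keeps things language-agnostic. Below is a minimal sketch for a first feature-extraction pass (token counting); the file names and paths are just placeholders.

#!/usr/bin/env python3
# mapper.py: emit one ("token", 1) pair per token
import sys

for line in sys.stdin:
    for token in line.strip().split():
        print(f"{token}\t1")

#!/usr/bin/env python3
# reducer.py: Streaming delivers mapper output grouped and sorted by key,
# so a running sum per key is enough
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")

The job would then be launched with the streaming jar that ships with Hadoop (exact path and options vary by version), roughly: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /corpus -output /token_counts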

You can run it on your own cluster, or if you are doing research, maybe you could look into BOINC (http://boinc.berkeley.edu/).

On the other hand, maybe you can reduce your dataset. I have no idea what you are training on, but there must be some redundancy in 10 billion records...

已下线请稍等 2024-09-15 21:53:41

I don't know of any ML libraries that can support 10 to 100 billion records; that's a bit extreme, so I wouldn't expect to find anything off the shelf. What I would recommend is that you take a look at the Netflix Prize winners: http://www.netflixprize.com//community/viewtopic.php?id=1537

The Netflix Prize had over 100 million entries, so while it's not quite as big as your dataset, you may still find their solutions to be applicable. What the BellKor team did was to combine multiple algorithms (something similar to ensemble learning) and weight the "prediction" or output of each algorithm.
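
The weighting step itself can be as simple as fitting non-negative weights to each base model's held-out predictions and combining with them. Here is a minimal sketch of that blending idea (not BellKor's actual procedure; all names are placeholders):

import numpy as np

def fit_blend_weights(preds, y):
    # preds: (n_samples, n_models) held-out predictions; y: true targets.
    # Least-squares weights, clipped to be non-negative and renormalized.
    w, *_ = np.linalg.lstsq(preds, y, rcond=None)
    w = np.clip(w, 0, None)
    return w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))

def blend(preds, w):
    # Weighted combination of the base models' predictions.
    return preds @ w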

千紇 2024-09-15 21:53:41

Take a look at http://hunch.net/?p=1068 for info on Vowpal Wabbit; it's a stochastic gradient descent library for large-scale applications.
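
Vowpal Wabbit streams its own plain-text format ("label | feature feature ..."), so it never has to hold the dataset in memory. A small sketch for converting a tab-separated corpus into that format; the file names and labels are placeholders, and note that VW's logistic loss expects labels of -1/+1.

def to_vw_line(label, text):
    # ':' and '|' are reserved inside VW feature names, so neutralize them
    tokens = (t.replace(":", "_").replace("|", "_") for t in text.split())
    return f"{label} | {' '.join(tokens)}"

with open("corpus.tsv") as src, open("train.vw", "w") as dst:
    for line in src:
        label, text = line.rstrip("\n").split("\t", 1)
        dst.write(to_vw_line(label, text) + "\n")

# then train with something like:
#   vw train.vw --loss_function logistic -b 28 -f model.vw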

横笛休吹塞上声 2024-09-15 21:53:41

A friend of mine has worked on a similar project. He used Perl for text mining and MATLAB for techniques such as Bayesian methods, latent semantic analysis, and Gaussian mixtures...
