Examples of simple statistical calculations using Hadoop
I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my data, i.e. arithmetic mean and variance.
I've been googling for a while, but maybe I'm not using the right keywords and I haven't really found anything which is a good primer for doing this sort of calculation, so I thought I would ask here.
Can anyone point me to some good samples of how to calculate mean and variance using Hadoop, and/or provide some sample code?
Thanks
Comments (2)
Pig Latin has an associated library of reusable code called PiggyBank that has numerous handy functions. Unfortunately, it didn't have a variance function the last time I checked, but maybe that has changed. If nothing else, it might provide examples to get you started on your own implementation.
I should note that variance is difficult to implement in a stable way over huge data sets, so take care!
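To make the stability point concrete, here is a minimal sketch of one common approach: each partition is reduced to a small summary (count, mean, sum of squared deviations) with Welford's one-pass update, and the summaries are then merged pairwise with the formula from Chan et al. This is plain Python that only imitates the map and reduce steps; it does not use any Hadoop or Pig API, and the names `Stats`, `summarize` and `merge` are made up for illustration.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass(frozen=True)
class Stats:
    """Partial summary of a partition: count, mean, sum of squared deviations (m2)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @property
    def variance(self) -> float:
        """Population variance; use m2 / (n - 1) instead for the sample variance."""
        return self.m2 / self.n if self.n else float("nan")


def summarize(values: Iterable[float]) -> Stats:
    """Map side (hypothetical): one pass over a partition using Welford's update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the mean before and after the update
    return Stats(n, mean, m2)


def merge(a: Stats, b: Stats) -> Stats:
    """Reduce side (hypothetical): combine two partial summaries pairwise."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Stats(n, mean, m2)


# Three made-up partitions standing in for input splits.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
total = reduce(merge, (summarize(p) for p in partitions), Stats())
print(total.n, total.mean, total.variance)
```

Because `merge` is associative, the same function could in principle serve as both combiner and reducer. The naive alternative of accumulating the sum and the sum of squares and then subtracting is the version that tends to lose precision when the mean is large relative to the spread.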
You might double-check whether your clustering code can drop into Cascading. It's quite trivial to add new functions, do joins, etc. with your existing Java libraries.
http://www.cascading.org/
And if you are into Clojure, you might watch these github projects:
http://github.com/clj-sys
They are layering new algorithms implemented in Clojure over Cascading (which in turn is layered over Hadoop MapReduce).