Examples of simple statistical calculations using Hadoop

Posted 2024-08-27 14:02:09

I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my data, i.e. arithmetic mean and variance.

I've been googling for a while, but maybe I'm not using the right keywords and I haven't really found anything which is a good primer for doing this sort of calculation, so I thought I would ask here.

Can anyone point me to some good samples of how to calculate mean and variance using Hadoop, and/or provide some sample code?

Thanks

Comments (2)

星光不落少年眉 2024-09-03 14:02:09

Pig Latin has an associated library of reusable code called PiggyBank that has numerous handy functions. Unfortunately it didn't have a variance function last time I checked, but maybe that has changed. If nothing else, it might provide examples to get you started on your own implementation.

I should note that variance is difficult to implement in a stable way over huge data sets, so take care!
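To make the stability point concrete: the textbook shortcut of summing `x` and `x*x` per partition and then computing `E[x^2] - E[x]^2` can lose most of its precision when the values are large relative to their spread. A commonly used alternative is to keep a `(count, mean, M2)` summary per partition (Welford's update) and merge summaries pairwise (the formula usually attributed to Chan et al.). The sketch below is plain Python, not Hadoop or Pig code; `Stats`, `summarize` and `merge` are hypothetical names chosen for illustration, with `summarize` standing in for the map side and `merge` for the combine/reduce side.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass(frozen=True)
class Stats:
    """Mergeable summary of one partition (hypothetical helper type)."""
    n: int = 0         # number of values seen
    mean: float = 0.0  # running mean
    m2: float = 0.0    # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        """Population variance; use m2 / (n - 1) for the sample variance."""
        return self.m2 / self.n if self.n else float("nan")


def summarize(values: Iterable[float]) -> Stats:
    """Map side: one pass over a partition using Welford's update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Stats(n, mean, m2)


def merge(a: Stats, b: Stats) -> Stats:
    """Combine/reduce side: merge two partial summaries."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Stats(n, mean, m2)


# Three made-up partitions whose values are large compared to their spread.
partitions = [[1e9 + 4, 1e9 + 7], [1e9 + 13], [1e9 + 16]]
total = reduce(merge, map(summarize, partitions), Stats())
print(total.n, total.mean, total.variance)
```

Because `merge` only needs the two summaries and gives the same result (up to rounding) however the partitions are grouped, it should be usable as a combiner as well as a reducer, assuming each summary is serialized as a single record.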

冷情 2024-09-03 14:02:09

You might double-check whether your clustering code can drop into Cascading. It's quite trivial to add new functions, do joins, etc. with your existing Java libraries.

http://www.cascading.org/

And if you are into Clojure, you might watch these github projects:
http://github.com/clj-sys

They are layering new algorithms implemented in Clojure over Cascading (which in turn is layered over Hadoop MapReduce).
