How do I use MapReduce with Cassandra, with or without Pig?
Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end.
https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/
For instance, let's say I'm using Python and Pycassa: how would I load in a new MapReduce function, and then call it? Does my MapReduce function have to be Java code installed on the Cassandra server? If so, how do I call it from Pycassa?
There's also mention of Pig making this all easier, but I'm a complete Hadoop noob, so that didn't really help.
Your answer can use Thrift or whatever; I just mentioned Pycassa to denote the client side. I'm just trying to understand the difference between what runs in the Cassandra cluster vs. the actual machine making the requests.
From what I've heard (and from here), the way a developer writes a MapReduce program that uses Cassandra as the data source is as follows. You write a regular MapReduce program (the example you linked to is the pure-Java version), and the jars that are now available provide a custom InputFormat that allows the input source to be Cassandra (instead of the default, which is HDFS).

If you're using Pycassa, I'd say you're out of luck until either (1) the maintainer of that project adds support for MapReduce or (2) you throw some Python functions together that write out a Java MapReduce program and run it. The latter is definitely a bit of a hack, but it would get you up and going.
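To make the division of labor concrete, here is a minimal pure-Python sketch of what the word-count job computes. This is not Hadoop's or Pycassa's API, just the shape of the map and reduce steps that run inside the cluster, with Cassandra rows modeled as plain dicts (the row data below is made up for illustration):

```python
from collections import defaultdict

def map_phase(rows):
    """Mapper: for each Cassandra row, emit (word, 1) for every
    word found in its column values."""
    for row in rows:
        for value in row["columns"].values():
            for word in value.split():
                yield (word, 1)

def reduce_phase(pairs):
    """Reducer: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical rows, standing in for what the InputFormat would
# read out of a Cassandra column family.
rows = [
    {"key": "doc1", "columns": {"text": "word count word"}},
    {"key": "doc2", "columns": {"text": "count"}},
]

print(reduce_phase(map_phase(rows)))  # {'word': 2, 'count': 2}
```

The client's only job is to submit the jar and collect results; the map and reduce functions themselves execute on the Hadoop task nodes, reading from Cassandra via the InputFormat.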
It knows about locality; the Cassandra InputFormat overrides getLocations() to preserve data locality.
The win of using a direct InputFormat from Cassandra is that it streams the data efficiently, which is a very big win. Each input split covers a range of tokens and rolls off the disk at its full bandwidth: no seeking, no complex querying. I don't think it knows about locality, i.e. having each tasktracker prefer input splits from a Cassandra process on the same node.
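As a rough picture of those input splits, slicing a token range into contiguous chunks can be sketched as below; the function and numbers are invented for illustration and are not Cassandra's actual split logic:

```python
def split_token_range(start, end, num_splits):
    """Divide a contiguous token range [start, end) into
    num_splits roughly equal sub-ranges, one per input split."""
    step, remainder = divmod(end - start, num_splits)
    splits = []
    lo = start
    for i in range(num_splits):
        # Spread any remainder across the first few splits.
        hi = lo + step + (1 if i < remainder else 0)
        splits.append((lo, hi))
        lo = hi
    return splits

# A toy ring of tokens 0..99 carved into 4 splits; each split is a
# contiguous range a mapper can stream off disk sequentially.
print(split_token_range(0, 100, 4))
# [(0, 25), (25, 50), (50, 75), (75, 100)]
```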
You can try using Pig with the STREAM method as a hack until more direct Hadoop streaming support is in place.
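For reference, the kind of script that Pig's STREAM operator (or Hadoop streaming) shells out to is just a program that reads records on stdin and writes tab-separated records on stdout. A minimal, hypothetical word-count mapper in that style might look like:

```python
import sys

def stream_map(lines):
    """Emit 'word<TAB>1' for every word in the incoming lines --
    the classic streaming-mapper contract."""
    out = []
    for line in lines:
        for word in line.split():
            out.append(f"{word}\t1")
    return out

if __name__ == "__main__":
    # Pig STREAM / Hadoop streaming pipes records through
    # stdin and stdout; the framework sorts and reduces them.
    for record in stream_map(sys.stdin):
        print(record)
```

This keeps your logic in Python while the Java-side plumbing (Pig, the InputFormat) handles reading from Cassandra and shuffling the output.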