Is there any loss of functionality with streaming jobs in HBase/Hadoop compared to using Java?

Posted 2024-12-07 18:08:59


Sorry in advance if this is a basic question. I'm reading a book on HBase and learning, but most of the examples in the book (as well as online) tend to use Java (I guess because HBase is native to Java). There are a few Python examples, and I know I can access HBase with Python (using Thrift or other modules), but I'm wondering about additional functionality.

For example, HBase has a "coprocessors" feature that pushes the data to where you're doing your computing. Does this type of feature work with Python or other apps that use streaming Hadoop jobs? It seems that with Java it can know what you're doing and manage the data flow accordingly, but how does this work with streaming? If it doesn't work, is there a way to get this kind of functionality (via streaming, without switching to another language)?

Maybe another way of asking this is: what can a non-Java programmer do to get all the benefits of Hadoop's features when streaming?

Thanks in advance!


Comments (2)

旧人九事 2024-12-14 18:08:59


As far as I know, you are talking about two (or more) totally different concepts.

"Hadoop Streaming" is there to stream data through your executable (independent from your choice of programming language). When using streaming there can't be any loss of functionality, since the functionality is basicly map/reduce the data you are getting from hadoop stream.

For the Hadoop part you can even use Pig or Hive, the big-data query languages, to get things done efficiently. With the newest versions of Pig you can even write custom functions in Python and use them inside your Pig scripts.
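
For instance, Pig 0.8+ lets you register a Python (Jython) UDF from a .py file. A minimal sketch (all names here are made up; pig_util comes with the Pig distribution, though it may need to be on the script's path):

```python
# myudfs.py -- a sketch of a Jython UDF callable from a Pig script.
# The outputSchema decorator tells Pig the type the function returns.
from pig_util import outputSchema

@outputSchema("word:chararray")
def normalize(word):
    # Lower-case a token, tolerating nulls passed in from Pig.
    return word.lower() if word is not None else None
```

Inside the Pig script you would then register and call it with something like `REGISTER 'myudfs.py' USING jython AS myfuncs;` followed by `FOREACH words GENERATE myfuncs.normalize(word);`.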

Although there are tools to let you use the language you are comfortable with, never forget that the Hadoop framework is mostly written in Java. There could be times when you would need to write a specialized InputFormat, or a UDF inside Pig, etc. Then a decent knowledge of Java would come in handy.

Your "Hbase coprocessors" example is kinda unrelated with streaming functionality of hadoop. Hbase coproccessors consists of 2 parts : server-side part, client-side part. I am pretty sure there would be some useful server-side coprocessors embedded inside hbase with release; but other than that you would need to write your own coprocessor (and bad news: its java). For client side I am sure you would be able to use them with your favorite programming language through thrift without too much problem.

So as an answer to your question: you can always dodge learning Java and still use Hadoop to its potential (using third-party libraries/applications). But when shit hits the fan it's better to understand the underlying machinery and be able to develop in Java. Knowing Java would give you full control over the Hadoop/HBase environment.

Hope you find this helpful.

野鹿林 2024-12-14 18:08:59


Yes, you should get data-local code execution with streaming. You do not push the data to where the program is; you push the program to where the data is. Streaming simply takes the local input data and runs it through stdin into your Python program. Instead of each map running inside a Java task, it spins up an instance of your Python program and just pumps the input through that.
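
Because the contract is nothing more than lines in on stdin and lines out on stdout, you can also test such scripts locally with an ordinary shell pipe before submitting the job, e.g. `cat sample.txt | ./mapper.py | sort | ./reducer.py` (referring to the hypothetical word-count sketch in the answer above; `sort` stands in for Hadoop's shuffle).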

If you really want to do fast processing, though, you really should learn Java. Having to pipe everything through stdin and stdout is a lot of overhead.
