Using a custom InputFormat with Hive



Update: Alright, it turns out the reason the code below isn't working is that I'm using the newer version of the InputFormat API (org.apache.hadoop.mapred is the old one, org.apache.hadoop.mapreduce is the new one). The problem I have now is porting my existing code back to the old API. Has anyone had experience writing a multi-line InputFormat using the old API?
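For reference, here is a minimal sketch of what an input format against the old API can look like. Hive's STORED AS INPUTFORMAT expects a class implementing the old org.apache.hadoop.mapred.InputFormat interface, which FileInputFormat already does; the class name below mirrors the one in the CREATE TABLE statement, but the body is an assumption for illustration, not the actual gist code.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;   // old API: org.apache.hadoop.mapred
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical skeleton: FileInputFormat from the old API already implements
// org.apache.hadoop.mapred.InputFormat, which is what Hive checks for.
public class OmnitureDataFileInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        // The record reader is sketched further below.
        return new OmnitureDataFileRecordReader(job, (FileSplit) split);
    }
}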


I'm trying to process Omniture's data log files with Hadoop/Hive. The file format is tab-delimited and, while pretty simple for the most part, it does allow a field to contain multiple newlines and tabs escaped by a backslash (\n and \t). As a result I've opted to create my own InputFormat that handles the multi-line records and converts the escaped tabs to spaces, since Hive will otherwise try to split on them. I've just tried loading some sample data into the table in Hive and got the following error:

CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 'OmnitureDataFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';

FAILED: Error in semantic analysis: line 1:14 Input Format must implement InputFormat omniture_hit_data

The odd thing is that my input format does extend org.apache.hadoop.mapreduce.lib.input.TextInputFormat (https://gist.github.com/4a380409cd1497602906).

Does Hive require that you extend org.apache.hadoop.hive.ql.io.HiveInputFormat instead? If so, do I have to rewrite any of my existing class code for the InputFormat and RecordReader or can I effectively just change the class it's extending?
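As for what a rewrite for the old API involves, here is a rough sketch of the corresponding RecordReader shape. In the old API you implement next(key, value) and fill in reusable key/value objects instead of the new API's nextKeyValue()/getCurrentKey()/getCurrentValue(). Split-boundary handling and the escape logic are omitted, and the names are assumptions rather than the gist's actual code.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.util.LineReader;

// Hypothetical old-API RecordReader: keys are byte offsets, values are lines.
public class OmnitureDataFileRecordReader implements RecordReader<LongWritable, Text> {

    private final long start;
    private final long end;
    private long pos;
    private final FSDataInputStream in;
    private final LineReader lineReader;   // swap in an escape-aware reader here

    public OmnitureDataFileRecordReader(JobConf job, FileSplit split) throws IOException {
        start = split.getStart();
        end = start + split.getLength();
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(job);
        in = fs.open(file);
        in.seek(start);
        lineReader = new LineReader(in, job);
        pos = start;
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
        if (pos >= end) {
            return false;
        }
        key.set(pos);
        int newSize = lineReader.readLine(value);
        if (newSize == 0) {
            return false;   // end of stream
        }
        pos += newSize;
        return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() throws IOException { return pos; }
    @Override public void close() throws IOException { in.close(); }

    @Override
    public float getProgress() throws IOException {
        return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }
}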


Comments (1)

煮酒 2024-12-16 01:06:15


Figured this out after looking at the code for LineReader and TextInputFormat. Created a new InputFormat to deal with this as well as an EscapedLineReader.

https://github.com/msukmanowsky/OmnitureDataFileInputFormat
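The repository above contains the actual implementation. Purely as an illustration of the idea (the class below is an assumption, not code from that repo), an escape-aware line reader can keep appending physical lines whenever the previous one ends in a backslash, so a record with escaped newlines comes back as a single logical line:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

// Illustrative only: wraps Hadoop's LineReader and keeps appending physical
// lines while the line so far ends with a backslash, so escaped newlines stay
// inside a single logical record. (A real reader would also need to handle an
// escaped backslash itself.)
public class EscapedLineReader {

    private final LineReader delegate;
    private final Text buffer = new Text();

    public EscapedLineReader(InputStream in) {
        this.delegate = new LineReader(in);
    }

    /** Reads one logical line into str; returns total bytes consumed (0 at EOF). */
    public int readLine(Text str) throws IOException {
        str.clear();
        int total = 0;
        while (true) {
            int read = delegate.readLine(buffer);
            if (read == 0) {
                return total;   // end of stream
            }
            total += read;
            str.append(buffer.getBytes(), 0, buffer.getLength());
            byte[] bytes = str.getBytes();
            int len = str.getLength();
            // If the logical line so far ends with a backslash, the newline was
            // escaped: put the newline back and keep reading.
            if (len > 0 && bytes[len - 1] == '\\') {
                str.append(new byte[] {'\n'}, 0, 1);
                continue;
            }
            return total;
        }
    }

    public void close() throws IOException {
        delegate.close();
    }
}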
