使用 FileFormat v Serde 读取自定义文本文件

发布于 2024-12-09 08:46:59 字数 157 浏览 0 评论 0原文

Hadoop/Hive 新手在这里。我正在尝试在 Hive 中使用以自定义文本格式存储的数据。我的理解是，您可以编写自定义 FileFormat 或自定义 SerDe 类来执行此操作。是这样还是我理解错了？关于何时选择哪个选项有哪些一般准则？谢谢！

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

ゝ杯具 2024-12-16 08:47:00

如果您使用 Hive，请编写一个 serde。请参阅这些示例：
https://github .com/apache/hive/tree/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2

请注意，此接口是 Hive 特定的。如果您想将自定义文件格式用于常规 hadoop 作业，则必须实现一个单独的接口（我不太确定是哪一个）。

如果您已经知道如何用另一种语言反序列化数据，则只需编写一个流作业（使用任何语言）并使用现有的库即可。

希望有帮助

回复收藏 0 原文

九厘米的零° 2024-12-16 08:46:59

我想通了。毕竟我不必编写 Serde，编写了一个自定义的 InputFormat （扩展 org.apache.hadoop.mapred.TextInputFormat ），它返回一个自定义的 RecordReader （实现 org.apache.hadoop.txt ）。 mapred.RecordReader)。 RecordReader 实现逻辑来读取和解析我的文件并返回制表符分隔的行。

我将我的表声明为

create table t2 ( 
field1 string, 
..
fieldNN float)        
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'    
STORED AS INPUTFORMAT 'namespace.CustomFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

This using a native SerDe。另外，使用自定义输入格式时需要指定输出格式，因此我选择其中一种内置输出格式。

I figured it out. I did not have to write a serde after all, wrote a custom InputFormat (extends org.apache.hadoop.mapred.TextInputFormat) which returns a custom RecordReader (implements org.apache.hadoop.mapred.RecordReader<K, V>). The RecordReader implements logic to read and parse my files and returns tab delimited rows.

With that I declared my table as

create table t2 ( 
field1 string, 
..
fieldNN float)        
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'    
STORED AS INPUTFORMAT 'namespace.CustomFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

This uses a native SerDe. Also, it is required to specify an output format when using a custom input format, so I choose one of the built-in output formats.

回复收藏 0 原文