File formats that can be read using PIG
What kinds of file formats can be read using PIG?
How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?
Comments (1)
There are a few built-in loading and storing methods (PigStorage, BinStorage, and TextLoader, for example), but they are limited.
Piggybank is a library of community-contributed user-defined functions. It has a number of loading and storing methods, including an XML loader, but not an XML storer.
I assume you mean XML here. Storing XML is a bit rough in Hadoop because the output is split into one file per reducer, so how do you know where to put the root tag? Producing well-formed XML will likely require some sort of post-processing.
One thing you can do is write a UDF that converts your columns into an XML string. For example, say col1, col2, and col3 are "foo", 37, and "lemons", respectively. Your UDF can output the string "<item><name>foo</name><num>37</num><fruit>lemons</fruit></item>".
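As a rough illustration of the column-to-XML idea, here is a minimal sketch in Python, which Pig can run as a UDF through Jython. The function name to_xml_item and the tag names name, num, and fruit are assumptions made for this example. The step of registering the function with Pig and declaring its output schema is left out, so treat this as the string-building logic only.

```python
from xml.sax.saxutils import escape


def to_xml_item(name, num, fruit):
    """Render three column values as one <item> element on a single line."""
    # escape() replaces &, < and > so the values cannot break the markup.
    return "<item><name>%s</name><num>%s</num><fruit>%s</fruit></item>" % (
        escape(str(name)),
        escape(str(num)),
        escape(str(fruit)),
    )


if __name__ == "__main__":
    # Prints: <item><name>foo</name><num>37</num><fruit>lemons</fruit></item>
    print(to_xml_item("foo", 37, "lemons"))
```

Each record becomes one self-contained element, so the part files can later be concatenated and wrapped in a root tag during post-processing.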
You can't change the name of the output file to something other than part-m-00000. That's just how Hadoop works. If you want a different name, rename the file after the fact with something like hadoop fs -mv output/part-m-00000 newoutput/myoutputfile. This could be done with a bash script that runs the Pig script and then executes this command.
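The run-then-rename wrapper could look roughly like the sketch below. It is written in Python rather than bash, and it simply shells out to the pig and hadoop commands. The function names and the script name myscript.pig are hypothetical, the paths are the ones used above, and the destination directory is assumed to exist already.

```python
import subprocess


def build_commands(pig_script, part_file, final_path):
    """Return the two commands to run, in order, as argument lists."""
    return [
        # Run the Pig job first; it writes part_file under its output directory.
        ["pig", pig_script],
        # Then move the part file to the name we actually want.
        ["hadoop", "fs", "-mv", part_file, final_path],
    ]


def run_and_rename(pig_script, part_file, final_path):
    """Run the Pig script, then rename its output file."""
    for cmd in build_commands(pig_script, part_file, final_path):
        # check=True raises on a non-zero exit status, so the rename
        # is never attempted if the Pig job fails.
        subprocess.run(cmd, check=True)
```

Calling run_and_rename("myscript.pig", "output/part-m-00000", "newoutput/myoutputfile") would run the two steps in sequence. A job that produces several part files would need one move per file, or a merge step, which this sketch does not cover.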