
File formats that can be read using Pig

Posted 2024-12-28 15:35:56, 142 words, 1 view, 0 comments

What kinds of file formats can be read using Pig?

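How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?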


疑心病 2025-01-04 15:35:56


What kinds of file formats can be read using Pig? How can I store them in different formats?

There are a few built-in loading and storing methods, but they are limited:

BinStorage - "binary" storage
PigStorage - loads and stores data that is delimited by something (such as tab or comma)
TextLoader - loads data line by line (i.e., delimited by the newline character)

piggybank is a library of community-contributed user-defined functions. It has a number of loading and storing methods, including an XML loader, but not an XML storer.

Say we have a CSV file and I want to store it as an MXL file; how can this be done?

I assume you mean XML here. Storing XML is a bit rough in Hadoop because the output is split into separate files on a per-reducer basis, so how do you know where to put the root tag? Producing well-formed XML will likely need some sort of post-processing step.
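One possible shape for that post-processing step, as a minimal Python sketch rather than anything the answer prescribes: it assumes the part files have already been concatenated and that each line holds one complete XML fragment. The function name wrap_records and the root tag "items" are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Iterator


def wrap_records(lines: Iterable[str], root: str = "items") -> Iterator[str]:
    """Yield a single XML document built from per-record fragments.

    `lines` is assumed to hold one fragment per line, as concatenated
    from the part-* files; `root` is a hypothetical root tag name.
    """
    yield '<?xml version="1.0" encoding="UTF-8"?>'
    yield f"<{root}>"
    for line in lines:
        fragment = line.strip()
        if fragment:  # skip blank lines between part files
            yield "  " + fragment
    yield f"</{root}>"


if __name__ == "__main__":
    parts = [
        "<item><name>foo</name><num>37</num><fruit>lemons</fruit></item>\n",
        "\n",
        "<item><name>bar</name><num>5</num><fruit>limes</fruit></item>\n",
    ]
    document = "\n".join(wrap_records(parts))
    ET.fromstring(document.encode("utf-8"))  # raises ParseError if not well-formed
    print(document)
```

The single root element is added exactly once, after the job has finished, which sidesteps the question of which reducer should emit it.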

One thing you can do is to write a UDF that converts your columns into an XML string:

B = FOREACH A GENERATE customudfs.DataToXML(col1, col2, col3);

For example, say col1, col2, col3 are "foo", 37, "lemons", respectively. Your UDF can output the string "<item><name>foo</name><num>37</num><fruit>lemons</fruit></item>".
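The conversion logic such a UDF would carry out can be sketched in plain Python. This is an illustration of the string-building step only, not a registered Pig UDF (the answer does not say which language customudfs.DataToXML is written in). The function name data_to_xml is hypothetical; the tag names are taken from the example above. Escaping is included on the assumption that column values may contain &, < or >.

```python
from xml.sax.saxutils import escape


def data_to_xml(name, num, fruit) -> str:
    """Render one row as an <item> fragment, escaping &, < and > in values."""
    fields = (("name", name), ("num", num), ("fruit", fruit))
    body = "".join(f"<{tag}>{escape(str(value))}</{tag}>" for tag, value in fields)
    return f"<item>{body}</item>"


if __name__ == "__main__":
    # prints <item><name>foo</name><num>37</num><fruit>lemons</fruit></item>
    print(data_to_xml("foo", 37, "lemons"))
```

Without the escaping step, a value such as "a&b" would make the fragment invalid XML even before the root-tag problem comes up.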

Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?

You can't change the name of the output file to be something other than part-m-00000. That's just how Hadoop works. If you want a different name, rename the file after the fact with something like hadoop fs -mv output/part-m-00000 newoutput/myoutputfile. This could be done with a bash script that runs the Pig script and then executes this command.
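The answer suggests a bash wrapper. The same two-step sequence is sketched here in Python instead, so that the command construction can be checked without a cluster. The script name myscript.pig and the function names are hypothetical, and the paths are the ones from the command above. It assumes the pig and hadoop executables are on the PATH and that the job writes a single part file.

```python
import subprocess
from typing import List


def build_commands(pig_script: str, part_file: str, dest: str) -> List[List[str]]:
    """Return the argv lists to run, in order: the Pig job, then the rename."""
    return [
        ["pig", pig_script],
        ["hadoop", "fs", "-mv", part_file, dest],
    ]


def run_and_rename(pig_script: str, part_file: str, dest: str) -> None:
    """Run each command in turn, stopping at the first failure."""
    for argv in build_commands(pig_script, part_file, dest):
        # check=True raises CalledProcessError on a non-zero exit status,
        # so the rename is never attempted if the Pig job fails.
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run: only print the commands; call run_and_rename to execute them.
    for argv in build_commands(
        "myscript.pig", "output/part-m-00000", "newoutput/myoutputfile"
    ):
        print(" ".join(argv))
```

Keeping the command list separate from the code that executes it makes the sequence easy to inspect or log before anything touches HDFS.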
