How do I trim the header row from files processed by Hadoop Pig?

Posted on 2024-10-23 19:26:16

I am trying to parse tab-separated data files generated by our services, using a Pig program on Amazon's Elastic Map Reduce. Things are going well, except that all of our data files contain a header row that defines the purpose of each column. Obviously, the (string) headers can't be cast to numeric data values, so I get warnings from Pig like the following:

2011-03-17 22:49:55,378 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.PigStorage: Unable to interpret value [<snip>] in field being converted to double, caught NumberFormatException <For input string: "headerName"> field discarded

I've got a filter after the load statement that tries to ensure that I don't later operate on any header lines (by filtering out header terms), but I'd like to get rid of the warning noise to avoid masking any potential problems (like actual data fields that don't cast properly).

Is this possible?

写下不归期 2024-10-30 19:26:16

Another option, if you're not comfortable with writing a UDF, could be something like this:

Sample data:

MyIntVal
123
456

Script:

A = load 's3://blah/myFile' USING PigStorage() as (myintval: chararray);

B = filter A by myintval neq 'MyIntVal';

C = foreach B generate (int)$0;

This will filter the header row out, then cast your remaining values to int.

Not saying this is the best way to do it, but it's another option that is pretty simple if it fits your situation.

瑕疵 2024-10-30 19:26:16

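You could strip the header before submitting the Pig job (if that is possible), or try writing a UDF that emits null when some condition is met, so that you can filter those rows out later.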
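
Both ideas can be sketched in Python, which Pig accepts as a UDF language. This is only an illustration under some assumptions: the names to_double_or_null and strip_header are made up for this example, and the pig_util import is assumed to be available when the file is registered with Pig (the fallback just lets the file run as ordinary Python).

```python
try:
    # Assumption: Pig supplies this decorator when the file is registered as a UDF.
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """No-op stand-in so the file also runs as plain Python."""
        def decorator(func):
            return func
        return decorator


@outputSchema("value:double")
def to_double_or_null(raw):
    """Return raw as a float, or None (null in Pig) when it is not numeric."""
    if raw is None:
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        # Header text such as "headerName" ends up here.
        return None


def strip_header(lines, header_prefix):
    """Yield lines, skipping the first one if it starts with header_prefix."""
    for index, line in enumerate(lines):
        if index == 0 and line.startswith(header_prefix):
            continue
        yield line
```

If the columns are loaded as chararray and converted through a function like to_double_or_null, the header row should come out as null and can be filtered afterwards, and PigStorage never attempts the cast that triggers the warning. strip_header is the pre-processing variant: run it over each file before the job is submitted.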

往日 2024-10-30 19:26:16

This may help you get the result you want:

input_file = load 'input' using PigStorage(',') as (row1:chararray, row2:chararray);
ranked = rank input_file;
/* ranked:{rank_input_file:long, row1:chararray, row2:chararray} */
NoHeader = filter ranked by (rank_input_file > 1);
New_input_file = foreach NoHeader generate row1, row2;
