How do I trim the header row from files processed by Hadoop Pig?
I am trying to parse tab-separated data files generated by our services using Amazon's Elastic MapReduce via a Pig program. Things are going well, except that all of our data files contain a header row that defines the purpose of each column. Obviously, the (string) headers can't be cast to numeric data values, so I get warnings from Pig like the following:
2011-03-17 22:49:55,378 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.PigStorage: Unable to interpret value [<snip>] in field being converted to double, caught NumberFormatException <For input string: "headerName"> field discarded
I've got a filter after the load statement that tries to ensure that I don't later operate on any header lines (by filtering out header terms), but I'd like to get rid of the warning noise to avoid masking any potential problems (like actual data fields that don't cast properly).
Is this possible?
Comments (3)
Another option, if you're not comfortable with writing a UDF, could be something like this:
Sample data:
Script:
This will filter the header row out, then cast your remaining values to int.
Not saying this is the best way to do it, but it's another option that is pretty simple if it fits your situation.
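The filter-then-cast idea described above can be illustrated outside Pig. The following is only a minimal Python sketch of the logic, not the answerer's Pig script: the column names `id` and `count` and the function name `load_without_header` are hypothetical, chosen purely for illustration. Every field is read as text first, so the header never reaches a numeric conversion and no conversion warning can be triggered by it.

```python
import csv
import io


def load_without_header(text, header_first_field):
    """Read tab-separated text, drop header rows, then cast fields to int."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    # Step 1: all fields are still strings here, so the header row is
    # compared as text and never passed to int().
    data_rows = (row for row in rows if row and row[0] != header_first_field)
    # Step 2: cast only the rows that survived the filter.
    return [[int(field) for field in row] for row in data_rows]


# Hypothetical sample: a header line followed by two data lines.
sample = "id\tcount\n1\t10\n2\t20\n"
print(load_without_header(sample, "id"))  # prints [[1, 10], [2, 20]]
```

The trade-off is the same as in Pig: a data row that genuinely fails to convert still raises an error (or a warning, in Pig's case), which is the signal the question wants to keep visible.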
You can strip the header before submitting the Pig job (if possible), or try writing a UDF that emits null values when certain conditions are met, so that you can filter those rows out later.
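Both suggestions can be sketched in Python, under some assumptions. `strip_header` is a hypothetical helper that copies a local file without its first line before the file is uploaded for the Pig job; `to_double_or_none` only mirrors the behaviour the answer proposes for a UDF (return null instead of failing) and is not wired into Pig here. Neither name comes from the original answer.

```python
def strip_header(src_path, dst_path):
    """Copy src_path to dst_path, skipping the first (header) line."""
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        next(src, None)  # discard the header; tolerate an empty file
        for line in src:
            dst.write(line)


def to_double_or_none(value):
    """Return value as a float, or None when it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Unparseable input (such as a header term) becomes None so that
        # the row can be filtered out in a later step.
        return None
```

Note that a converter like this swallows every conversion failure, not just the header, so it hides exactly the kind of bad data field the question wants to notice. Stripping the header up front keeps genuine conversion warnings visible.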
This may help you to get your result:-