Pig load problem with multiple delimiters
I have some log lines like these:
Sep 10 12:00:01 10.100.2.28 t: |US,en,5,7350,100,0.076241,0.105342,-1,0,1,5,2,14,,,0,5134,7f378ecef7,fec81ebe-468a-4ac7-b472-8bd1ee88bfc2
Sep 10 12:00:01 10.100.2.28 t: |US,en,3,22427,100,0.05816,0.04018,-1,0,1,15,15,0,24383,cyclops.untd.com/,0,2796,2c5de71073,4858b748-121a-4f60-8087-97a8527d57c6
Sep 10 12:00:01 10.100.2.28 t: |us,en,6,16839,100,-1,-1,-1,17,1,0,-1,0,13819,d.tradex.openx.com/,0,-1,,4f805e3b-86b7-4dee-ae68-24e726cde954
Now, as is evident, there are two delimiters (comma and space). With the PigStorage function, I think I can use only one of them. That leaves me with a chararray holding the rest of the line, still separated by the other delimiter (space or comma).
I want to access each member of that chararray but cannot do so. I have also tried TOKENIZE, but that returns a bag, and I don't think the items in a bag are ordered, so they cannot be accessed individually.
Monks, any help would be greatly appreciated.
Tanuj
Comments (3)
You can write your own custom user-defined load function that can handle the loading in any way you want. Usually, if your format is some sort of weird custom format, you are going to be stuck doing this. You can also get the nice feature of having your custom loader automatically name the columns.
Your other option would be to preprocess your data before it gets into Pig to be nicely delimited. I'm not sure how your data is set up or how it is coming in, so I'm not sure if this is possible. In general, a little data grooming and sanitization is never a bad thing.
The simplest solution I can think of would be to use the built-in PigStorage loader for one of the two delimiters, then STRSPLIT to handle the other one.
Example (assuming there are 19 comma-separated fields, since that's what it looked like):
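A minimal sketch of that approach, not necessarily the exact script this answer had in mind. The input path, the relation names, and the field names (`mon`, `dom`, `clock`, `host`, `tag`, `payload`, `f1` through `f19`) are placeholders invented for illustration; only `PigStorage`, `STRSPLIT`, `REPLACE`, and `FLATTEN` are Pig built-ins. It loads on spaces, strips the leading `|` from the last field, and splits what remains on commas. STRSPLIT returns a tuple rather than a bag, so the fields stay in order.

```pig
-- Hypothetical input path. Splitting on single spaces yields six fields per line:
-- month, day, time, host, the "t:" tag, and the comma-separated payload.
raw = LOAD 'logs/sample.log' USING PigStorage(' ')
      AS (mon:chararray, dom:chararray, clock:chararray,
          host:chararray, tag:chararray, payload:chararray);

-- Remove the leading '|' and split the payload into at most 19 parts.
-- A positive limit keeps empty fields such as the ",," runs in the sample.
-- FLATTEN turns the resulting tuple into ordinary top-level columns.
fields = FOREACH raw GENERATE
    mon, dom, clock, host,
    FLATTEN(STRSPLIT(REPLACE(payload, '^[|]', ''), ',', 19))
    AS (f1:chararray,  f2:chararray,  f3:chararray,  f4:chararray,
        f5:chararray,  f6:chararray,  f7:chararray,  f8:chararray,
        f9:chararray,  f10:chararray, f11:chararray, f12:chararray,
        f13:chararray, f14:chararray, f15:chararray, f16:chararray,
        f17:chararray, f18:chararray, f19:chararray);

-- Each comma-separated value can now be addressed by name.
picked = FOREACH fields GENERATE host, f1, f4, f19;
DUMP picked;
```

The sketch also assumes single spaces in the syslog prefix; a day padded with two spaces (as some syslog formats do for single-digit days) would shift the columns.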
Note that this would break if there were spaces inside any of your comma-delimited fields.
Write your own UDF; it will be the best way to solve your problem.