How can we process unformatted data with Apache Pig?
I want to use Apache Pig, but until now I have only parsed formatted data like CSV or other comma-separated files.
But if I have some data separated by ';' and '@&@' etc., how can I work with it?
When I used MapReduce, I split the data by ';' in the map and then again by '@&@' in the reduce.
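For illustration only, a made-up record of that kind might look like:

field1a@&@field1b;field2;field3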
Also suppose, for example, we have a CSV file whose first field, username, is in "FirstnameLastname" format:
raw = LOAD 'log.csv' USING PigStorage(',') AS (username: chararray, site: chararray, views: int);
With the example above we only get the whole username, but how can I get the first name and the last name separately?
2 Answers
You can do just about anything Java or Python can do with UDFs in Pig. Pig is not intended to have an exhaustive set of processing functions, but just to provide basic functionality. Piggybank fills the custom-code niche by collecting a bunch of community-contributed UDFs. Sometimes Piggybank just doesn't have what you need; it's a good thing UDFs are pretty simple to write.
You could write a custom loader that handles the unique structure of your data at load time. The custom load function manipulates the data with Java code and outputs the structured, columnar format that Pig is looking for. Another nice thing about custom loaders is that you can specify the load schema, so you don't have to write out the AS (...) clause.
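For concreteness, here is a minimal sketch of such a loader. The package, class name, and the ';'/'@&@' splitting are assumptions made up to match the question, not taken from the answer:

package com.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Reads plain text lines, splits each line on ';' and each piece on '@&@',
// and hands Pig one flat tuple per line.
public class SemicolonLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();            // read the input line by line
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                     // end of input
            }
            String line = ((Text) reader.getCurrentValue()).toString();
            List<Object> fields = new ArrayList<Object>();
            for (String part : line.split(";")) {
                for (String sub : part.split("@&@")) {   // second-level split
                    fields.add(sub);
                }
            }
            return tupleFactory.newTuple(fields);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

In the Pig script you would then load with it directly (jar name and path are placeholders):

REGISTER myudfs.jar;
raw = LOAD 'data.txt' USING com.example.SemicolonLoader();

Actually returning a schema, so the AS (...) clause can be skipped, would additionally require implementing Pig's LoadMetadata interface.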
You could write a custom evaluation function. Sometimes a function like SPLIT or TOKENIZE just isn't good enough. Use TextLoader to get your data in line by line, and then follow up with a UDF to parse that line and output a tuple (which can then be flattened into columns).
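A rough sketch of that approach, again with made-up names (com.example.ParseLine, myudfs.jar) and the same assumed ';'/'@&@' separators:

package com.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Takes one line of text and returns a tuple of the pieces,
// split first on ';' and then on '@&@'.
public class ParseLine extends EvalFunc<Tuple> {
    private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String line = (String) input.get(0);
        List<Object> fields = new ArrayList<Object>();
        for (String part : line.split(";")) {
            for (String sub : part.split("@&@")) {
                fields.add(sub);
            }
        }
        return TUPLE_FACTORY.newTuple(fields);
    }
}

And the matching Pig side (jar name and path are placeholders):

REGISTER myudfs.jar;
lines  = LOAD 'data.txt' USING TextLoader() AS (line: chararray);
parsed = FOREACH lines GENERATE FLATTEN(com.example.ParseLine(line));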
Maybe you can use STRSPLIT to split the string the second time, for example:
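A minimal sketch, with a two-field schema and field names assumed purely for illustration:

raw    = LOAD 'data.txt' USING PigStorage(';') AS (f1: chararray, f2: chararray);
-- f1 still contains '@&@'-separated values; STRSPLIT cuts it a second time
second = FOREACH raw GENERATE FLATTEN(STRSPLIT(f1, '@&@', 2)) AS (a: chararray, b: chararray), f2;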
Also, ';' could be split by '\\u003B', for example:
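A sketch of that variant (again with placeholder names); STRSPLIT takes a Java regular expression, so the escaped string '\\u003B' reaches the regex engine as the ';' character:

lines  = LOAD 'data.txt' USING TextLoader() AS (line: chararray);
pieces = FOREACH lines GENERATE STRSPLIT(line, '\\u003B');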