Text parsing with PIG
I am new to PIG and don't know much about it. How can I parse text in PIG? To read a field's value, Pig has positional parameters: for example, $0 refers to the first field. Is there a similar feature that reads the entire row? Also, what is RADOOP, and where exactly can it be used?
Comments (4)
Your question suggests that you want to work with your data interactively, but that the data volume is large.
RADOOP is a combination of RapidMiner and Hadoop, and it should give you a GUI for running statistical analysis on your big data with Hadoop-scale processing.
In the meantime, I suggest you take a look at Google-Refine (http://code.google.com/p/google-refine/), which you can easily download and use to run your data evidence process.
With Google-Refine you can easily parse your data using built-in text, date, and numeric functions. You can also use Jython to extend the functionality as needed. It can handle large-scale data by sampling it, and you can investigate the data's features using the built-in Facets.
R is also a great tool for data evidence, with good sampling and other statistical analysis libraries. But its interface is command-line based, and it is aimed at advanced statisticians and analysts rather than the common user.
For text parsing, first of all you can read the PIG tutorials and the wordcount example.
Links are given below:
Pig tutorial
Wordcount example - Read the wordcount example from this link and relate it to the commands given in the tutorial.
I am not really sure what you are asking. Pig has a number of functions, such as TOKENIZE and the regex matching / extraction UDFs, which can be helpful. Naturally, you can also write any text processing code you like in Java or Python and invoke it from Pig.
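As a rough illustration of the "write it in Python and invoke it" route, here is a minimal sketch of a Python UDF file. The file name `text_udfs.py` and the function names `tokenize` and `extract_first` are made up for this example, and the fallback `outputSchema` decorator is only a stand-in so the file can also be run outside Pig. How Pig maps the return values depends on the UDF mode and Pig version, so treat this as a starting point rather than a reference implementation.

```python
# text_udfs.py (hypothetical file name)
import re

try:
    # Importable when the file is registered with `USING streaming_python`.
    from pig_util import outputSchema
except ImportError:
    # Stand-in for running outside Pig: it only records the schema string.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

# Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


@outputSchema("words:{t:(word:chararray)}")
def tokenize(line):
    """Return a bag (a list of 1-tuples) of the lower-cased words in a line."""
    if line is None:
        return []
    return [(w.lower(),) for w in WORD_RE.findall(line)]


@outputSchema("value:chararray")
def extract_first(line, pattern):
    """Return capture group 1 of the first match, or None if nothing matches.

    The pattern is expected to contain at least one capture group.
    """
    if line is None:
        return None
    m = re.search(pattern, line)
    return m.group(1) if m else None
```

On the Pig side, a file like this would typically be registered with something along the lines of `REGISTER 'text_udfs.py' USING jython AS text_udfs;` and then called as `text_udfs.tokenize(line)` inside a `FOREACH ... GENERATE`. For plain word splitting, the built-in TOKENIZE mentioned above is usually enough, and a UDF only becomes worthwhile for custom rules.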
I guess you are asking how to avoid tokenizing the row and instead take the entire row as a single field, right?
In that case, I think you can use PigStorage('\n'): with '\n' as the field delimiter, the entire row is treated as one field.
And I think by "RADOOP" you mean Hadoop, right? As a first step, you can run Pig in local mode, which means you do not need to install Hadoop.
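Once each row arrives as a single chararray field (Pig's built-in TextLoader is another way to get that), the parsing itself can be done in a small UDF. The sketch below assumes a made-up line layout of `<date> <level> <message>`. The file name `line_udfs.py`, the function `parse_line`, and the field names are hypothetical, and the fallback decorator is only there so the file runs outside Pig.

```python
# line_udfs.py (hypothetical file name)
#
# Possible Pig-side usage, assuming the input path 'input.txt':
#   REGISTER 'line_udfs.py' USING jython AS line_udfs;
#   raw    = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
#   parsed = FOREACH raw GENERATE FLATTEN(line_udfs.parse_line(line));
import re

try:
    # Importable when the file is registered with `USING streaming_python`.
    from pig_util import outputSchema
except ImportError:
    # Stand-in for running outside Pig: it only records the schema string.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

# Assumed layout: two whitespace-separated tokens, then free text.
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.*)$")


@outputSchema("parsed:(day:chararray, level:chararray, message:chararray)")
def parse_line(line):
    """Split a whole input row into (day, level, message), or None if malformed."""
    if line is None:
        return None
    m = LINE_RE.match(line.rstrip("\r\n"))
    if m is None:
        return None
    return m.groups()
```

Returning None for rows that do not match keeps malformed lines from failing the whole job. They show up as nulls that can be filtered out afterwards.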