Passing a relation to a Pig UDF when using FOREACH on another relation?
We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (for example: 35 521 225). We are trying to map one of those ids using another file that contains two columns of mappings, where column 1 is our id and column 2 is a third party's id:
35 6009
521 21599
225 51991
12 6129
We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. The UDF would split the column value, iterate over the ids, and return the first mapped value found in the passed-in mappings (we assumed that is how it would logically work).
We are loading the data in Pig like this:
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);
mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);
Then our generate is:
output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);
However the error we get is:
'there is an error during parsing: Invalid alias mappings in [data::title: chararray,data::category, chararray]'
It seems that Pig is trying to find a column called "mappings" in our original data, which of course isn't there. Is there any way to pass a loaded relation into a UDF?
Would the "Map" type in Pig help us here? Or do we need to join the values somehow?
EDIT: To be more specific: we don't want to map ALL of the category ids to the third-party ids. We only want to map the first one that has a mapping. The UDF will iterate over the list of our category ids and return as soon as it finds the first mapped value. So if the input looked like:
someProduct\t35 521 225
the output would be:
someProduct\t6009
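To make the intended lookup concrete, here is a minimal Java sketch of the "first mapped value" step on its own, assuming the mappings are already available as an in-memory map. The class and method names (FirstMappedId, firstMapped) are hypothetical and not part of Pig; this is only the logic a UDF such as Mapper might run for each record.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Pig API): the per-record lookup described in the EDIT above.
class FirstMappedId {

    // Returns the third-party id for the first category id that has a mapping,
    // or null when the input is null or none of the ids is mapped.
    static String firstMapped(String categories, Map<String, String> mappings) {
        if (categories == null) {
            return null;
        }
        // The category column is a whitespace-separated list such as "35 521 225".
        for (String id : categories.trim().split("\\s+")) {
            String mapped = mappings.get(id);
            if (mapped != null) {
                return mapped; // stop at the first id that has a mapping
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new HashMap<String, String>();
        mappings.put("35", "6009");
        mappings.put("521", "21599");
        mappings.put("225", "51991");
        mappings.put("12", "6129");
        System.out.println(firstMapped("35 521 225", mappings)); // prints 6009
    }
}
```

Ids are tried in the order they appear in the column, so a record whose first id is unmapped falls through to the next one.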
Comments (1)
I don't think you can do it this way in Pig.
A solution similar to what you wanted would be to load the mapping file inside the UDF itself and then process each record in a FOREACH. PiggyBank's LookupInFiles is an example of this. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.
This will work if your mapping file is not too big. If it does not fit in memory, you will have to either partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and then use a native join followed by a nested FOREACH with ORDER BY/LIMIT 1 for each product.
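The in-memory approach could be sketched roughly as follows. This is only an illustration of the loading step, assuming each line of the mapping file holds two whitespace- or tab-separated columns; MappingLoader and load are hypothetical names, and in a real UDF this would be called once (for example on the first call to exec) with a reader opened on the locally cached copy of the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Pig API): reads the two-column mapping file into memory.
class MappingLoader {

    // Builds an ourId -> theirId map from a reader; lines with fewer than two columns are skipped.
    static Map<String, String> load(Reader source) {
        Map<String, String> mappings = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(source);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: columns are separated by a tab or spaces.
                String[] cols = line.trim().split("\\s+");
                if (cols.length >= 2) {
                    mappings.put(cols[0], cols[1]);
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short; a real UDF would report this properly.
            throw new RuntimeException(e);
        }
        return mappings;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the mapping file here.
        Map<String, String> mappings = load(new StringReader("35\t6009\n521\t21599\n225\t51991\n12\t6129\n"));
        System.out.println(mappings.get("521")); // prints 21599
    }
}
```

Keeping the map in a field of the UDF instance means the file is parsed once per task rather than once per record, which is what makes this workable for a small mapping file.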