Passing a relation to a Pig UDF when using FOREACH on another relation?

Posted 2024-09-10 20:29:49

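We are using Pig 0.6 to process some data. One column of our data is a space-separated list of ids (for example: 35 521 225). We are trying to map one of those ids using another file that contains two columns of mappings, like this (column 1 is our data, column 2 is third-party data):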

35 6009
521 21599
225 51991
12 6129

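We wrote a UDF that takes the column value (for example: "35 521 225") and the mappings from the file. The UDF would split the column value, iterate over each id, and return the first mapped value from the passed-in mappings (we assumed that is how it would logically work).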

We load the data in Pig like this:

data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);

mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);

Then our GENERATE statement is:

output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);

But the error we get is:
'Error during parsing: Invalid alias: mappings in [data::title: chararray,data::category: chararray]'

It seems Pig is trying to find a column called "mappings" in our original data, which of course does not exist. Is there any way to pass a loaded relation into a UDF?

Is there any way the "Map" type in Pig can help us here? Or do we need to join the values somehow?

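EDIT: To be more specific, we don't want to map ALL of the category ids to third-party ids. We only want to map the first one that has a mapping. The UDF will iterate over our list of category ids and return as soon as it finds the first mapped value. So if the input looks like this: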

someProduct\t35 521 225

then the output would be:
someProduct\t6009

甜警司 2024-09-17 20:29:49

I don't think you can do it this way in Pig.

A solution similar to what you want would be to load the mapping file inside the UDF and then process each record in a FOREACH. PiggyBank has an example of this, LookupInFiles. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.

DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');

data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);

output = FOREACH data GENERATE title, MAP_PRODUCT(category);
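
For illustration, here is a minimal sketch of the lookup logic such a UDF might wrap. It is plain Java with no Pig dependency so it can be run on its own. The class name FirstMappingLookup and its methods are hypothetical, not part of Pig or PiggyBank. In a real UDF this logic would presumably sit inside a class extending org.apache.pig.EvalFunc<String>, with the mapping file read once (from the path given in DEFINE) rather than passed in as a list of lines.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pig or PiggyBank.
final class FirstMappingLookup {
    // ourId -> theirId, built once and reused for every record.
    private final Map<String, String> ourIdToTheirId = new HashMap<String, String>();

    // Each line is assumed to hold "ourId<whitespace>theirId", as in mappings.txt.
    FirstMappingLookup(List<String> mappingLines) {
        for (String line : mappingLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                ourIdToTheirId.put(parts[0], parts[1]);
            }
        }
    }

    // Returns the third-party id for the first category id that has a mapping,
    // or null when none of the ids is mapped.
    String firstMapped(String categories) {
        if (categories == null) {
            return null;
        }
        for (String id : categories.trim().split("\\s+")) {
            String theirId = ourIdToTheirId.get(id);
            if (theirId != null) {
                return theirId;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FirstMappingLookup lookup = new FirstMappingLookup(
                Arrays.asList("35 6009", "521 21599", "225 51991", "12 6129"));
        System.out.println(lookup.firstMapped("35 521 225")); // prints 6009
    }
}
```

Building the map once in the constructor matters because the lookup is then a single hash probe per id, instead of re-reading the file for every record.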

This will work if your mapping file is not too big. If it does not fit in memory, you will have to either partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and then use a native join with a nested FOREACH and ORDER BY/LIMIT 1 for each product.
