如何为 PIG 或 HIVE 中的行添加行号？

发布于 2025-01-05 05:52:18 字数 462 浏览 7 评论 0原文

我在使用 Apache Pig 添加行号时遇到问题。问题是我有一个 STR_ID 列，我想为 STR_ID 中的数据添加一个 ROW_NUM 列，即 STR_ID 的行号。

例如，以下是输入：

STR_ID
------------
3D64B18BC842
BAECEFA8EFB6
346B13E4E240
6D8A9D0249B4
9FD024AA52BA

如何获得如下输出：

   STR_ID    |   ROW_NUM
----------------------------
3D64B18BC842 |     1
BAECEFA8EFB6 |     2
346B13E4E240 |     3
6D8A9D0249B4 |     4
9FD024AA52BA |     5

使用 Pig 或 Hive 的答案是可以接受的。谢谢。

原文

I have a problem when adding row numbers using Apache Pig.
The problem is that I have a STR_ID column and I want to add a ROW_NUM column for the data in STR_ID, which is the row number of the STR_ID.

For example, here is the input:

STR_ID
------------
3D64B18BC842
BAECEFA8EFB6
346B13E4E240
6D8A9D0249B4
9FD024AA52BA

How do I get the output like:

   STR_ID    |   ROW_NUM
----------------------------
3D64B18BC842 |     1
BAECEFA8EFB6 |     2
346B13E4E240 |     3
6D8A9D0249B4 |     4
9FD024AA52BA |     5

Answers using Pig or Hive are acceptable. Thank you.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梦中楼上月下 2025-01-12 05:52:18

在 Hive 中：

查询

select str_id,row_number() over() from tabledata;

输出

3D64B18BC842      1
BAECEFA8EFB6      2
346B13E4E240      3
6D8A9D0249B4      4
9FD024AA52BA      5

In Hive:

Query

select str_id,row_number() over() from tabledata;

Output

3D64B18BC842      1
BAECEFA8EFB6      2
346B13E4E240      3
6D8A9D0249B4      4
9FD024AA52BA      5

回复收藏 0 原文

林空鹿饮溪 2025-01-12 05:52:18

Facebook 发布许多 Hive UDF，包括 NumberRows。根据您的配置单元版本（我相信是 0.8），您可能需要向类添加一个属性（stateful=true）。

回复收藏 0 原文

歌枕肩 2025-01-12 05:52:18

Pig 0.11 引入了一个可用于此目的的 RANK 运算符。

回复收藏 0 原文

甜味超标? 2025-01-12 05:52:18

对于想了解 Pig 的人们，我发现（当前）最好的方法是编写自己的 UDF。
我想为袋子中的元组添加行号。这是代码：

import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

public class RowCounter extends EvalFunc<DataBag> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
public DataBag exec(Tuple input) throws IOException {
    try {
        DataBag output = mBagFactory.newDefaultBag();
        DataBag bg = (DataBag)input.get(0);
        Iterator it = bg.iterator();
        Integer count = new Integer(1);
        while(it.hasNext())
            { Tuple t = (Tuple)it.next();
              t.append(count);
              output.add(t);
              count = count + 1;
            }

        return output;
    } catch (ExecException ee) {
        // error handling goes here
        throw ee;
    }
}
public Schema outputSchema(Schema input) {
     try{
         Schema bagSchema = new Schema();
         bagSchema.add(new Schema.FieldSchema(null, DataType.BAG));

         return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
                                                bagSchema, DataType.BAG));
     }catch (Exception e){
        return null;
     }
    }
}

此代码仅供参考。可能无法防错。

For folks wondering about Pig, I found the best way (currently) is to write your own UDF.
I wanted to add row numbers for tuples in a bag. This is the code for that:

import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;

public class RowCounter extends EvalFunc<DataBag> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
public DataBag exec(Tuple input) throws IOException {
    try {
        DataBag output = mBagFactory.newDefaultBag();
        DataBag bg = (DataBag)input.get(0);
        Iterator it = bg.iterator();
        Integer count = new Integer(1);
        while(it.hasNext())
            { Tuple t = (Tuple)it.next();
              t.append(count);
              output.add(t);
              count = count + 1;
            }

        return output;
    } catch (ExecException ee) {
        // error handling goes here
        throw ee;
    }
}
public Schema outputSchema(Schema input) {
     try{
         Schema bagSchema = new Schema();
         bagSchema.add(new Schema.FieldSchema(null, DataType.BAG));

         return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
                                                bagSchema, DataType.BAG));
     }catch (Exception e){
        return null;
     }
    }
}

This code is for reference only. Might not be error-proof.

回复收藏 0 原文

猫九 2025-01-12 05:52:18

在我的示例中，这对您来说是一个很好的答案

步骤 1. 定义 row_sequence() 函数来处理自动增加 ID

add jar /Users/trongtran/research/hadoop/dev/hive-0.9.0-bin/lib/hive-contrib-0.9.0.jar;
drop temporary function row_sequence;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

步骤 2. 插入唯一的 ID 和 ID。 STR

INSERT OVERWRITE TABLE new_table
SELECT 
    row_sequence(),
    STR_ID
FROM old_table;

This is good answer for you on my example

Step 1. Define row_sequence() function to process for auto increase ID

add jar /Users/trongtran/research/hadoop/dev/hive-0.9.0-bin/lib/hive-contrib-0.9.0.jar;
drop temporary function row_sequence;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

Step 2. Insert unique id & STR

INSERT OVERWRITE TABLE new_table
SELECT 
    row_sequence(),
    STR_ID
FROM old_table;

回复收藏 0 原文

不知在何时 2025-01-12 05:52:18

从版本 0.11 开始，hive 支持分析函数，如领先、落后以及行号

https://issues .apache.org/jira/browse/HIVE-896

回复收藏 0 原文

呆萌少年 2025-01-12 05:52:18

Hive 解决方案 -

select *
  ,rank() over (rand()) as row_num
  from table

或者，如果您想让行按 STR_ID 升序 -

select *
  ,rank() over (STR_ID,rank()) as row_num
  from table

Hive solution -

select *
  ,rank() over (rand()) as row_num
  from table

Or, if you want to have rows ascending by STR_ID -

select *
  ,rank() over (STR_ID,rank()) as row_num
  from table

回复收藏 0 原文

十二 2025-01-12 05:52:18

在蜂巢中：

select
str_id, ROW_NUMBER() OVER() as row_num 
from myTable;

In Hive:

select
str_id, ROW_NUMBER() OVER() as row_num 
from myTable;

回复收藏 0 原文

~没有更多了~

关于作者

无法言说的痛

暂无简介

文章

671 人气

关注发私信

5040234068

文章 0 评论 0

关注

樱花雨梦

文章 0 评论 0

关注

≈。彩虹

文章 0 评论 0

关注

雨轻弹

文章 0 评论 0

关注

血之狂魔

文章 0 评论 0

关注

qq_0bIjwE

文章 0 评论 0

友情链接

文江博客

如何为 PIG 或 HIVE 中的行添加行号？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（8）

关于作者

相关话题

热门标签

推荐作者

5040234068

樱花雨梦

≈。彩虹

雨轻弹

血之狂魔

qq_0bIjwE

友情链接

如何为 PIG 或 HIVE 中的行添加行号？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（8）

关于作者

相关话题

热门标签

推荐作者

5040234068

樱花雨梦

≈。彩虹

雨轻弹

血之狂魔

qq_0bIjwE

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。