Partition column in Hive

Published 2024-10-22 06:16:24

I have to partition the table in hive with a column which is also part of the table.

For eg:

Table: employee

Columns: employeeId, employeeName, employeeSalary

I have to partition the table using employeeSalary.
So I write the following query:

 CREATE TABLE employee (employeeId INT, employeeName STRING, employeeSalary INT) PARTITIONED BY (ds INT); 

I just used the name "ds" here as it didn't allow me to use the same name employeeSalary.

Is what I am doing right? Also, while inserting values into the table, I have to use a comma-separated file. Now the file consists of rows like:
2019,John,2000

as one row.
If I have to partition using salary my first partition would be all people for salary 2000. So the query would be

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (ds=2000);

Again after 100 entries with salary as 2000, I have next 500 entries with salary as 4000.
So I would again fire the query:

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (ds=4000);

PLEASE LET ME KNOW IF I AM RIGHT...


Comments (3)

烂柯人 2024-10-29 06:16:24

Here's how to create a Hive table with a partition on the column you specified:

CREATE TABLE employee (employeeId INT, employeeName STRING) PARTITIONED BY (employeeSalary INT);

The partition column is specified in the PARTITIONED BY section.
In the Hive shell you can run describe employee; and it will show all the columns in the table. With your CREATE TABLE you should see 4 columns, not the 3 you are trying to get.
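As a quick check in the Hive shell (the output below is a sketch; exact formatting varies by Hive version), the partition column is listed alongside the regular columns and again under the partition information section:

```sql
DESCRIBE employee;
-- employeeid       int
-- employeename     string
-- employeesalary   int
-- # Partition Information
-- employeesalary   int
```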

For your load command, you will want to specify all the partitions to write into. (I'm not very familiar with these, mostly basing this off of http://wiki.apache.org/hadoop/Hive/LanguageManual/DML#Syntax)

So something like

-- one LOAD statement per target partition:
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (employeeSalary=2000);
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (employeeSalary=4000);
梦幻的味道 2024-10-29 06:16:24

Here is how partitioned tables in Hive work:

1) The partition column's value is not stored in the actual data files in the warehouse; instead it is stored in the Hive metastore.

2) So the data files in the Hive warehouse directory should not contain the partition column data.

For your problem, these should be the steps:

1)

CREATE TABLE employee (employeeId INT, employeeName STRING) PARTITIONED BY (employeeSalary INT) STORED AS <your choice of format>;

This will create an entry in the Hive metastore recording that you have created a table with 2 columns (employeeId INT, employeeName STRING) and one partition column (employeeSalary INT).

2) Create a temp table, say emp_temp.

CREATE TABLE emp_temp (employeeId INT, employeeName STRING, employeeSalary INT) STORED AS TEXTFILE;

I am assuming your input files are in text format.
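One detail worth noting for the comma-separated input described in the question: Hive's default field delimiter is Ctrl-A, not a comma, so if the loaded columns come back NULL, the temp table can instead be declared with an explicit delimiter (a sketch, assuming a plain CSV layout with no quoting):

```sql
-- Temp table matching a comma-separated input file; ROW FORMAT
-- tells Hive how to split each line into columns.
CREATE TABLE emp_temp (employeeId INT, employeeName STRING, employeeSalary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```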

3) Copy all your files into the warehouse location of the emp_temp table, or run the following query (I am assuming you have all your data files in the ./examples/files folder):

LOAD DATA LOCAL INPATH './examples/files/*.txt' OVERWRITE INTO TABLE emp_temp;

4) Now run the following HQL (this will create partitions dynamically for you):

INSERT OVERWRITE TABLE employee PARTITION (employeeSalary)
SELECT employeeId, employeeName, employeeSalary
FROM emp_temp;
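One caveat: Hive's default strict mode usually rejects a fully dynamic partition insert like the one above, so the session may first need these settings (these are real Hive configuration properties; exact defaults vary by version):

```sql
-- Allow dynamic partitioning, including with no static partition key:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
```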

Thanks,
aditya

a√萤火虫的光℡ 2024-10-29 06:16:24

I think you should first load all the data into one table, then use the Hive extension for multiple inserts:

FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT          EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] 
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] 
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

Then, if you want, you can do:

FROM big_data_table
INSERT OVERWRITE TABLE table1 PARTITION (ds=2000)
  SELECT * WHERE employeeId > 0 AND employeeId < 101
INSERT OVERWRITE TABLE table2 PARTITION (ds=4000)
  SELECT * WHERE employeeId >= 101 AND employeeId <= 600;
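Whichever approach you use, you can check which partitions each insert actually created (using the hypothetical table names from the example above):

```sql
SHOW PARTITIONS table1;
SHOW PARTITIONS table2;
```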