Partition column in Hive
I have to partition a table in Hive on a column that is also part of the table.
For eg:
Table: employee
Columns: employeeId, employeeName, employeeSalary
I have to partition the table using employeeSalary.
So I write the following query:
CREATE TABLE employee (employeeId INT, employeeName STRING, employeeSalary INT) PARTITIONED BY (ds INT);
I just used the name "ds" here because Hive didn't allow me to reuse the name employeeSalary.
Is this right, what I am doing? Also, when inserting values into the table, I have to use a comma-separated file. The file consists of rows like:
2019,John,2000
as one row.
If I have to partition using salary, my first partition would be all people with salary 2000. So the query would be:
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (ds=2000);
Again, after 100 entries with salary 2000, I have the next 500 entries with salary 4000.
So I would again fire the query:
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (ds=4000);
Please let me know if I am right...
Comments (3)
Here's how to create a Hive table with a partition on the column you specified. The partition column is specified in the PARTITIONED BY section.
In the Hive shell you can run describe employee; and it will show all the columns in the table. With your CREATE TABLE you should see 4 columns, not the 3 you are trying to get.
For your load command, you will want to specify all the partitions to write into. (I'm not very familiar with these, mostly basing off of http://wiki.apache.org/hadoop/Hive/LanguageManual/DML#Syntax.)
So something like the following.
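The code block this answer originally ended with did not survive extraction. A minimal sketch of what it was likely pointing at, assuming the table layout from the question (the per-partition file name is a hypothetical placeholder):

```sql
-- Partition on the salary column itself instead of a separate "ds" column.
-- A partition column must NOT be repeated in the regular column list.
CREATE TABLE employee (
  employeeId INT,
  employeeName STRING
)
PARTITIONED BY (employeeSalary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load one file per partition; each file should contain only the two
-- non-partition columns, e.g. "2019,John".
LOAD DATA LOCAL INPATH './examples/files/salary_2000.txt'
OVERWRITE INTO TABLE employee PARTITION (employeeSalary = 2000);
```

With LOAD DATA the file is moved into the partition's directory as-is, which is why the partition value lives in the directory name rather than in the file.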
Here is how a partitioned table in Hive works:
1) The partition column's data/values are not stored in the actual data files in the warehouse; instead they are stored in the Hive metastore.
2) So you should not have the partition column's data in the data files in the Hive warehouse directory.
For your problem, these should be the steps:
1)
CREATE TABLE employee (employeeId INT, employeeName STRING) PARTITIONED BY (employeeSalary INT) STORED AS <your choice of format>;
This will create an entry in the Hive metastore recording that you have created a table with 2 columns (employeeId INT, employeeName STRING) and one partition column (employeeSalary INT).
2) Create a temp table, let's say emp_temp. I am assuming your input files are in text format.
3) Copy all your files into the warehouse location of the emp_temp table, OR run the following query (I am assuming you have all your data files in the ./examples/files folder):
LOAD DATA LOCAL INPATH './examples/files/*.txt' OVERWRITE INTO TABLE emp_temp;
4) Now run the following HQL (this will create the partitions dynamically for you).
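The HQL this step refers to was lost in extraction. A sketch of the dynamic-partition insert it describes, assuming emp_temp holds all three columns from the input file:

```sql
-- Enable dynamic partitioning (disabled or strict by default in many Hive versions).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hive routes each row into the partition matching its employeeSalary value;
-- the partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE employee PARTITION (employeeSalary)
SELECT employeeId, employeeName, employeeSalary
FROM emp_temp;
```

This avoids issuing one LOAD per salary value: every distinct employeeSalary in emp_temp gets its own partition in a single statement.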
Thanks,
aditya
Maybe, I think, you should first load all the data into one table, then use the Hive multi-insert extension:
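The multi-insert statement this answer points at was lost in extraction. A sketch under the assumption that the data was first loaded into a staging table (the name emp_staging is hypothetical):

```sql
-- One scan of the staging table feeds several static partitions.
FROM emp_staging
INSERT OVERWRITE TABLE employee PARTITION (employeeSalary = 2000)
  SELECT employeeId, employeeName WHERE employeeSalary = 2000
INSERT OVERWRITE TABLE employee PARTITION (employeeSalary = 4000)
  SELECT employeeId, employeeName WHERE employeeSalary = 4000;
```

The multi-insert form reads the source table once, which is cheaper than running a separate INSERT ... SELECT per partition.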
Then, if you want, you can