Where does Hive store files in HDFS?

Posted on 2024-10-18 10:03:51

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.

Where does Hive store its files in HDFS?

谈情不如逗狗 2024-10-25 10:03:51

Hive tables are not necessarily stored in the warehouse directory (since you can create tables located anywhere on HDFS).

You should use the DESCRIBE FORMATTED <table_name> command.

hive -S -e "describe formatted <table_name> ;" | grep 'Location' | awk '{ print $NF }'

Note that partitions may be stored in different locations. To get the location of the alpha=foo/beta=bar partition, add partition(alpha='foo',beta='bar') after <table_name>.
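
If you want the location from a script rather than by eye, the grep/awk pipeline above can be mirrored in Python. This is only a minimal sketch: extract_location is a hypothetical helper name, and the sample text is made-up output shaped like what DESCRIBE FORMATTED prints (a "Location:" label followed by the path), not output captured from a real cluster. Like awk's $NF, it takes the last whitespace-separated field, so it would not cope with a path containing spaces.

```python
from typing import Optional


def extract_location(describe_output: str) -> Optional[str]:
    """Return the path from the first 'Location:' row of DESCRIBE FORMATTED output."""
    for line in describe_output.splitlines():
        fields = line.split()
        # The row of interest looks like: "Location:    hdfs://host:8020/path"
        if len(fields) >= 2 and fields[0] == "Location:":
            return fields[-1]
    return None


# Illustrative sample only (assumed shape); real output contains many more rows.
sample = (
    "# Detailed Table Information\t \t \n"
    "Database:           \tdefault             \t \n"
    "Location:           \thdfs://namenode:8020/user/hive/warehouse/events\t \n"
    "Table Type:         \tMANAGED_TABLE       \t \n"
)

print(extract_location(sample))
```

In practice you would feed it the text captured from hive -S -e "describe formatted <table_name>;" instead of the sample string.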

快乐很简单 2024-10-25 10:03:51

The location they are stored on the HDFS is fairly easy to figure out once you know where to look. :)

If you go to http://NAMENODE_MACHINE_NAME:50070/ in your browser it should take you to a page with a Browse the filesystem link.

In the $HIVE_HOME/conf directory there is the hive-default.xml and/or hive-site.xml which has the hive.metastore.warehouse.dir property. That value is where you will want to navigate to after clicking the Browse the filesystem link.

In mine, it's /usr/hive/warehouse. Once I navigate to that location, I see the names of my tables. Clicking on a table name (which is just a folder) will then expose the partitions of the table. In my case, I currently only have it partitioned on date. When I click on the folder at this level, I will then see files (more partitioning will have more levels). These files are where the data is actually stored on the HDFS.

I have not attempted to access these files directly, but I assume it can be done. I would take GREAT care if you are thinking about editing them. :)
Personally, I'd figure out a way to do what I need without direct access to the Hive data on disk. If you need access to raw data, you can use a Hive query and output the result to a file. The output files will have the same structure (column delimiters, etc.) as the files on HDFS. I run queries like this all the time and convert the results to CSV.

The section about how to write data from queries to disk is https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries

UPDATE

Since Hadoop 3.0.0-alpha1, the default port numbers have changed: NAMENODE_MACHINE_NAME:50070 becomes NAMENODE_MACHINE_NAME:9870. Use the latter if you are running Hadoop 3.x. The full list of port changes is described in HDFS-9427.
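
As a rough illustration of the port change, here is a small Python sketch that builds the NameNode file-browser URL for a given Hadoop major version. The function name namenode_ui_url is hypothetical, the ports are only the defaults mentioned above (a cluster can override them in its own configuration), and the explorer.html path is an assumption that your NameNode UI exposes the file browser there.

```python
def namenode_ui_url(host: str, hadoop_major: int, hdfs_path: str = "/") -> str:
    """Build a NameNode file-browser URL using the default web UI port."""
    # Default ports only: 50070 on Hadoop 2.x, 9870 on Hadoop 3.x.
    port = 9870 if hadoop_major >= 3 else 50070
    return f"http://{host}:{port}/explorer.html#{hdfs_path}"


print(namenode_ui_url("NAMENODE_MACHINE_NAME", 2))
print(namenode_ui_url("NAMENODE_MACHINE_NAME", 3, "/user/hive/warehouse"))
```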

完美的未来在梦里 2024-10-25 10:03:51

In the Hive terminal, type:

hive> set hive.metastore.warehouse.dir;

(it will print the path)

长伴 2024-10-25 10:03:51

Running show create table <table_name> in the Hive CLI will also give you the exact location of your Hive table.

浮萍、无处依 2024-10-25 10:03:51

To summarize a few points posted earlier:
in hive-site.xml, the property hive.metastore.warehouse.dir specifies where the files are located in HDFS

<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/user/hive/warehouse</value>
</property>

To view files, use this command:

hadoop fs -ls /user/hive/warehouse

or

http://localhost:50070
Utilities > Browse the file system
or
http://localhost:50070/explorer.html#/

tested under hadoop-2.7.3, hive-2.1.1
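
If you would rather read the property programmatically than open the file, a minimal Python sketch follows. The helper name warehouse_dir is hypothetical and the XML string is just the fragment from above wrapped in a configuration element; it only reads what is written in the file and knows nothing about values overridden elsewhere.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def warehouse_dir(hive_site_xml: str) -> Optional[str]:
    """Return the value of hive.metastore.warehouse.dir, or None if it is not set."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "hive.metastore.warehouse.dir":
            value = prop.findtext("value")
            # Strip stray whitespace around the path, if any.
            return value.strip() if value is not None else None
    return None


# Sample content mirroring the fragment shown above.
sample = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>"""

print(warehouse_dir(sample))
```

To use it on a real installation, read the contents of $HIVE_HOME/conf/hive-site.xml into a string and pass that in.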

人│生佛魔见 2024-10-25 10:03:51

Run describe formatted <table_name>; inside the Hive shell.

Notice the "Location" value that shows the location of the table.

暗藏城府 2024-10-25 10:03:51

Another way to check where a specific table is stored is to run this query in the Hive interactive shell:

show create table table_name;

where table_name is the name of the subject table.

For the 'customers' table, the output of the above query would look something like this:

CREATE TABLE `customers`(
  `id` string, 
  `name` string)
COMMENT 'Imported by sqoop on 2016/03/01 13:01:49'
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='4', 
  'totalSize'='77', 
  'transient_lastDdlTime'='1456866115')

The LOCATION clause in the example above is what to focus on. It is the HDFS location of that table's data, which here sits under the Hive warehouse directory.
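
To pull that path out of the statement in a script, something like the following Python sketch could work. The helper name location_from_ddl is hypothetical, and the regular expression assumes the LOCATION keyword starts a line and is followed by a single-quoted path, as in the example output. It also removes any whitespace inside the quotes so that a path wrapped across lines is rejoined, which means it would mangle a path that genuinely contains spaces.

```python
import re
from typing import Optional

# LOCATION at the start of a line, followed by a single-quoted string.
_LOCATION_RE = re.compile(r"^LOCATION\s+'([^']*)'", re.MULTILINE)


def location_from_ddl(ddl: str) -> Optional[str]:
    """Extract the LOCATION path from SHOW CREATE TABLE output."""
    match = _LOCATION_RE.search(ddl)
    if match is None:
        return None
    # Drop whitespace so a path wrapped over several lines is rejoined.
    return "".join(match.group(1).split())


# Shortened version of the example statement above.
ddl = (
    "CREATE TABLE `customers`(\n"
    "  `id` string,\n"
    "  `name` string)\n"
    "LOCATION\n"
    "  'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers'\n"
    "TBLPROPERTIES (\n"
    "  'numFiles'='4')\n"
)

print(location_from_ddl(ddl))
```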

惟欲睡 2024-10-25 10:03:51

A Hive database is nothing more than a directory in HDFS with a .db extension.

So, from a Unix or Linux host connected to HDFS, search with one of the following commands, depending on your HDFS distribution:

hdfs dfs -ls -R / 2>/dev/null | grep '\.db$'
or
hadoop fs -ls -R / 2>/dev/null | grep '\.db$'

You will see the full paths of the .db database directories. All tables reside under their respective .db database directories.
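
The same filtering can be sketched in Python if you need the result as a list. The helper name database_dirs is hypothetical, and the sample listing is invented; it assumes the usual eight-column layout of a recursive listing (permissions, replication, owner, group, size, date, time, path).

```python
from typing import List


def database_dirs(ls_output: str) -> List[str]:
    """Return the paths of directories ending in .db from a recursive listing."""
    dirs = []
    for line in ls_output.splitlines():
        # Split into at most 8 fields so the path stays in one piece.
        fields = line.strip().split(None, 7)
        if len(fields) == 8 and fields[0].startswith("d") and fields[7].endswith(".db"):
            dirs.append(fields[7])
    return dirs


# Invented sample listing (assumed layout), not real cluster output.
sample = (
    "drwxr-xr-x   - hive hive          0 2016-03-01 13:01 /user/hive/warehouse\n"
    "drwxr-xr-x   - hive hive          0 2016-03-01 13:01 /user/hive/warehouse/sqoop_workspace.db\n"
    "drwxr-xr-x   - hive hive          0 2016-03-01 13:01 /user/hive/warehouse/sqoop_workspace.db/customers\n"
    "-rw-r--r--   3 hive hive         77 2016-03-01 13:01 /user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000\n"
)

print(database_dirs(sample))
```

Checking that the entry is a directory avoids picking up ordinary files whose names happen to end in .db.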

祁梦 2024-10-25 10:03:51

Hive tables are stored in the Hive warehouse directory.
By default, MapR configures the Hive warehouse directory to be /user/hive/warehouse under the root volume. This default is defined in $HIVE_HOME/conf/hive-default.xml.

王权女流氓 2024-10-25 10:03:51

In the Sandbox, look in /apps/hive/warehouse/; on a normal cluster, look in /user/hive/warehouse.

独闯女儿国 2024-10-25 10:03:51

If you look at the hive-site.xml file, you will see something like this:

<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/usr/hive/warehouse</value>
   <description>location of the warehouse directory</description>
</property>

/usr/hive/warehouse is the default location for all managed tables.
External tables may be stored at a different location.

describe formatted <table_name> is the Hive shell command that can be used more generally to find the location of the data belonging to a Hive table.

述情 2024-10-25 10:03:51

In Hive, tables are actually stored in a few places. Specifically, if you use partitions (which you should, if your tables are very large or growing) then each partition can have its own storage.

To show the default location where table data or partitions will be created when you create them through standard Hive commands (insert overwrite ... partition ... and such):

describe formatted dbname.tablename

To show the actual location of a particular partition within a HIVE table, instead do this:

describe formatted dbname.tablename partition (name=value)

If you look in your filesystem where a table "should" live, and you find no files there, it's very likely that the table is created (usually incrementally) by creating a new partition and pointing that partition at some other location. This is a great way of building tables from things like daily imports from third parties and such, which avoids having to copy the files around or storing them more than once in different places.
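
To make the difference between the default and the actual partition location concrete, here is a small Python sketch. Both helper names are hypothetical. It assumes the default layout in which each partition lives in nested key=value directories under the table location, and it ignores the escaping Hive applies to special characters in partition values. The actual location would come from describe formatted dbname.tablename partition (name=value).

```python
from typing import List, Tuple


def default_partition_path(table_location: str, spec: List[Tuple[str, str]]) -> str:
    """Build the default partition directory: <table>/<key1>=<val1>/<key2>=<val2>."""
    parts = [f"{key}={value}" for key, value in spec]
    return "/".join([table_location.rstrip("/")] + parts)


def is_relocated(table_location: str, spec: List[Tuple[str, str]], actual_location: str) -> bool:
    """True if the partition's actual location differs from the default one."""
    return actual_location.rstrip("/") != default_partition_path(table_location, spec)


# Hypothetical table location and partition spec, for illustration only.
table = "hdfs://namenode:8020/user/hive/warehouse/logs"
spec = [("alpha", "foo"), ("beta", "bar")]

print(default_partition_path(table, spec))
print(is_relocated(table, spec, "hdfs://namenode:8020/data/imports/2024-01-01"))
```

A partition reported as relocated is one whose files you will not find under the table's own directory.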
