Where does Hive store files in HDFS?
I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.
Where does Hive store its files in HDFS?
Comments (12)
Hive tables may not necessarily be stored in a warehouse (since you can create tables located anywhere on the HDFS).

You should use the DESCRIBE FORMATTED <table_name> command. Please note that partitions may be stored in different places; to get the location of the alpha=foo/beta=bar partition you'd have to add partition(alpha='foo', beta='bar') after <table_name>.
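As a quick sketch of both forms (my_table is a placeholder table name; the alpha and beta partition columns are taken from the example above):

DESCRIBE FORMATTED my_table;
DESCRIBE FORMATTED my_table PARTITION (alpha='foo', beta='bar');

The Location field in the output is the HDFS directory backing the table, or that specific partition.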
The location they are stored in on HDFS is fairly easy to figure out once you know where to look. :)

If you go to http://NAMENODE_MACHINE_NAME:50070/ in your browser, it should take you to a page with a Browse the filesystem link.

In the $HIVE_HOME/conf directory there is hive-default.xml and/or hive-site.xml, which has the hive.metastore.warehouse.dir property. That value is where you will want to navigate to after clicking the Browse the filesystem link. In mine, it's /usr/hive/warehouse.

Once I navigate to that location, I see the names of my tables. Clicking on a table name (which is just a folder) will expose the partitions of the table. In my case, it is currently only partitioned on date. When I click on the folder at this level, I see the files (more partitioning will mean more levels). These files are where the data is actually stored on HDFS.

I have not attempted to access these files directly, but I'm assuming it can be done. I would take GREAT care if you are thinking about editing them. :)

For me, I'd figure out a way to do what I need without direct access to the Hive data on disk. If you need access to the raw data, you can use a Hive query and output the result to a file. The output files will have the exact same structure (delimiter between columns, etc.) as the files on HDFS; I do queries like this all the time and convert them to CSVs.

The section about how to write data from queries to disk is https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
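A minimal sketch of such an export, assuming a hypothetical table my_table and local output path /tmp/my_table_export (adjust both to your setup):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/my_table_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM my_table;

This writes the query result as delimited text files under the given local directory, without touching the table's own files in HDFS.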
UPDATE

Since Hadoop 3.0.0 - Alpha 1, the default port numbers changed: NAMENODE_MACHINE_NAME:50070 becomes NAMENODE_MACHINE_NAME:9870. Use the latter if you are running on Hadoop 3.x. The full list of port changes is described in HDFS-9427.
In the Hive terminal, type the command sketched below (it will print the path).
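Assuming the intended command was a property lookup, the Hive CLI's set command prints the current value of a configuration property:

set hive.metastore.warehouse.dir;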
It's also very possible that typing show create table <table_name> in the hive cli will give you the exact location of your hive table.
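As a sketch, using the customers table name borrowed from a later answer on this page:

show create table customers;

The generated DDL includes a LOCATION clause with the table's HDFS path.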
To summarize a few points posted earlier:

In hive-site.xml, the property hive.metastore.warehouse.dir specifies where the files are located under Hadoop HDFS.

To view the files, use one of the commands sketched below.

Tested under hadoop-2.7.3 and hive-2.1.1.
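For example, assuming the warehouse is at the default /user/hive/warehouse (substitute whatever hive.metastore.warehouse.dir points to on your cluster):

hdfs dfs -ls /user/hive/warehouse
or
hadoop fs -ls /user/hive/warehouse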
Type describe formatted <table_name>; inside the hive shell. Notice the "Location" value, which shows the location of the table.
Another way to check where a specific table is stored is to execute a query (sketched below) on the hive interactive interface, where table_name is the name of the subject table. An example of that query for the 'customers' table is also shown below. LOCATION in the example is what you should focus on; that is the HDFS location of your hive warehouse.

Don't forget to upvote if you like this solution. Cheers!
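One query that prints such a LOCATION clause, shown here as an assumption in both its generic form and for the 'customers' table mentioned above:

show create table table_name;
show create table customers;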
A Hive database is nothing but a directory within HDFS with a .db extension.

So, from a Unix or Linux host that is connected to HDFS, search with one of the following, depending on the type of HDFS distribution:

hdfs dfs -ls -R / 2>/dev/null|grep db
or
hadoop fs -ls -R / 2>/dev/null|grep db

You will see the full paths of the .db database directories. All tables will reside under their respective .db database directories.
Hive tables are stored in the Hive warehouse directory.

By default, MapR configures the Hive warehouse directory to be /user/hive/warehouse under the root volume. This default is defined in $HIVE_HOME/conf/hive-default.xml.
In the Sandbox, you need to go to /apps/hive/warehouse/, and on a normal cluster /user/hive/warehouse.
If you look at the hive-site.xml file you will see something like this:
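A sketch of the relevant property entry, with the /usr/hive/warehouse value used in this answer (yours may differ):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/usr/hive/warehouse</value>
</property>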
/usr/hive/warehouse is the default location for all managed tables.
External tables may be stored at a different location.
describe formatted <table_name>
is the hive shell command which can be use more generally to find the location of data pertaining to a hive table.在 Hive 中,表实际上存储在几个地方。具体来说,如果您使用分区(如果您的表非常大或不断增长,则应该这样做),那么每个分区都可以拥有自己的存储。
In Hive, tables are actually stored in a few places. Specifically, if you use partitions (which you should, if your tables are very large or growing) then each partition can have its own storage.
To show the default location where table data or partitions will be created if you create them through default HIVE commands (insert overwrite ... partition ... and such), and to show the actual location of a particular partition within a HIVE table, use the two queries sketched below.
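One way to do both, using the describe formatted form from earlier answers (my_table and the dt partition column are placeholder names):

describe formatted my_table;
describe formatted my_table partition (dt='2021-01-01');

The first prints the table-level Location; the second prints the Location of that particular partition, which may point somewhere else entirely.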
If you look in your filesystem where a table "should" live and you find no files there, it's very likely that the table is created (usually incrementally) by creating a new partition and pointing that partition at some other location. This is a great way of building tables from things like daily imports from third parties, and it avoids having to copy the files around or store them more than once in different places.