在 Spark SQL 中显示在 Hive 表上创建表 - 将 CHAR、VARCHAR 视为字符串

发布于 2025-01-13 04:41:40 字数 788 浏览 2 评论 0原文

我需要为 Hive 表和生成 DDL 语句。以编程方式查看。我尝试使用 Spark 和 Beeline 来完成此任务。 Beeline 每个语句大约需要 5-10 秒，而 Spark 在几毫秒内完成同样的事情。我计划使用 Spark，因为它比直线更快。使用 Spark 从配置单元获取 DDL 语句的一个缺点是，它将 CHAR、VARCHAR 字符视为字符串，并且不保留 CHAR、VARCHAR 数据类型附带的长度信息。同时beeline保留了CHAR、VARCHAR数据类型的数据类型和长度信息。我正在使用 Spark 2.4.1 和 Beeline 2.1.1。

下面给出了示例 create table 命令及其 show create table 输出。

直线输出：

Spark-Shell:

我想知道 Spark 端是否有任何配置需要保留CHAR、VARCHAR 数据类型的数据类型和长度信息。如果有其他方法可以快速从 Hive 获取 DDL，我也可以接受。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

ι不睡觉的鱼゛ 2025-01-20 04:41:41

这是在

Hive 3.1.1
Spark 3.1.1

您提出的堆栈溢出问题中，我引用：

“我需要以编程方式为 Hive 表和视图生成 DDL 语句。我尝试使用 Spark 和 Beeline 来完成此任务。Beeline 每个任务大约需要 5-10 秒语句，而 Spark 在几毫秒内完成同样的事情，因为它比 beeline 更快，使用 Spark 从 hive 获取 DDL 语句的一个缺点是，它处理 CHAR， VARCHAR 字符作为字符串，它不保留 CHAR、VARCHAR 数据类型的长度信息，同时 beeline 保留 CHAR、VARCHAR 数据类型的数据类型和长度信息。 Beeline 2.1.1。下面给出了示例创建表命令及其显示创建表输出。”

在测试数据库的 Hive 中创建一个简单的表现

hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name              data_type               comment
id                      bigint
col1                    varchar(30)
col2                    string

# Detailed Table Information
Database:               test
OwnerType:              USER
Owner:                  hduser
CreateTime:             Fri Mar 11 18:29:34 GMT 2022
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
        bucketing_version       2
        numFiles                0
        numRows                 0
        rawDataSize             0
        totalSize               0
        transient_lastDdlTime   1647023374

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1

在让我们转到 Spark-Shell

scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt                                                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
  `id` BIGINT,
  `col1` VARCHAR(30),
  `col2` STRING)
USING text
TBLPROPERTIES (
  'bucketing_version' = '2',
  'transient_lastDdlTime' = '1647023374')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

您可以看到 Spark 正确显示列

现在让我们通过 beeline 在 Hive 中创建相同的表现


0: jdbc:hive2://rhes75:10099/default> use test

No rows affected (0.019 seconds)


0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)

. . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)

0: jdbc:hive2://rhes75:10099/default> desc formatted etc

. . . . . . . . . . . . . . . . . . > +-------------------------------+----------------------------------------------------+----------------------------------------------------+

|           col_name            |                     data_type                      |                      comment                       |

+-------------------------------+----------------------------------------------------+----------------------------------------------------+

| # col_name                    | data_type                                          | comment                                            |

| id                            | bigint                                             |                                                    |

| col1                          | varchar(30)                                        |                                                    |

| col2                          | string                                             |                                                    |

|                               | NULL                                               | NULL                                               |

| # Detailed Table Information  | NULL                                               | NULL                                               |

| Database:                     | test                                               | NULL                                               |

| OwnerType:                    | USER                                               | NULL                                               |

| Owner:                        | hduser                                             | NULL                                               |

| CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                       | NULL                                               |

| LastAccessTime:               | UNKNOWN                                            | NULL                                               |

| Retention:                    | 0                                                  | NULL                                               |

| Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL                                               |

| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |

| Table Parameters:             | NULL                                               | NULL                                               |

|                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |

|                               | bucketing_version                                  | 2                                                  |

|                               | numFiles                                           | 0                                                  |

|                               | numRows                                            | 0                                                  |

|                               | rawDataSize                                        | 0                                                  |

|                               | totalSize                                          | 0                                                  |

|                               | transient_lastDdlTime                              | 1647024660                                         |

|                               | NULL                                               | NULL                                               |

| # Storage Information         | NULL                                               | NULL                                               |

| SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL                                               |

| InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL                                               |

| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                                               |

| Compressed:                   | No                                                 | NULL                                               |

| Num Buckets:                  | -1                                                 | NULL                                               |

| Bucket Columns:               | []                                                 | NULL                                               |

| Sort Columns:                 | []                                                 | NULL                                               |

| Storage Desc Params:          | NULL                                               | NULL                                               |

|                               | serialization.format                               | 1                                                  |

+-------------------------------+----------------------------------------------------+----------------------------------------------------+

33 rows selected (0.159 seconds)

在再次在 Spark-Shell 中检查

scala> spark.sql("show create table test.etc").show(false)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

|createtab_stmt                                                                                                                                                                      |

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

|CREATE TABLE `test`.`etc` (

  `id` BIGINT,

  `col1` VARCHAR(30),

  `col2` STRING)

USING text

TBLPROPERTIES (

  'bucketing_version' = '2',

  'transient_lastDdlTime' = '1647024660')

|

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

它显示 OK。总之，您可以在 Spark 中获得列定义，就像在 Hive 中定义它们一样。

在上面的声明中，我引用“我正在使用 Spark 2.4.1 和 Beeline 2.1.1”，指的是旧版本的 Spark 和 hive 可能存在此类问题。

This is in

Hive 3.1.1
Spark 3.1.1

Your stack overflow issue raised and I quote:

"I have a need to generate DDL statements for Hive tables & views programmatically. I tried using Spark and Beeline for this task. Beeline takes around 5-10 seconds for each of the statements whereas Spark completes the same thing in a few milliseconds. I am planning to use Spark since it is faster compared to beeline. One downside of using spark for getting DDL statements from the hive is, it treats CHAR, VARCHAR characters as String and it doesn't preserve the length information that goes with CHAR,VARCHAR data types. At the same time beeline preserves the data type and the length information for CHAR,VARCHAR data types. I am using Spark 2.4.1 and Beeline 2.1.1. Given below the sample create table command and its show create table output."

Create a simple table in Hive in test database

hive> use test;
OK
hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
OK
hive> desc formatted etc;
# col_name              data_type               comment
id                      bigint
col1                    varchar(30)
col2                    string

# Detailed Table Information
Database:               test
OwnerType:              USER
Owner:                  hduser
CreateTime:             Fri Mar 11 18:29:34 GMT 2022
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
        bucketing_version       2
        numFiles                0
        numRows                 0
        rawDataSize             0
        totalSize               0
        transient_lastDdlTime   1647023374

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1

Now let's go to spark-shell

scala> spark.sql("show create table test.etc").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt                                                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `test`.`etc` (
  `id` BIGINT,
  `col1` VARCHAR(30),
  `col2` STRING)
USING text
TBLPROPERTIES (
  'bucketing_version' = '2',
  'transient_lastDdlTime' = '1647023374')
|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

You can see Spark shows columns correctly

Now let us go and create the same table in hive through beeline


0: jdbc:hive2://rhes75:10099/default> use test

No rows affected (0.019 seconds)


0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)

. . . . . . . . . . . . . . . . . . > No rows affected (0.304 seconds)

0: jdbc:hive2://rhes75:10099/default> desc formatted etc

. . . . . . . . . . . . . . . . . . > +-------------------------------+----------------------------------------------------+----------------------------------------------------+

|           col_name            |                     data_type                      |                      comment                       |

+-------------------------------+----------------------------------------------------+----------------------------------------------------+

| # col_name                    | data_type                                          | comment                                            |

| id                            | bigint                                             |                                                    |

| col1                          | varchar(30)                                        |                                                    |

| col2                          | string                                             |                                                    |

|                               | NULL                                               | NULL                                               |

| # Detailed Table Information  | NULL                                               | NULL                                               |

| Database:                     | test                                               | NULL                                               |

| OwnerType:                    | USER                                               | NULL                                               |

| Owner:                        | hduser                                             | NULL                                               |

| CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                       | NULL                                               |

| LastAccessTime:               | UNKNOWN                                            | NULL                                               |

| Retention:                    | 0                                                  | NULL                                               |

| Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL                                               |

| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |

| Table Parameters:             | NULL                                               | NULL                                               |

|                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |

|                               | bucketing_version                                  | 2                                                  |

|                               | numFiles                                           | 0                                                  |

|                               | numRows                                            | 0                                                  |

|                               | rawDataSize                                        | 0                                                  |

|                               | totalSize                                          | 0                                                  |

|                               | transient_lastDdlTime                              | 1647024660                                         |

|                               | NULL                                               | NULL                                               |

| # Storage Information         | NULL                                               | NULL                                               |

| SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL                                               |

| InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL                                               |

| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                                               |

| Compressed:                   | No                                                 | NULL                                               |

| Num Buckets:                  | -1                                                 | NULL                                               |

| Bucket Columns:               | []                                                 | NULL                                               |

| Sort Columns:                 | []                                                 | NULL                                               |

| Storage Desc Params:          | NULL                                               | NULL                                               |

|                               | serialization.format                               | 1                                                  |

+-------------------------------+----------------------------------------------------+----------------------------------------------------+

33 rows selected (0.159 seconds)

Now check that in spark-shell again

scala> spark.sql("show create table test.etc").show(false)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

|createtab_stmt                                                                                                                                                                      |

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

|CREATE TABLE `test`.`etc` (

  `id` BIGINT,

  `col1` VARCHAR(30),

  `col2` STRING)

USING text

TBLPROPERTIES (

  'bucketing_version' = '2',

  'transient_lastDdlTime' = '1647024660')

|

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

It shows OK. So in summary you get column definitions in Spark as you have defined them in Hive.

In your statement above and I quote "I am using Spark 2.4.1 and Beeline 2.1.1", refers to older versions of Spark and hive which may have had such issues.

回复收藏 0 原文

~没有更多了~