Exporting data from Hive to AWS Redshift

Posted 2025-01-13 17:47:58

I'm trying to export 1 TB of Hive data using hive -e, since we have no way to access the HDFS file system directly, and then load the data into Redshift. The export produced more than 30,000 small Parquet files that add up to 1 TB of data. Loading them into Redshift fails with this error:

String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: e9 (error 2)

Options Tried:

ACCEPTINVCHARS: not available for the Parquet format.
Loading via Athena -> Glue crawler -> Redshift: not a straightforward solution, as we would have to repeat it for 40+ tables in Hive.

How can we build a pipeline that copies the data from Hive and loads it into Redshift? The S3 load step can also be skipped.

天赋异禀 2025-01-20 17:47:58

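I held off on answering because I'm not a Hive expert. The problem is the character encoding of the files. Redshift uses multi-byte UTF-8 (like most of the internet), and these files are encoded differently (possibly UTF-16 coming from Windows, but that is only a guess). I believe Hive can operate on both of these character sets (by configuring the SerDe, though again, I'm not a Hive expert). I don't know whether Hive can read in one encoding and export in another.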
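To illustrate the kind of mismatch described above, here is a minimal Python sketch. It assumes, purely for illustration, that the offending byte 0xE9 came from a single-byte encoding such as ISO-8859-1 (Latin-1), where 0xE9 is "é". The actual source encoding of the Hive data is not known from the question, and the helper name first_invalid_utf8 is hypothetical.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, hex of offending byte) for the first invalid
    UTF-8 sequence in data, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, format(data[exc.start], "02x")
    return None


# Assumption for illustration: the text was written as Latin-1.
latin1_bytes = "café".encode("latin-1")   # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")       # b'caf\xc3\xa9'

print(first_invalid_utf8(latin1_bytes))   # (3, 'e9')
print(first_invalid_utf8(utf8_bytes))     # None
```

In UTF-8, 0xE9 is the lead byte of a three-byte sequence, so a lone 0xE9 that is not followed by two continuation bytes is rejected, which matches the shape of the error message in the question.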

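When I have used Hive, it preserved the input encoding through to the output. So one option is to change the file encoding to UTF-8 in the source system that feeds Hive. I have done this in the past with MySQL: export from MySQL as UTF-8 and feed the data through Hive to Redshift. This is the easiest approach, since it only means reconfiguring a step that already exists.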

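Another approach is to convert the files from one encoding to the other. The Linux command iconv can do this, or you could write some code for Lambda. This step can be inserted before or after Hive. You will need to know the current file encoding, which should be indicated by the file's BOM. You can read this with the Linux command "file".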
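The conversion step described above could be sketched in Python roughly as follows, for example as the core of a Lambda handler. This is a sketch under stated assumptions, not a definitive implementation: the names sniff_bom and to_utf8 are hypothetical, and the cp1252 fallback is only a guess for input that carries no BOM (single-byte encodings never have one). It operates on plain text bytes, such as a delimited text export, not on Parquet files, which are binary.

```python
import codecs

# BOM signatures, checked longest first so that a UTF-32 LE BOM
# (ff fe 00 00) is not mistaken for a UTF-16 LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def to_utf8(data: bytes, fallback: str = "cp1252") -> bytes:
    """Re-encode text bytes as UTF-8 without a BOM.

    fallback is an assumed source encoding used when no BOM is
    present; it must be chosen to match the real source system.
    """
    encoding = sniff_bom(data) or fallback
    return data.decode(encoding).encode("utf-8")


if __name__ == "__main__":
    print(to_utf8("café".encode("utf-16")))  # BOM detected
    print(to_utf8(b"caf\xe9"))               # no BOM, fallback used
```

On the command line, the equivalent for a file known to be Windows-1252 would be along the lines of iconv -f CP1252 -t UTF-8 input.txt > output.txt. For large files, a streaming variant that decodes and writes in chunks would be preferable to reading everything into memory.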

As I said above, it would be great if Hive could do the conversion. I just don't know whether it can.

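Bottom line: the issue is the encoding of the files Hive is operating on. They need to be changed to UTF-8 for Redshift. This can be done in the source system, with a conversion tool, or possibly in Hive itself.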

If you want to learn more about the subject, see: https://github.com/boostcon/cppnow_presentations_2014/blob/master/files/unicode-cpp.pdf
