AWS Glue Crawler creates an empty table from TSV files but not from semicolon-separated files
I have an AWS Glue Crawler with 3 data stores on S3; each data store is the S3 path of a table. The crawler works well when the files are semicolon-separated, but breaks down when they are tab-separated.
However, according to the official AWS documentation, the built-in CSV classifier
Checks for the following delimiters: comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Ctrl-A is the Unicode control character for Start Of Heading.
Let me provide more details.
The S3 structure is as follows (all within the same bucket):
|--table_1
|---------|partition_a=1
|------------------|partition_b=2
|---------------------------|partition_c=3
|------------------------------------|partition_d=4
|-----------------------------------------------|file_1.csv
|--table_2
|---------|partition_a=1
|------------------|partition_b=2
|---------------------------|partition_c=3
|------------------------------------|partition_d=4
|-----------------------------------------------|file_2.csv
|--table_3
|---------|partition_a=1
|------------------|partition_b=2
|---------------------------|partition_c=3
|--------------------------------------|file_3a.csv
|---------------------------|partition_c=4
|--------------------------------------|file_3b.csv
|---------------------------|partition_c=5
|--------------------------------------|file_3c.csv
The crawler works as expected with table_1 and table_2, i.e. it creates 2 tables, identifies the classification as csv, creates 3 partitions, and detects the header.
However, it doesn't work properly for table_3:
- it does create a table in the data catalog;
- it does add the partitions (all of them, i.e. partition_c = 3, 4, and 5);
- however, it does not detect the schema, i.e. no columns at all.
No errors are reported in the CloudWatch logs, but if I query table_3 on Athena (SELECT * FROM Table_3 LIMIT 10) I get the following error:
"HIVE_UNKNOWN_ERROR: serDe should not be accessed from a null StorageFormat"
These are the main differences among the table files:
- Table_1 files are small, i.e. about 20 KB, and are semicolon-separated;
- Table_2 files are larger than Table_1's but still small, i.e. about 20 MB, and are semicolon-separated;
- Table_3 files are much larger, i.e. about 200 MB, and are tab-separated.
I have tried renaming the table_3 files to .tsv and re-running the crawler, but nothing changed.
I have also tried using a single smaller file for table_3, i.e. only partition_c=3 with a size of about 2 MB, but nothing changed.
Do you have any idea why this is happening and how to solve it?
Shall I create a custom classifier for the .tsv files only?
1 Answer
In order to create a TSV table, rather than relying on the built-in classifier's delimiter detection (comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001)), you need to create the table and schema definition via an AWS Athena editor query.
The component in Athena that is responsible for reading and parsing data is called a serde, short for serializer/deserializer. If you don’t specify anything else when creating an Athena table you get a serde called LazySimpleSerDe, which was made for delimited text such as CSV. It can be configured for different delimiters, escape characters, and line endings, among other things.
After you define the CREATE TABLE command and supply the table schema, you need to add the following:
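For example, a minimal sketch of such a statement (the column names and types below are hypothetical, since the question does not show the actual schema, and the bucket name is a placeholder; the important part is the ROW FORMAT clause, which configures LazySimpleSerDe with a tab delimiter):

CREATE EXTERNAL TABLE table_3 (
  -- hypothetical columns: replace with the real schema
  col_1 string,
  col_2 int
)
PARTITIONED BY (
  partition_a int,
  partition_b int,
  partition_c int
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'        -- tab-separated fields
  LINES TERMINATED BY '\n'
LOCATION 's3://your-bucket/table_3/'              -- placeholder bucket
TBLPROPERTIES ('skip.header.line.count' = '1');   -- skip the header row

After creating the table this way, the existing partitions still have to be loaded into the catalog, e.g. with MSCK REPAIR TABLE table_3; or with explicit ALTER TABLE table_3 ADD PARTITION ... statements.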
Behind the scenes, when you look at the actual table DDL, you will see how it parses the data with the following regex:
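The exact pattern depends on the table, but as an assumed illustration (this sketch is not taken from the question's table): if the DDL uses Hive's RegexSerDe, a tab-separated row is parsed with one capture group per column, e.g. for a two-column file:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capture group per column, separated by tabs (hypothetical two-column case)
  'input.regex' = '([^\\t]*)\\t([^\\t]*)'
)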
Read more:
LazySimpleSerDe for CSV, TSV, and custom-delimited files
Working with CSV