Correctly accessing a Data Catalog table in Glue
I created a table in Athena, without a crawler, from an S3 source. It shows up in my Data Catalog. However, when I try to access it from a Python job in Glue ETL, it appears to have no columns and no data. Accessing a column raises the following error: AttributeError: 'DataFrame' object has no attribute '<COLUMN-NAME>'.
I am reading the table as a dynamic frame, the usual Glue way:
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datacatalog_database",
    table_name="table_name",
    transformation_ctx="datasource"
)
print(f"Count: {datasource.count()}")
print(f"Schema: {datasource.schema()}")
The logs show Count: 0 and Schema: StructType([], {}), whereas the Athena table shows around 800,000 rows.
Side notes:
- The ETL job concerned has AWSGlueServiceRole attached.
- I tried the Glue Visual Editor as well. It showed the Data Catalog database/table concerned but gave the same error.
Comments (1)
It looks like the S3 bucket has multiple nested folders inside it. For Glue to read these folders, you need to pass additional_options = {"recurse": True} to your from_catalog() call. With that flag set, Glue reads records recursively from the files under the table's S3 location.
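As a rough sketch of what that change might look like, the call from the question can be wrapped in a small helper that always passes the recurse option. The helper name read_catalog_table and its ctx parameter are hypothetical, made up for illustration. The sketch assumes it runs inside a Glue job where glueContext already exists, and that nested folders really are the cause of the empty result.

```python
def read_catalog_table(glue_context, database, table_name, ctx="datasource"):
    """Read a Data Catalog table as a DynamicFrame, including nested S3 folders.

    Hypothetical helper: glue_context is expected to be the job's GlueContext.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        transformation_ctx=ctx,
        # "recurse": True asks the S3 reader to also pick up files in subfolders
        additional_options={"recurse": True},
    )


# Usage inside a Glue job (names taken from the question):
# datasource = read_catalog_table(glueContext, "datacatalog_database", "table_name")
# print(f"Count: {datasource.count()}")
# print(f"Schema: {datasource.schema()}")
```

If the count and schema are still empty after this, the cause is probably something other than the folder layout.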