更新AWS胶水表时,可以自动更新分区元数据
我有一个分区的S3桶。我正在使用AWS Athena阅读分区中的数据。我通过云形式堆栈创建了雅典娜中使用的AWS胶水表。 如果我更新云形式
S3ServerAccessLogsTable:
Type: AWS::Glue::Table
DependsOn: S3ServerAccessLogsDatabase
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName:
Fn::ImportValue: S3ServerAccessLogsDatabase
TableInput:
Name: s3_server_access_logs
Description: !Sub
- AWS GLUE table for viewing server access logs in ${S3Bucket}
- S3Bucket: !Ref BucketName
TableType: EXTERNAL_TABLE
PartitionKeys:
- Name: bucket
Type: string
StorageDescriptor:
Columns:
- Name: bucket_owner
Type: string
- Name: bucket_name
Type: string
- Name: request_date_time
Type: string
- Name: remote_ip
Type: string
- Name: requester
Type: string
- Name: request_id
Type: string
- Name: operation
Type: string
- Name: key
Type: string
- Name: request_uri_operation
Type: string
- Name: request_uri_key
Type: string
- Name: request_uri_httpProtoversion
Type: string
- Name: http_status
Type: string
- Name: error_code
Type: string
- Name: bytes_sent
Type: bigint
- Name: object_size
Type: bigint
- Name: total_time
Type: string
- Name: turnaround_time
Type: string
- Name: referrer
Type: string
- Name: user_agent
Type: string
- Name: version_id
Type: string
- Name: host_id
Type: string
- Name: sig_v
Type: string
- Name: cipher_suite
Type: string
- Name: auth_type
Type: string
- Name: end_point
Type: string
- Name: tls_version
Type: string
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Location: !Sub
- s3://${S3LoggingBucket}/s3-server-access-logs/
- S3LoggingBucket: !Ref S3ServerAccessLogsBucket
SerdeInfo:
SerializationLibrary: org.apache.hadoop.hive.serde2.RegexSerDe
Parameters:
{
"serialization.format": "1",
"input.regex": '([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ("[^"]*") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
}
模板中的AWS胶水表,然后更新堆栈,我现在需要运行命令MSCK Repair Table; table_name&gt ;;因此,分区元数据将被映射到桌子上。如果我不执行此操作,那么当我查询表时,没有可见的数据。
问题在于,我一直忘记运行MSCK维修表< table_name&gt ;;
命令以及其他不使用该过程的其他人肯定会忘记。
因此,我想知道是否有一种自动处理方法来处理此操作,以便在更新AWS胶水表时自动重新加载分区元数据?
I have a partitioned S3 bucket. I am reading the data from the partitions using AWS Athena. I create the AWS Glue Table used in Athena via a CloudFormation stack. The relevant part of the stack is shown here
S3ServerAccessLogsTable:
Type: AWS::Glue::Table
DependsOn: S3ServerAccessLogsDatabase
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName:
Fn::ImportValue: S3ServerAccessLogsDatabase
TableInput:
Name: s3_server_access_logs
Description: !Sub
- AWS GLUE table for viewing server access logs in ${S3Bucket}
- S3Bucket: !Ref BucketName
TableType: EXTERNAL_TABLE
PartitionKeys:
- Name: bucket
Type: string
StorageDescriptor:
Columns:
- Name: bucket_owner
Type: string
- Name: bucket_name
Type: string
- Name: request_date_time
Type: string
- Name: remote_ip
Type: string
- Name: requester
Type: string
- Name: request_id
Type: string
- Name: operation
Type: string
- Name: key
Type: string
- Name: request_uri_operation
Type: string
- Name: request_uri_key
Type: string
- Name: request_uri_httpProtoversion
Type: string
- Name: http_status
Type: string
- Name: error_code
Type: string
- Name: bytes_sent
Type: bigint
- Name: object_size
Type: bigint
- Name: total_time
Type: string
- Name: turnaround_time
Type: string
- Name: referrer
Type: string
- Name: user_agent
Type: string
- Name: version_id
Type: string
- Name: host_id
Type: string
- Name: sig_v
Type: string
- Name: cipher_suite
Type: string
- Name: auth_type
Type: string
- Name: end_point
Type: string
- Name: tls_version
Type: string
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Location: !Sub
- s3://${S3LoggingBucket}/s3-server-access-logs/
- S3LoggingBucket: !Ref S3ServerAccessLogsBucket
SerdeInfo:
SerializationLibrary: org.apache.hadoop.hive.serde2.RegexSerDe
Parameters:
{
"serialization.format": "1",
"input.regex": '([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ("[^"]*") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*
If I update the AWS Glue Table in the CloudFormation template and then update the stack, I now need to run the command MSCK REPAIR TABLE <table_name>;
in Athena so that the partition metadata will be mapped to the table. If I don't do this last step, there is no data visible when I query the table.
The problem is that I keep forgetting to run the MSCK REPAIR TABLE <table_name>;
command and others who are not used to the process will definitely forget.
So I'm wondering if there is an automated way to handle this, so that the partition metadata is automatically reloaded when the AWS Glue table is updated?
}
If I update the AWS Glue Table in the CloudFormation template and then update the stack, I now need to run the command MSCK REPAIR TABLE <table_name>;
in Athena so that the partition metadata will be mapped to the table. If I don't do this last step, there is no data visible when I query the table.
The problem is that I keep forgetting to run the MSCK REPAIR TABLE <table_name>;
command and others who are not used to the process will definitely forget.
So I'm wondering if there is an automated way to handle this, so that the partition metadata is automatically reloaded when the AWS Glue table is updated?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您是否查看过分区投影?您也可以将分区投影信息添加到CloudFormation。
Have you looked into partition projection? You can add the partition projection information to CloudFormation as well.