更新AWS胶水表时,可以自动更新分区元数据

发布于 2025-02-11 05:37:17 字数 3061 浏览 3 评论 0原文

我有一个分区的S3桶。我正在使用AWS Athena阅读分区中的数据。我通过云形式堆栈创建了雅典娜中使用的AWS胶水表。 如果我更新云形式

S3ServerAccessLogsTable:
Type: AWS::Glue::Table
DependsOn: S3ServerAccessLogsDatabase
Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: 
      Fn::ImportValue: S3ServerAccessLogsDatabase
    TableInput:
      Name: s3_server_access_logs
      Description: !Sub
        - AWS GLUE table for viewing server access logs in ${S3Bucket}
        - S3Bucket: !Ref BucketName
      TableType: EXTERNAL_TABLE
      PartitionKeys:
        - Name: bucket
          Type: string
      StorageDescriptor:
        Columns:
          - Name: bucket_owner
            Type: string
          - Name: bucket_name
            Type: string
          - Name: request_date_time
            Type: string
          - Name: remote_ip
            Type: string
          - Name: requester
            Type: string
          - Name: request_id
            Type: string
          - Name: operation
            Type: string
          - Name: key
            Type: string
          - Name: request_uri_operation
            Type: string
          - Name: request_uri_key
            Type: string
          - Name: request_uri_httpProtoversion
            Type: string
          - Name: http_status
            Type: string
          - Name: error_code
            Type: string
          - Name: bytes_sent
            Type: bigint
          - Name: object_size
            Type: bigint
          - Name: total_time
            Type: string
          - Name: turnaround_time
            Type: string
          - Name: referrer
            Type: string
          - Name: user_agent
            Type: string
          - Name: version_id
            Type: string
          - Name: host_id
            Type: string
          - Name: sig_v
            Type: string
          - Name: cipher_suite
            Type: string
          - Name: auth_type
            Type: string
          - Name: end_point
            Type: string
          - Name: tls_version
            Type: string
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Location: !Sub
          - s3://${S3LoggingBucket}/s3-server-access-logs/
          - S3LoggingBucket: !Ref S3ServerAccessLogsBucket
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.RegexSerDe
          Parameters:
            {
              "serialization.format": "1",
              "input.regex": '([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ("[^"]*") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
            }

模板中的AWS胶水表,然后更新堆栈,我现在需要运行命令MSCK Repair Table; table_name&gt ;;因此,分区元数据将被映射到桌子上。如果我不执行此操作,那么当我查询表时,没有可见的数据。

问题在于,我一直忘记运行MSCK维修表< table_name&gt ;;命令以及其他不使用该过程的其他人肯定会忘记。

因此,我想知道是否有一种自动处理方法来处理此操作,以便在更新AWS胶水表时自动重新加载分区元数据?

I have a partitioned S3 bucket. I am reading the data from the partitions using AWS Athena. I create the AWS Glue Table used in Athena via a CloudFormation stack. The relevant part of the stack is shown here

S3ServerAccessLogsTable:
Type: AWS::Glue::Table
DependsOn: S3ServerAccessLogsDatabase
Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: 
      Fn::ImportValue: S3ServerAccessLogsDatabase
    TableInput:
      Name: s3_server_access_logs
      Description: !Sub
        - AWS GLUE table for viewing server access logs in ${S3Bucket}
        - S3Bucket: !Ref BucketName
      TableType: EXTERNAL_TABLE
      PartitionKeys:
        - Name: bucket
          Type: string
      StorageDescriptor:
        Columns:
          - Name: bucket_owner
            Type: string
          - Name: bucket_name
            Type: string
          - Name: request_date_time
            Type: string
          - Name: remote_ip
            Type: string
          - Name: requester
            Type: string
          - Name: request_id
            Type: string
          - Name: operation
            Type: string
          - Name: key
            Type: string
          - Name: request_uri_operation
            Type: string
          - Name: request_uri_key
            Type: string
          - Name: request_uri_httpProtoversion
            Type: string
          - Name: http_status
            Type: string
          - Name: error_code
            Type: string
          - Name: bytes_sent
            Type: bigint
          - Name: object_size
            Type: bigint
          - Name: total_time
            Type: string
          - Name: turnaround_time
            Type: string
          - Name: referrer
            Type: string
          - Name: user_agent
            Type: string
          - Name: version_id
            Type: string
          - Name: host_id
            Type: string
          - Name: sig_v
            Type: string
          - Name: cipher_suite
            Type: string
          - Name: auth_type
            Type: string
          - Name: end_point
            Type: string
          - Name: tls_version
            Type: string
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Location: !Sub
          - s3://${S3LoggingBucket}/s3-server-access-logs/
          - S3LoggingBucket: !Ref S3ServerAccessLogsBucket
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.RegexSerDe
          Parameters:
            {
              "serialization.format": "1",
              "input.regex": '([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ("[^"]*") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*

If I update the AWS Glue Table in the CloudFormation template and then update the stack, I now need to run the command MSCK REPAIR TABLE <table_name>; in Athena so that the partition metadata will be mapped to the table. If I don't do this last step, there is no data visible when I query the table.

The problem is that I keep forgetting to run the MSCK REPAIR TABLE <table_name>; command and others who are not used to the process will definitely forget.

So I'm wondering if there is an automated way to handle this, so that the partition metadata is automatically reloaded when the AWS Glue table is updated?

}

If I update the AWS Glue Table in the CloudFormation template and then update the stack, I now need to run the command MSCK REPAIR TABLE <table_name>; in Athena so that the partition metadata will be mapped to the table. If I don't do this last step, there is no data visible when I query the table.

The problem is that I keep forgetting to run the MSCK REPAIR TABLE <table_name>; command and others who are not used to the process will definitely forget.

So I'm wondering if there is an automated way to handle this, so that the partition metadata is automatically reloaded when the AWS Glue table is updated?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

鹤仙姿 2025-02-18 05:37:17

您是否查看过分区投影?您也可以将分区投影信息添加到CloudFormation。

Have you looked into partition projection? You can add the partition projection information to CloudFormation as well.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文