更新AWS胶水表时，可以自动更新分区元数据

发布于 2025-02-11 05:37:17 字数 3061 浏览 3 评论 0原文

我有一个分区的S3桶。我正在使用AWS Athena阅读分区中的数据。我通过云形式堆栈创建了雅典娜中使用的AWS胶水表。如果我更新云形式

S3ServerAccessLogsTable:
Type: AWS::Glue::Table
DependsOn: S3ServerAccessLogsDatabase
Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: 
      Fn::ImportValue: S3ServerAccessLogsDatabase
    TableInput:
      Name: s3_server_access_logs
      Description: !Sub
        - AWS GLUE table for viewing server access logs in ${S3Bucket}
        - S3Bucket: !Ref BucketName
      TableType: EXTERNAL_TABLE
      PartitionKeys:
        - Name: bucket
          Type: string
      StorageDescriptor:
        Columns:
          - Name: bucket_owner
            Type: string
          - Name: bucket_name
            Type: string
          - Name: request_date_time
            Type: string
          - Name: remote_ip
            Type: string
          - Name: requester
            Type: string
          - Name: request_id
            Type: string
          - Name: operation
            Type: string
          - Name: key
            Type: string
          - Name: request_uri_operation
            Type: string
          - Name: request_uri_key
            Type: string
          - Name: request_uri_httpProtoversion
            Type: string
          - Name: http_status
            Type: string
          - Name: error_code
            Type: string
          - Name: bytes_sent
            Type: bigint
          - Name: object_size
            Type: bigint
          - Name: total_time
            Type: string
          - Name: turnaround_time
            Type: string
          - Name: referrer
            Type: string
          - Name: user_agent
            Type: string
          - Name: version_id
            Type: string
          - Name: host_id
            Type: string
          - Name: sig_v
            Type: string
          - Name: cipher_suite
            Type: string
          - Name: auth_type
            Type: string
          - Name: end_point
            Type: string
          - Name: tls_version
            Type: string
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Location: !Sub
          - s3://${S3LoggingBucket}/s3-server-access-logs/
          - S3LoggingBucket: !Ref S3ServerAccessLogsBucket
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.RegexSerDe
          Parameters:
            {
              "serialization.format": "1",
              "input.regex": '([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ("[^"]*") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
            }

模板中的AWS胶水表，然后更新堆栈，我现在需要运行命令MSCK Repair Table; table_name＆gt ;;因此，分区元数据将被映射到桌子上。如果我不执行此操作，那么当我查询表时，没有可见的数据。

问题在于，我一直忘记运行MSCK维修表＆lt; table_name＆gt ;;命令以及其他不使用该过程的其他人肯定会忘记。

因此，我想知道是否有一种自动处理方法来处理此操作，以便在更新AWS胶水表时自动重新加载分区元数据？

原文

I have a partitioned S3 bucket. I am reading the data from the partitions using AWS Athena. I create the AWS Glue Table used in Athena via a CloudFormation stack. The relevant part of the stack is shown here

S3ServerAccessLogsTable:
Type: AWS::Glue::Table
DependsOn: S3ServerAccessLogsDatabase
Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: 
      Fn::ImportValue: S3ServerAccessLogsDatabase
    TableInput:
      Name: s3_server_access_logs
      Description: !Sub
        - AWS GLUE table for viewing server access logs in ${S3Bucket}
        - S3Bucket: !Ref BucketName
      TableType: EXTERNAL_TABLE
      PartitionKeys:
        - Name: bucket
          Type: string
      StorageDescriptor:
        Columns:
          - Name: bucket_owner
            Type: string
          - Name: bucket_name
            Type: string
          - Name: request_date_time
            Type: string
          - Name: remote_ip
            Type: string
          - Name: requester
            Type: string
          - Name: request_id
            Type: string
          - Name: operation
            Type: string
          - Name: key
            Type: string
          - Name: request_uri_operation
            Type: string
          - Name: request_uri_key
            Type: string
          - Name: request_uri_httpProtoversion
            Type: string
          - Name: http_status
            Type: string
          - Name: error_code
            Type: string
          - Name: bytes_sent
            Type: bigint
          - Name: object_size
            Type: bigint
          - Name: total_time
            Type: string
          - Name: turnaround_time
            Type: string
          - Name: referrer
            Type: string
          - Name: user_agent
            Type: string
          - Name: version_id
            Type: string
          - Name: host_id
            Type: string
          - Name: sig_v
            Type: string
          - Name: cipher_suite
            Type: string
          - Name: auth_type
            Type: string
          - Name: end_point
            Type: string
          - Name: tls_version
            Type: string
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Location: !Sub
          - s3://${S3LoggingBucket}/s3-server-access-logs/
          - S3LoggingBucket: !Ref S3ServerAccessLogsBucket
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.RegexSerDe
          Parameters:
            {
              "serialization.format": "1",
              "input.regex": '([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ("[^"]*") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*
If I update the AWS Glue Table in the CloudFormation template and then update the stack, I now need to run the command MSCK REPAIR TABLE <table_name>; in Athena so that the partition metadata will be mapped to the table. If I don't do this last step, there is no data visible when I query the table.
The problem is that I keep forgetting to run the MSCK REPAIR TABLE <table_name>; command and others who are not used to the process will definitely forget.
So I'm wondering if there is an automated way to handle this, so that the partition metadata is automatically reloaded when the AWS Glue table is updated?

            }

If I update the AWS Glue Table in the CloudFormation template and then update the stack, I now need to run the command MSCK REPAIR TABLE <table_name>; in Athena so that the partition metadata will be mapped to the table. If I don't do this last step, there is no data visible when I query the table.

The problem is that I keep forgetting to run the MSCK REPAIR TABLE <table_name>; command and others who are not used to the process will definitely forget.

So I'm wondering if there is an automated way to handle this, so that the partition metadata is automatically reloaded when the AWS Glue table is updated?

分享到QQ

分享到微博