AWS Glue: An error occurred while calling o100.pyWriteDynamicFrame. Failed to find data source: UNKNOWN
I'm getting the following error when attempting to run a Glue pipeline that uploads a JSON file stored in S3 to Redshift:
An error occurred while calling o100.pyWriteDynamicFrame. Failed to find data source: UNKNOWN. Please find packages at http://spark.apache.org/third-party-projects.html
I have an output log file that includes the following errors, in this order:
InvocationTargetException java.lang.reflect.InvocationTargetException
Exception in User Class java.lang.reflect.UndeclaredThrowableException
My script is the following:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://numbeo-bucket/results.json"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1, mappings=[], transformation_ctx="ApplyMapping_node2"
)
# Script generated for node Redshift Cluster
RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="redshift-cluster-1",
    table_name="results_json",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="RedshiftCluster_node3",
)
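For context, one avenue sometimes tried when a catalog-based Redshift write fails with "Failed to find data source: UNKNOWN" is to write through a named Glue connection via `write_dynamic_frame.from_jdbc_conf` instead of `from_catalog`. A minimal sketch, assuming a Glue connection named "redshift-connection" and a target table `public.results_json` in a Redshift database `dev` (all three names are hypothetical, not from the script above):

```python
# Hypothetical sketch, not the original poster's code: write to Redshift through
# a named Glue connection rather than a Data Catalog table.
# "redshift-connection", "dev", and "public.results_json" are assumed names.
connection_options = {
    "dbtable": "public.results_json",  # assumed target schema.table
    "database": "dev",                 # assumed Redshift database name
}

# With the glueContext, ApplyMapping_node2, and args objects defined in the
# script above, the write would look like:
# RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=ApplyMapping_node2,
#     catalog_connection="redshift-connection",
#     connection_options=connection_options,
#     redshift_tmp_dir=args["TempDir"],
#     transformation_ctx="RedshiftCluster_node3",
# )
```

The difference is that `from_jdbc_conf` resolves the JDBC endpoint and credentials from the Glue connection itself, so it does not depend on the catalog table carrying that information.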