Azure Synapse pipeline running a Spark notebook generates random errors

Posted 2025-01-10 20:27:46


I am processing approximately 19,710 directories containing IIS log files in an Azure Synapse Spark notebook. There are 3 IIS log files in each directory. The notebook reads the 3 files in the directory and converts them from delimited text to Parquet. No partitioning. But occasionally I get one of the following two errors for no apparent reason.


{
    "errorCode": "2011",
    "message": "An error occurred while sending the request.",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

When I get the error above, all of the data was successfully written to the appropriate folder in Azure Data Lake Storage Gen2.

Sometimes I get:

{
    "errorCode": "6002",
    "message": "(3,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(4,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(12,13): error CS0103: The name 'spark' does not exist in the current context",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

When I get the error above, none of the data was successfully written to the appropriate folder in Azure Data Lake Storage Gen2.

In both cases you can see that the notebook did run for a period of time.
I have enabled 1 retry on the Spark notebook. It is a PySpark notebook that uses Python for the parameters, with the remainder of the logic in C# via %%csharp. The Spark pool is small (4 cores / 32 GB) with 5 nodes.

The only conversion going on in the notebook is converting a string column to a timestamp.

var dfConverted = dfparquetTemp.WithColumn("Timestamp",Col("Timestamp").Cast("timestamp"));
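
For context, a minimal sketch of how that cast plus the Parquet write might look in the %%csharp cell (dfparquetTemp and targetFile stand in for whatever the notebook actually builds and passes; this is an illustration, not the notebook's exact code):

    // Sketch only: cast the string Timestamp column and write the frame out as Parquet.
    var dfConverted = dfparquetTemp.WithColumn("Timestamp", Col("Timestamp").Cast("timestamp"));
    dfConverted.Write().Mode(SaveMode.Overwrite).Parquet(targetFile);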

When I say this is random: the pipeline is currently running, and after processing 215 directories there are 2 occurrences of the first failure and one of the second.

Any ideas or suggestions would be appreciated.


Comments (3)

夏尔 2025-01-17 20:27:46


OK, after running for 113 hours (it's almost done) I am still getting the following errors, but it looks like all of the data was written out.

Count 1

{
    "errorCode": "6002",
    "message": "(3,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(4,17): error CS0234: The type or namespace name 'Spark' does not exist in the namespace 'Microsoft' (are you missing an assembly reference?)\n(12,13): error CS0103: The name 'spark' does not exist in the current context",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

Count 1

{
    "errorCode": "6002",
    "message": "Exception: Failed to create Livy session for executing notebook. LivySessionId: 4419, Notebook: Convert IIS to Raw Data Parquet.\n--> LivyHttpRequestFailure: Something went wrong while processing your request. Please try again later. HTTP status code: 500. Trace ID: e0860852-40e6-498f-b2df-4eff9fee504a.",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

Count 17

{
    "errorCode": "2011",
    "message": "An error occurred while sending the request.",
    "failureType": "UserError",
    "target": "Call Convert IIS To Raw Data Parquet",
    "details": []
}

Not sure what these errors are about, and of course I will rerun the specific data in the pipeline to see if this is a one-off or keeps occurring on that specific data. But it seems as if these errors are occurring after the data has been written to Parquet format.

舞袖。长 2025-01-17 20:27:46


Well, I think this was part of the issue. Keep in mind that I am writing the main part of the logic in C#, so your mileage in another language may vary. Also, these are space-delimited IIS log files, and they can be multiple megabytes in size; a single file can be 30 MB.

My new code has been running for 17 hours without a single error. All of the changes I made were to ensure that I disposed of resources that would consume memory. Examples follow:

When reading a delimited text file as a binary file:

    var df = spark.Read().Format("binaryFile").Option("inferSchema", false).Load(sourceFile) ;            
    byte[] rawData = df.First().GetAs<byte[]>("content");

The data in the byte[] eventually gets loaded into a List<GenericRow>, but I never set the rawData variable to null.
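
For illustration, a minimal sketch of that byte[]-to-List<GenericRow> step, assuming straightforward line and space splitting (illustrative only, not the notebook's exact parsing code):

    // Illustrative sketch: turn the raw IIS log bytes into one GenericRow per log line.
    using System.Collections.Generic;
    using System.Text;
    using Microsoft.Spark.Sql;

    var rows = new List<GenericRow>();
    foreach (var line in Encoding.UTF8.GetString(rawData).Split('\n'))
    {
        var trimmed = line.Trim();
        if (trimmed.Length == 0 || trimmed.StartsWith("#")) continue; // skip IIS directive/header lines
        rows.Add(new GenericRow(trimmed.Split(' ')));                 // one space-delimited field per column
    }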

After filling the byte[] from the data frame above, I added:

    df.Unpersist() ;

After putting all of the data from the byte[] into the List<GenericRow> rows and adding it to a data frame using the code below, I cleared out the rows variable:

    var dfparquetTemp = spark.CreateDataFrame(rows,inputSchema);
    rows.Clear() ;

Finally, after changing a column type and writing out the data, I did an unpersist on the data frame:

    var dfConverted = dfparquetTemp.WithColumn("Timestamp",Col("Timestamp").Cast("timestamp"));
    if(overwrite) {
        dfConverted.Write().Mode(SaveMode.Overwrite).Parquet(targetFile) ;
    }
    else {
        dfConverted.Write().Mode(SaveMode.Append).Parquet(targetFile) ;
    }
    dfConverted.Unpersist() ; 

Finally, I have most of my logic inside of a C# method that gets called in a foreach loop, in the hope that the CLR will dispose of anything else I missed.

And last but not least a lesson learned.

  • When reading a directory containing multiple Parquet files, Spark seems to read all of the files into the data frame.
  • When reading a directory containing multiple delimited text files that you are treating as binary files, Spark reads only ONE of the files into the data frame.

So in order to process multiple delimited text files out of a folder, I had to pass in the names of the multiple files and process the first file with SaveMode.Overwrite and the other files with SaveMode.Append. Every attempt to use any kind of wildcard, or to specify just the directory name, only ever resulted in reading one file into the data frame. (Trust me here, after hours of GoogleFu I tried every method I could find.)
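
A minimal sketch of that overwrite-then-append pattern (sourceFiles, targetFile, and ProcessFile below are illustrative names, not the notebook's actual code):

    // Illustrative sketch: the first file replaces the target folder, the rest are appended.
    bool firstFile = true;
    foreach (var sourceFile in sourceFiles)            // the explicit file names passed in for one directory
    {
        var mode = firstFile ? SaveMode.Overwrite : SaveMode.Append;
        ProcessFile(sourceFile, targetFile, mode);     // hypothetical method wrapping the read/convert/write logic above
        firstFile = false;
    }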

Again, 17 hours into processing with not one single error, so one important lesson seems to be to keep your memory usage as low as possible.

梦途 2025-01-17 20:27:46


OK, I am adding another answer rather than editing the existing ones. After 113 hours I had 52 errors that I had to reprocess. I found that some of the errors were due to "Kryo serialization failed: Buffer overflow. Available: 0, required: 19938070. To avoid this, increase spark.kryoserializer.buffer.max". Well, after a few hours of GoogleFu (which also included increasing the size of my spark pool from small to medium, which on its own had no effect), I added this as the first cell in my notebook:

%%configure
{
    "conf":
    {
        "spark.kryoserializer.buffer.max" : "512"
    }
}

So this fixed the Kryo serialization failure, and I believe the larger spark pool has fixed all of the remaining errors, because they are now all processing successfully. Also, jobs that previously failed after 2 hours of running are now completing in 30 minutes. I suspect this speed increase is due to the larger spark pool's memory. So, lesson learned: do not use the small pool for IIS files.

Finally, something that bugged me: when you type %%configure into an empty cell, Microsoft so unhelpfully puts in the following crap:

%%configure
{
    # You can get a list of valid parameters to config the session from https://github.com/cloudera/livy#request-body.
    "driverMemory": "28g", # Recommended values: ["28g", "56g", "112g", "224g", "400g", "472g"]
    "driverCores": 4, # Recommended values: [4, 8, 16, 32, 64, 80]
    "executorMemory": "28g",
    "executorCores": 4,
    "jars": ["abfs[s]: //<file_system>@<account_name>.dfs.core.windows.net/<path>/myjar.jar", "wasb[s]: //<containername>@<accountname>.blob.core.windows.net/<path>/myjar1.jar"],
    "conf":
    {
        # Example of standard spark property, to find more available properties please visit: https://spark.apache.org/docs/latest/configuration.html#application-properties.
        "spark.driver.maxResultSize": "10g",
        # Example of customized property, you can specify count of lines that Spark SQL returns by configuring "livy.rsc.sql.num-rows".
        "livy.rsc.sql.num-rows": "3000"
    }
}

I call it crap because IT HAS COMMENTS IN IT. If you try to just add in the one setting you want on top of that template, it will fail due to the comments (the %%configure body has to be valid JSON, and JSON does not allow comments). JUST BE WARNED.
