How do I print / log output from inside a foreachBatch function?

Posted on 2025-02-06 12:44:07


Using table streaming, I am trying to write the stream with foreachBatch:

df.writeStream
  .format("delta")
  .foreachBatch(WriteStreamToDelta)
  ...

WriteStreamToDelta looks like this:

def WriteStreamToDelta(microDF, batch_id):
    microDFWrangled = microDF."some_transformations"

    print(microDFWrangled.count())  # <-- How do I achieve the equivalent of this?

    microDFWrangled.writeStream...

I would like to see the number of rows:

  1. In the notebook, below the writeStream cell
  2. In the driver log
  3. Appended to a list, one entry per micro-batch (a sketch covering all three follows below)
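For context, a runnable sketch of the setup described above might look like the following. The pass-through transformation and the target path are placeholders, not from the original post; print() and the list append work because the foreachBatch function runs on the driver:

from pyspark.sql import DataFrame

batch_counts = []  # driver-side list, one entry per micro-batch

def WriteStreamToDelta(microDF: DataFrame, batch_id: int):
    microDFWrangled = microDF  # placeholder for the actual transformations

    # count() triggers an extra job per micro-batch; print() runs on the driver,
    # so its output goes to the driver log (see the awaitTermination note below
    # for seeing it under the notebook cell)
    n = microDFWrangled.count()
    print(f"batch {batch_id}: {n} rows")
    batch_counts.append(n)

    # inside foreachBatch the micro-batch is a static DataFrame, so use .write
    (microDFWrangled.write
        .format("delta")
        .mode("append")
        .save("/tmp/delta/target"))  # hypothetical path

The drawback is the extra count() job per batch, which is what the answer below avoids by using a StreamingQueryListener.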

Comments (1)

怂人 2025-02-13 12:44:07


Running microDFWrangled.count() in each micro-batch is going to be somewhat expensive. A much more efficient approach is to use a StreamingQueryListener, which can send output to the console, the driver logs, an external database, etc.

StreamingQueryListener is efficient because it uses the internal streaming statistics, so there is no need to run extra computation just to get the record count.

However, in the case of PySpark this feature works in Databricks starting with DBR 11.0, and in OSS Spark I think it is only available in the most recent releases.

Reference: https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
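A rough sketch of such a listener in PySpark is shown below. It assumes a Spark/DBR version where pyspark.sql.streaming.StreamingQueryListener is available (per the version note above) and an active SparkSession named spark; the class name is made up, and print() is just an example sink, the output could equally go to a table or an external store:

from pyspark.sql.streaming import StreamingQueryListener

class RowCountListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"query started: {event.id}")

    def onQueryProgress(self, event):
        # numInputRows comes from the built-in streaming progress metrics,
        # so no extra count() job is needed
        p = event.progress
        print(f"batch {p.batchId}: {p.numInputRows} rows")

    def onQueryTerminated(self, event):
        print(f"query terminated: {event.id}")

# register the listener before starting the query;
# in a notebook, spark is the session provided by the runtime
spark.streams.addListener(RowCountListener())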


If you still want to send the output using print(), consider adding .awaitTermination() as the last chained statement:

df.writeStream
  .format("delta")
  .foreachBatch(WriteStreamToDelta)
  .start()
  .awaitTermination()