How do I print/log output in a foreachBatch function?
Using table streaming, I am trying to write a stream using foreachBatch:
df.writeStream
    .format("delta")
    .foreachBatch(WriteStreamToDelta)
    ...
WriteStreamToDelta looks like this:
def WriteStreamToDelta(microDF, batch_id):
    microDFWrangled = microDF."some_transformations"
    print(microDFWrangled.count())  # <-- How do I achieve the equivalent of this?
    microDFWrangled.write...
I would like to view the number of rows:
- in the notebook, below the writeStream cell
- in the driver log
- appended to a list, one count per micro-batch
1 Answer
Running microDFWrangled.count() in each micro-batch is going to be a bit expensive. I believe it is much more efficient to involve a StreamingQueryListener, which can send output to the console, to the driver logs, to an external database, etc. StreamingQueryListener is efficient because it uses internal streaming statistics, so there is no need to run extra computation just to get the record count.
However, in the case of PySpark, this feature works in Databricks starting from DBR 11.0. In OSS Spark it is available only in recent releases; the Python StreamingQueryListener API was added in Spark 3.4.0.
Reference: https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
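A minimal sketch of such a listener (assuming PySpark 3.4+ or DBR 11.0+; the RowCountListener name and the row_counts list are illustrative, and spark is the notebook's active SparkSession):

from pyspark.sql.streaming import StreamingQueryListener

class RowCountListener(StreamingQueryListener):
    def __init__(self):
        self.row_counts = []  # one entry per micro-batch

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # numInputRows is taken from Spark's internal progress metrics,
        # so no extra count() job is triggered
        n = event.progress.numInputRows
        self.row_counts.append(n)
        print(f"batch {event.progress.batchId}: {n} rows")  # visible in the driver log

    def onQueryTerminated(self, event):
        pass

listener = RowCountListener()
spark.streams.addListener(listener)

While the query runs, listener.row_counts accumulates the per-batch counts the question asks for.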
If you still want to send the output using print(), consider adding .awaitTermination() as the last chained statement:
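The code that the colon introduces appears to have been cut off in the page scrape; based on the snippet in the question, it was presumably along these lines:

(df.writeStream
    .format("delta")
    .foreachBatch(WriteStreamToDelta)
    .start()
    .awaitTermination())  # blocks the cell, so the per-batch print() output keeps appearing while the query runs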