PySpark streaming: low CPU utilization
I'm trying to run some processing on a stream with Spark Streaming in Python.
The stream emits 2000 data points per second, and the processing is supposed to produce a result every 4 seconds (slideDuration=4), but in practice it is delayed considerably (each result is printed after roughly 30 seconds).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two worker threads and a batch interval of 2 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 2)
ssc.checkpoint("./checkpoints")  # checkpointing is required when using an inverse reduce function

lines = ssc.socketTextStream("localhost", 9000)
# Weight each point by 1/40000 (presumably 20 s window * 2000 points/s), so the
# windowed sum per key is that key's share of the window
lines = lines.map(lambda x: (x, 1 / 40000))
# Incremental windowed sum: add points entering the 20 s window, subtract
# points leaving it, recomputed every 4 s
lines = lines.reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b,
                                   windowDuration=20, slideDuration=4)
# Keep only keys that account for more than 3% of the window
lines = lines.filter(lambda x: x[1] > 0.03)
lines.pprint()

ssc.start()
ssc.awaitTermination()
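
For reference, the stream can be reproduced with a simple TCP server feeding port 9000 at roughly the rate above. This is only a stand-in: the payload format (one small random integer per line) is made up for the example.

import random
import socket
import time

# Stand-in data source: serves one connection on localhost:9000 and emits
# ~2000 newline-terminated data points per second in batches of 100.
HOST, PORT = "localhost", 9000
RATE = 2000   # target data points per second
BATCH = 100   # points written per sleep interval

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen(1)
    conn, _ = server.accept()
    with conn:
        while True:
            # 10 distinct keys, so each averages ~10% of a window and
            # survives the 3% filter in the job above
            payload = "".join("%d\n" % random.randint(0, 9) for _ in range(BATCH))
            conn.sendall(payload.encode())
            time.sleep(BATCH / RATE)  # throttle to ~RATE points per second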
The CPU utilization, though, sits at around 40% and does not go higher.