PySpark Streaming: low CPU utilisation

Posted on 2025-01-17 07:34:42

I'm trying to run some processing on a stream with Spark Streaming in Python.
The stream emits 2000 data points per second, and the processing is supposed to produce a result every 4 seconds (slideDuration=4), but in practice it is heavily delayed: each result is printed only after about 30 seconds.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two worker threads and a batch interval of 2 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 2)
ssc.checkpoint("./checkpoints")  # set checkpoint directory

lines = ssc.socketTextStream("localhost", 9000)
lines = lines.map(lambda x : (x, 1/40000))
lines = lines.reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, windowDuration=20, slideDuration=4)
lines = lines.filter(lambda x : x[1] > 0.03)

lines.pprint()
ssc.start()
ssc.awaitTermination()

However, CPU utilisation stays at around 40% and never goes higher.
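
For reference, here is a rough sketch of how the numbers in the snippet relate to each other. It only restates the arithmetic implied by the question (2000 records per second, a 2-second batch interval, a 20-second window sliding every 4 seconds, and a 0.03 threshold). The constant and variable names are illustrative and are not part of the Spark API.

```python
# Figures taken from the question; all names below are illustrative only.
RATE_PER_SECOND = 2000   # data points arriving per second
BATCH_INTERVAL = 2       # seconds, second argument to StreamingContext
WINDOW_DURATION = 20     # seconds, windowDuration
SLIDE_DURATION = 4       # seconds, slideDuration
THRESHOLD = 0.03         # cut-off used in the filter step

# Spark Streaming expects window and slide durations to be
# multiples of the batch interval.
assert WINDOW_DURATION % BATCH_INTERVAL == 0
assert SLIDE_DURATION % BATCH_INTERVAL == 0

records_per_batch = RATE_PER_SECOND * BATCH_INTERVAL     # records in one batch
records_per_window = RATE_PER_SECOND * WINDOW_DURATION   # records in one full window
batches_per_window = WINDOW_DURATION // BATCH_INTERVAL   # batches covered by a window
batches_per_slide = SLIDE_DURATION // BATCH_INTERVAL     # batches between two results

# Each record is mapped to 1/40000, so a key's windowed sum is
# its share of a full window at the stated input rate.
weight_per_record = 1 / records_per_window

# Number of occurrences per window at which a key reaches the threshold.
occurrences_at_threshold = THRESHOLD * records_per_window

print("records per batch:", records_per_batch)
print("records per window:", records_per_window)
print("batches per window:", batches_per_window)
print("batches per slide:", batches_per_slide)
print("occurrences at threshold:", round(occurrences_at_threshold))
```

Under these assumptions, every result covers ten batches, a new result is due after every second batch, and a key has to appear more than about 1200 times within the 20-second window to pass the filter.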
