Does an Apache Spark cache also apply to derived DataFrames?

Posted on 2025-01-29 08:01:05

I'm doing some work using Apache Spark, but I'm not sure whether the dataframe "frame3" will use the cached data from "frame1" or not. The code below describes the scenario conceptually:

frame1 = spark.read.csv("hdfs:....")
frame1.cache()
frame2 = frame1.select("name", "price").filter("price > 20")
frame2.show()  # Data is being cached, so this action takes longer
frame2.show()  # Data has been cached, so this action takes a short amount of time
frame3 = frame2.select("name", "price").filter("price > 30")
frame3.show()  # Does this action use the cached data from frame1, given that frame2 was built from frame1?

Does anyone have any thoughts?

Thanks,
Aurora

Comments (1)

锦上情书 2025-02-05 08:01:05

In the scenario above, frame3 will execute the frame2 transformations. However, instead of reading the data from the CSV file again, it will use the cached version of the dataframe frame1.

Spark uses lazy evaluation so that it can optimize the whole plan. Until you call an action, none of the transformations are executed. This works well when you apply multiple transformations to a single dataframe.

However, when a single transformed dataframe is referenced in multiple other places, it's a good idea to cache that dataframe. Having said that, in the example above I don't see frame1 being referenced anywhere else, so it wouldn't make sense to cache it (unless this is just an example for the sake of understanding).

Note: updating the answer as per a comment, since I missed some important information: a dataframe is not actually cached until an action is performed on it.
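To make the two ideas above (lazy evaluation and caching that only happens on the first action) more concrete, here is a toy model in plain Python. It is not how Spark is implemented; LazyFrame, read_source and the sample rows are hypothetical names and data made up for this sketch. It only mimics the behaviour: transformations record work, an action runs it, and a frame marked with cache() keeps its rows after the first action so that derived frames reuse them.

```python
class LazyFrame:
    """Toy stand-in for a dataframe: work is recorded and only run by an action."""

    def __init__(self, compute):
        self._compute = compute          # zero-argument function producing rows
        self._cache_requested = False
        self._cached_rows = None

    def cache(self):
        # Like cache() in the question: only sets a flag, computes nothing.
        self._cache_requested = True
        return self

    def filter(self, predicate):
        # Transformation: returns a new lazy frame without touching any rows.
        return LazyFrame(lambda: [r for r in self._rows() if predicate(r)])

    def collect(self):
        # Action: this is the point where the recorded work actually runs.
        return list(self._rows())

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows     # reuse the stored rows
        rows = self._compute()
        if self._cache_requested:
            self._cached_rows = rows     # stored on first use, not at cache()
        return rows


reads = {"count": 0}  # counts how often the "file" is read


def read_source():
    reads["count"] += 1
    return [("a", 10), ("b", 25), ("c", 40)]


frame1 = LazyFrame(read_source).cache()
frame2 = frame1.filter(lambda r: r[1] > 20)
frame3 = frame2.filter(lambda r: r[1] > 30)

reads_before_action = reads["count"]     # nothing has been read yet
rows2 = frame2.collect()                 # first action: reads and caches frame1
rows3 = frame3.collect()                 # reuses the rows cached for frame1

print(reads_before_action, rows2, rows3, reads["count"])
```

In this model the source is read once even though two actions run, which is the same effect the question observes between the first and second frame2.show().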
