In pyspark, how to select n rows of a DataFrame without scanning the whole table
I'm using pyspark and want to show the user a preview of a (very large, e.g. 10 million row) table. For example, the user could see 5000 rows of the table (first/last/random, any 5000 rows are fine). What is the fastest way to get n rows from the table? I have tried limit and sample, but these functions still scan the whole table; the time complexity is O(N), which takes a lot of time.

spark.sql('select * from some_table').limit(N)

Can someone help me?
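For reference, a minimal sketch of what the two attempts mentioned above typically look like in pyspark (the table name some_table comes from the snippet above; the SparkSession setup, the 5000-row target, and the fraction value are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table('some_table')  # hypothetical table handle

# limit(n) returns a DataFrame with at most n rows
preview_limit = df.limit(5000)

# sample(fraction) returns an approximate random fraction of the rows
preview_sample = df.sample(fraction=5000 / 10_000_000, seed=42)

# nothing is actually read until an action such as show() runs
preview_limit.show(20)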
Comments (1)
Since you are making a SQL call from Python, this is by far the easiest solution, and it's fast. I don't think it scans the whole table when you use a SQL call. Assuming your table is already cached, are you sure the delay is caused by scanning the table, or is it caused by materializing the table?

As an alternative, assuming you have a Python dataframe handle, df_some_table, it gets trickier because the .head() and .show() functions return something other than a dataframe, but they can work for peeking at the dataframe.
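A rough sketch of those alternatives, assuming a dataframe handle named df_some_table as above (the 5000-row count is just an example value):

# head(n) and take(n) return a list of Row objects, not a DataFrame
rows = df_some_table.head(5000)

# show(n) only prints the first n rows to stdout and returns None
df_some_table.show(20)

# to keep working with a DataFrame, limit() is the usual choice;
# collecting to the driver (e.g. via toPandas) is what actually materializes it
preview_df = df_some_table.limit(5000)
preview_pdf = preview_df.toPandas()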