In pyspark, how to select n rows of a DataFrame without scanning the whole table
I'm using pyspark and want to show the user a preview of a (very large, e.g. 10 million row) table. For example, the user could see 5000 rows of the table (first/last/random, any 5000 rows are fine). What is the fastest way to get n rows from the table? I have tried limit and sample, but these functions still scan the whole table; the time complexity is O(N), which takes a lot of time.

spark.sql('select * from some_table').limit(N)

Can someone help me?
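For reference, a minimal sketch of what the two attempts mentioned above typically look like in pyspark (the table name some_table comes from the snippet above; the SparkSession setup, the 5000-row target, and the fraction value are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table('some_table')  # hypothetical table handle

# limit(n) returns a DataFrame with at most n rows
preview_limit = df.limit(5000)

# sample(fraction) returns an approximate random fraction of the rows
preview_sample = df.sample(fraction=5000 / 10_000_000, seed=42)

# nothing is actually read until an action such as show() runs
preview_limit.show(20)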
Comments (1)
Since you are making a SQL call from Python, this is by far the easiest solution, and it's fast. I don't think it scans the whole table when you use a SQL call. Assuming your table is already cached, are you sure the delay is caused by scanning the table, or is it caused by materializing the table?

As an alternative, assuming you have a Python dataframe handle, df_some_table, it gets trickier because the .head() and .show() functions return something other than a dataframe, but they can work for peeking at the dataframe.
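A rough sketch of those alternatives, assuming a dataframe handle named df_some_table as above (the 5000-row count is just an example value):

# head(n) and take(n) return a list of Row objects, not a DataFrame
rows = df_some_table.head(5000)

# show(n) only prints the first n rows to stdout and returns None
df_some_table.show(20)

# to keep working with a DataFrame, limit() is the usual choice;
# collecting to the driver (e.g. via toPandas) is what actually materializes it
preview_df = df_some_table.limit(5000)
preview_pdf = preview_df.toPandas()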