What syntax is accepted by PySpark's SQL-expression-based filters?
The PySpark documentation for filters says that it accepts "a string of SQL expression".

Is there a reference for the accepted syntax of this parameter? The best I could find is the page about the WHERE clause in the Spark SQL docs. Obviously some examples, like "id > 200", "length(name) > 3", or "id BETWEEN 200 AND 300", would work. But what about others? Filters like "age > (SELECT 42)" seem to work, so I assume nested expressions are OK (a minimal sketch of these filter strings is included after the list below). This just raises more questions:
- What databases can these nested expressions refer to? Is there a way I can create a nested SELECT expression referring to the current dataframe, e.g. to do something like "age > (SELECT avg(age) FROM <current_dataframe>)" as a filter? (I know there are other ways of achieving this, I am only interested in what SQL expressions can do.)
- Are there other, more advanced things that are allowed in filter expressions?
- Finally, is there an online resource explaining this in more detail?
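To make the question concrete, here is a minimal sketch of the filter strings quoted above, run against a throwaway DataFrame. The column names (id, name, age), the sample rows, and the temp-view name "people" are assumptions made for illustration only; the last two lines show one of the "other ways" alluded to in the first bullet (registering a temp view and querying it with spark.sql), not something claimed to be part of what filter() itself accepts.

```python
# A hypothetical sketch of the filter strings discussed in the question.
# Column names (id, name, age), the sample rows, and the view name "people"
# are illustrative assumptions, not taken from a real dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-syntax-sketch").getOrCreate()

df = spark.createDataFrame(
    [(201, "Alice", 34), (150, "Bob", 45), (250, "Carol", 29)],
    ["id", "name", "age"],
)

# Simple comparison, function-call, and BETWEEN expressions from the question.
df.filter("id > 200").show()
df.filter("length(name) > 3").show()
df.filter("id BETWEEN 200 AND 300").show()

# The scalar-subquery form that the question reports as apparently working.
df.filter("age > (SELECT 42)").show()

# One of the "other ways" the question already knows about: register the
# DataFrame as a temporary view so a subquery can refer to it by name.
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > (SELECT avg(age) FROM people)").show()
```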