Presto on (Py)Spark - setup
In my work as a data engineer, most of my data logic runs through:
Presto queries (small to medium datasets and calculations; easiest when working with analysts)
Spark (the big guns, for when the calculations are pretty heavy), mostly in a Python environment.
That said, there are several scenarios where I would rather work smarter with Presto (its data sketch functions) than use Spark.
What I'm trying to do is integrate the Presto framework with PySpark, and I'm failing to set it up correctly at the session level.
I tried to convert the spark-submit example from the Presto docs into a PySpark session builder, without any luck.
/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --executor-cores 4 \
  --conf spark.task.cpus=4 \
  --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
  presto-spark-launcher-0.271.jar \
  --package presto-spark-package-0.271.tar.gz \
  --config /presto/etc/config.properties \
  --catalogs /presto/etc/catalogs \
  --catalog hive \
  --schema default \
  --file query.sql
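For reference, here is a minimal sketch of how the same launcher invocation could be assembled from a Python environment. It assumes the paths, jar names and version from the command above; the helper name build_presto_spark_command is hypothetical. It only builds the argument list for spark-submit and does not configure a SparkSession.

```python
import shlex
from typing import List


def build_presto_spark_command(
    query_file: str = "query.sql",
    catalog: str = "hive",
    schema: str = "default",
    version: str = "0.271",
) -> List[str]:
    """Assemble the spark-submit argument list from the documented example.

    Hypothetical helper: every path and file name below is copied from the
    command in the question and may differ on another cluster.
    """
    return [
        "/spark/bin/spark-submit",
        "--master", "spark://spark-master:7077",
        "--executor-cores", "4",
        "--conf", "spark.task.cpus=4",
        "--class", "com.facebook.presto.spark.launcher.PrestoSparkLauncher",
        f"presto-spark-launcher-{version}.jar",
        # Everything after the jar is an argument to the Presto launcher itself.
        "--package", f"presto-spark-package-{version}.tar.gz",
        "--config", "/presto/etc/config.properties",
        "--catalogs", "/presto/etc/catalogs",
        "--catalog", catalog,
        "--schema", schema,
        "--file", query_file,
    ]


if __name__ == "__main__":
    cmd = build_presto_spark_command()
    # Print a shell-quoted form for inspection only; nothing is submitted here.
    print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) on a machine with Spark installed would submit the job. Note that this runs the Presto query as its own Spark application, not inside an existing PySpark session.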
I loaded the JSONs successfully:
[screenshot: config snap]
But when I try to run spark.sql with Presto syntax, it still fails.
What am I missing?
I was following this documentation:
https://prestodb.io/docs/current/installation/spark.html
This issue on the official GitHub repository:
https://github.com/prestodb/presto/issues/13856
I also looked for more information around the web (there isn't much):
https://medium.com/@ravishankar.nair/the-ultimate-duo-in-distributed-computing-prestodb-running-on-spark-b63d0e567eeb