Installing Chromedriver on a Google Dataflow job
I want to migrate a containerized Python scraping application to an Apache Beam pipeline that I can run on Dataflow. My scraping application uses two scraping methods: a curl response and Selenium with Chromedriver.
When running the application locally, everything works fine: both scraping methods succeed because I have Chromedriver installed on my local machine.
The issue is on Dataflow. I know that Google Dataflow is serverless, so I'm wondering: is there a way to install Chromedriver on the Dataflow workers when running my pipeline?
When I deploy my pipeline without the driver, the error looks like this: selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home [while running 'ParDo(ScrapeContent)-ptransform-47']
The solution for this is to use custom containers in Dataflow.
You configure a Docker container in which you install apache-beam[gcp], Chromedriver, and all your other requirements.
After that you build your image:
gcloud builds submit . --tag gcr.io/$PROJECT/$REPO:$TAG
And when you submit your job to Dataflow, you pass that image to the workers with the --sdk_container_image pipeline option.
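A minimal sketch of such a Dockerfile, assuming a Beam Python SDK base image and Debian's chromium packages (the base-image tag and package names here are assumptions — adjust them to your SDK version, and make sure the browser and driver versions match):

```dockerfile
# Sketch: start from the Beam SDK image so the worker harness is already present.
# The tag is an assumption; use the image matching your Beam/Python version.
FROM apache/beam_python3.9_sdk:2.48.0

# Install Chromium and its matching driver from Debian's repositories.
# chromium-driver places the chromedriver binary on PATH (/usr/bin/chromedriver).
RUN apt-get update \
    && apt-get install -y --no-install-recommends chromium chromium-driver \
    && rm -rf /var/lib/apt/lists/*

# Install your Python dependencies, including selenium itself.
COPY requirements.txt .
RUN pip install --no-cache-dir apache-beam[gcp] -r requirements.txt
```

In your ParDo you would then point Selenium at the Chromium binary installed above (e.g. via ChromeOptions), typically with headless mode enabled since workers have no display.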
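A sketch of what that submission could look like; the script name, bucket, and region are placeholders, and on older Beam SDKs custom containers also require Runner v2 (newer SDKs use it by default):

```shell
# Sketch of a Dataflow submission using the custom container.
# pipeline.py, $BUCKET, and the region are placeholders for your own setup.
python pipeline.py \
  --runner=DataflowRunner \
  --project=$PROJECT \
  --region=us-central1 \
  --temp_location=gs://$BUCKET/tmp \
  --sdk_container_image=gcr.io/$PROJECT/$REPO:$TAG \
  --experiments=use_runner_v2
```

With this, every Dataflow worker starts from your image, so chromedriver is already on PATH when ParDo(ScrapeContent) runs.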