Can I make conda solve this environment faster?
I am using geopandas in a module that is run through GitLab CI, and the environment solving step takes forever: around 30 minutes of solving for 2 minutes of running the job.
In each CI job:
- a container with the ad hoc image is started
- a conda environment is created with the dependencies needed for the package
- the package is installed and a script is run
Of course, I could create a specific image for this job and go through the burden of solving only once but this means dependencies would be frozen... and this is not the expected behavior.
As recommended in the geopandas documentation, I use the conda-forge channel.
Here is the environment file:
name: my_package
channels:
- conda-forge
dependencies:
- conda-forge::python
- conda-forge::numpy
- conda-forge::pandas
- conda-forge::geopandas
- conda-forge::geopy
- conda-forge::pyarrow
- conda-forge::scikit-learn
- conda-forge::matplotlib
- conda-forge::coverage
- conda-forge::shapely
- conda-forge::intake
- conda-forge::pytest
- conda-forge::sphinx
- conda-forge::pysmb
- conda-forge::xlrd
- conda-forge::openpyxl
- conda-forge::sphinx_rtd_theme
Any idea on how to speed up environment solving?
Comments (3)
I agree with @Olsgaard's suggestion that it's worth considering a redesign of the CI workflow to decouple the image generation from the testing phase. However, that doesn't technically "speed up environment solving" as was asked.
For faster solves:
- Use Mamba, as @FlyingTeller mentioned. This provides fast solving by using a compiled SAT solver rather than Python.
- At least pin the python version, e.g., python=3.9. Consider also adding minimum versions for DAG "hubs" like numpy, pandas, etc. This would vastly reduce the solution space.
There are a few paths to solve this. What you could do is have the CI pipeline run 3 steps
As long as steps b and c run in parallel, the image creation won't hinder your tests, and since you are always updating your environment, step a will run much faster. You can add logic in step b to make sure it only builds a new image when needed.
Use mambaforge: install it from the .sh file and use mamba to install the conda packages. This should reduce your setup time to minutes or less.
https://github.com/conda-forge/miniforge
Something like:
Then I think something like:
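A minimal sketch of that kind of setup as a GitLab CI job, under several assumptions: the job name, the base image, the install prefix and the final pytest call are hypothetical, and the installer URL assumes the current Miniforge3 installer from the repository linked above, which ships mamba. It is not the answerer's exact snippet.

```yaml
# .gitlab-ci.yml (sketch): job name, image and paths are hypothetical
test:
  image: debian:bookworm-slim
  before_script:
    # Tools needed to download the installer
    - apt-get update && apt-get install -y --no-install-recommends curl ca-certificates
    # Download the installer and run it in batch mode (-b) into a fixed prefix (-p)
    - curl -L -o miniforge.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh"
    - bash miniforge.sh -b -p "$HOME/miniforge"
    - export PATH="$HOME/miniforge/bin:$PATH"
    # Solve and create the environment with mamba instead of conda
    - mamba env create --file environment.yml
  script:
    # Run the tests inside the environment named in environment.yml
    - conda run -n my_package pytest
```

Using conda run avoids having to initialise shell activation inside the job; activating the environment explicitly would work as well.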