Difference in package lists between the EMR master node and the EMR notebook

Posted on 2025-02-09 06:39:41

I have an EMR cluster up and running, with a Jupyter notebook on it that uses the PySpark kernel.

I can SSH into the master node and install Python packages there easily, for example:

pip install pandas

I can then verify the installation with pip freeze.

However, when I run sc.list_packages() in the PySpark notebook, I see a different list of packages. Some packages have a different version than on the master node, and some (such as pandas) do not appear at all.

Here is the output of pip freeze on the master node over SSH:

aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.1
boto==2.49.0
click==7.1.2
Cython==0.29.30
docutils==0.14
jmespath==0.10.0
joblib==0.15.1
lockfile==0.11.0
lxml==4.5.1
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.6
pandas==1.3.5
py-dateutil==2.2
py4j==0.10.9.5
pybind11==2.9.2
pyspark==3.3.0
pystache==0.5.4
python-daemon==2.2.3
python-dateutil==2.8.2
python37-sagemaker-pyspark==1.3.0
pytz==2020.1
PyYAML==5.3.1
regex==2020.6.8
scipy==1.7.3
simplejson==3.2.0
six==1.13.0
soupsieve==1.9.5
tqdm==4.46.1
windmill==1.6

And here is the package list in the PySpark notebook using sc.list_packages():

aws-cfn-bootstrap (2.0)
beautifulsoup4 (4.9.1)
boto (2.49.0)
click (7.1.2)
docutils (0.14)
jmespath (0.10.0)
joblib (0.15.1)
lockfile (0.11.0)
lxml (4.5.1)
mysqlclient (1.4.2)
nltk (3.5)
nose (1.3.4)
numpy (1.16.5)
pip (9.0.1)
py-dateutil (2.2)
pystache (0.5.4)
python-daemon (2.2.3)
python37-sagemaker-pyspark (1.3.0)
pytz (2020.1)
PyYAML (5.3.1)
regex (2020.6.8)
setuptools (28.8.0)
simplejson (3.2.0)
six (1.13.0)
soupsieve (1.9.5)
tqdm (4.46.1)
UNKNOWN (1.3.5)
wheel (0.29.0)
windmill (1.6)

DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
You are using pip version 9.0.1, however version 22.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Note that pandas, scipy and pip are different. Why are they different? How do I upgrade or update the list in the PySpark notebook?
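
To pin down exactly which entries differ between the two listings, a small sketch like the following may help. It is only an illustration: the function names parse_packages and compare_packages are hypothetical, and the sample text is a shortened excerpt of the two listings above. It assumes the `name==version` format from pip freeze and the legacy `name (version)` format printed by sc.list_packages().

```python
import re

# Matches "name==version" (pip freeze) or "name (version)" (legacy pip list).
LINE_PATTERN = re.compile(r"^([A-Za-z0-9_.\-]+)\s*(?:==|\()\s*([^)\s]+)\)?$")


def parse_packages(text):
    """Return {lowercased package name: version}; non-matching lines are skipped."""
    packages = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            packages[match.group(1).lower()] = match.group(2)
    return packages


def compare_packages(master, notebook):
    """Return (only on master, only in notebook, {name: (master ver, notebook ver)})."""
    only_master = sorted(set(master) - set(notebook))
    only_notebook = sorted(set(notebook) - set(master))
    changed = {
        name: (master[name], notebook[name])
        for name in sorted(set(master) & set(notebook))
        if master[name] != notebook[name]
    }
    return only_master, only_notebook, changed


# Shortened excerpts of the two listings from the question.
master_text = """numpy==1.21.6
pandas==1.3.5
scipy==1.7.3
six==1.13.0"""

notebook_text = """numpy (1.16.5)
pip (9.0.1)
six (1.13.0)
UNKNOWN (1.3.5)
You are using pip version 9.0.1, however version 22.1.2 is available."""

only_master, only_notebook, changed = compare_packages(
    parse_packages(master_text), parse_packages(notebook_text)
)
print("Only on master node:", only_master)
print("Only in notebook:", only_notebook)
print("Different versions:", changed)
```

Pasting the full listings into master_text and notebook_text gives the complete difference; warning lines such as the pip upgrade notice are ignored because they do not match either format.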

Comments (1)

千紇 2025-02-16 06:39:41

Log into the master node and run sudo docker ps -a. You should see a container named something like emr/jupyter-notebook:6.0.3. That container is where your Jupyter notebook is running; it is not running directly on the master node.

If you install packages on the master node, the Jupyter notebook will not see them, which is why your package lists do not match. To install packages for the Jupyter notebook, I use a requirements file listing the packages I want and invoke a bootstrap action script that installs them. One important detail: if you pin a package version, it must be supported by the Python version running in the container. To find out which version that is, run the following in the Jupyter notebook:

import sys
print(sys.version)
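
As a slightly fuller check, the sketch below reports which interpreter is actually executing the code and whether a few packages can be imported from it. This is only an illustration under assumptions: the helper names describe_environment and is_importable are hypothetical, and the package names are taken from the question. Running it once in a notebook cell and once in a python shell over SSH should show whether the two are using the same Python installation.

```python
import importlib.util
import sys


def describe_environment():
    """Collect basic facts about the interpreter running this code."""
    return {
        "executable": sys.executable,  # path of the running Python binary
        "version": "%d.%d.%d" % sys.version_info[:3],
        "prefix": sys.prefix,  # root of the Python installation or environment
    }


def is_importable(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


info = describe_environment()
for key, value in info.items():
    print(key, "=", value)

# Package names taken from the question; adjust as needed.
for name in ("pandas", "numpy", "scipy"):
    print(name, "importable:", is_importable(name))
```

If the executable or prefix differs between the two places, a package installed with pip in one of them will not be visible in the other.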

To find the latest package versions that work with a specific version of Python, I highly recommend using Anaconda. For example:

conda create --name requests python=3.7.9 matplotlib

will tell me the latest version of matplotlib that works with Python 3.7.9.
