Different package lists on the EMR master node and in the EMR notebook
I have an EMR cluster up and running. On it, I have a Jupyter notebook with a pyspark kernel.
I can SSH into the master node and easily install Python packages there, for example:
pip install pandas
which I can then verify with pip freeze.
However, when I go to the pyspark notebook and run sc.list_packages(), I see a different list of packages. Some packages have a different version than on the master node, and some (such as pandas) do not appear at all.
Here is the output of pip freeze on the master node over SSH:
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.1
boto==2.49.0
click==7.1.2
Cython==0.29.30
docutils==0.14
jmespath==0.10.0
joblib==0.15.1
lockfile==0.11.0
lxml==4.5.1
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.6
pandas==1.3.5
py-dateutil==2.2
py4j==0.10.9.5
pybind11==2.9.2
pyspark==3.3.0
pystache==0.5.4
python-daemon==2.2.3
python-dateutil==2.8.2
python37-sagemaker-pyspark==1.3.0
pytz==2020.1
PyYAML==5.3.1
regex==2020.6.8
scipy==1.7.3
simplejson==3.2.0
six==1.13.0
soupsieve==1.9.5
tqdm==4.46.1
windmill==1.6
And here is the package list in the PySpark notebook from sc.list_packages():
aws-cfn-bootstrap (2.0)
beautifulsoup4 (4.9.1)
boto (2.49.0)
click (7.1.2)
docutils (0.14)
jmespath (0.10.0)
joblib (0.15.1)
lockfile (0.11.0)
lxml (4.5.1)
mysqlclient (1.4.2)
nltk (3.5)
nose (1.3.4)
numpy (1.16.5)
pip (9.0.1)
py-dateutil (2.2)
pystache (0.5.4)
python-daemon (2.2.3)
python37-sagemaker-pyspark (1.3.0)
pytz (2020.1)
PyYAML (5.3.1)
regex (2020.6.8)
setuptools (28.8.0)
simplejson (3.2.0)
six (1.13.0)
soupsieve (1.9.5)
tqdm (4.46.1)
UNKNOWN (1.3.5)
wheel (0.29.0)
windmill (1.6)
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
You are using pip version 9.0.1, however version 22.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Note that pandas, scipy and pip are different. Why are they different? How do I upgrade or update the packages in the PySpark notebook?
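To see exactly which entries differ, the two listings can be compared programmatically. The following is a minimal sketch, assuming the master node output is in pip freeze format (name==version) and the notebook output is in the legacy pip list format (name (version)) shown above. The function names are hypothetical, and the sample data is only a small subset of the two listings.

```python
import re

# Matches legacy `pip list` lines such as "numpy (1.16.5)".
LEGACY_LINE = re.compile(r"^(\S+) \(([^)]+)\)$")


def parse_freeze(text):
    """Parse `pip freeze` lines like 'numpy==1.21.6' into {name: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line:
            name, version = line.split("==", 1)
            packages[name.lower()] = version
    return packages


def parse_legacy_list(text):
    """Parse legacy `pip list` lines into {name: version}; other lines are skipped."""
    packages = {}
    for line in text.splitlines():
        match = LEGACY_LINE.match(line.strip())
        if match:
            packages[match.group(1).lower()] = match.group(2)
    return packages


def diff_packages(master, notebook):
    """Return names only on the master, names only in the notebook, and version changes."""
    only_master = sorted(set(master) - set(notebook))
    only_notebook = sorted(set(notebook) - set(master))
    changed = {
        name: (master[name], notebook[name])
        for name in sorted(set(master) & set(notebook))
        if master[name] != notebook[name]
    }
    return only_master, only_notebook, changed


# Sample data: a few lines copied from the two listings above.
master = parse_freeze("""numpy==1.21.6
pandas==1.3.5
scipy==1.7.3
six==1.13.0""")
notebook = parse_legacy_list("""numpy (1.16.5)
pip (9.0.1)
six (1.13.0)
UNKNOWN (1.3.5)""")

only_master, only_notebook, changed = diff_packages(master, notebook)
print("only on master node:", only_master)
print("only in notebook:", only_notebook)
print("different versions:", changed)
```

Pasting the full listings into the two parsers gives the complete comparison; warning lines such as the pip DEPRECATION notice are ignored because they match neither format.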
Comments (1)
Log into the master node and run sudo docker ps -a. You should see a container named something like emr/jupyter-notebook:6.0.3, and that is where your Jupyter notebook is running; it is not running on the master node itself. If you install packages on the master node, the Jupyter notebook will not see them. This is why your package lists do not match.
To install packages for the Jupyter notebook, I use a requirements file listing the packages I want and invoke a bootstrap action script that installs them. An important detail: if you pin a package version, it must be supported by the Python version running in the container. To find out which version that is, check it from a cell in the Jupyter notebook.
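The snippet the answer refers to is not preserved here. As a minimal sketch, assuming only the Python standard library, a cell like the following reports the version and location of the interpreter that executes it:

```python
import platform
import sys

# Full version string of the interpreter executing this cell.
print(sys.version)

# Short "major.minor.patch" form, handy when checking which
# package versions support this Python.
python_version = platform.python_version()
print(python_version)

# Path of the interpreter, which helps confirm which environment
# the cell is actually running in.
print(sys.executable)
```

Running the same lines in an SSH session on the master node and comparing the output is a quick way to confirm that the two environments are different.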
To find the latest package versions that work with a specific version of Python, I highly recommend using Anaconda. For example, it will tell me the latest version of matplotlib that works with Python 3.7.9.
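The exact command used in the answer is not preserved here. One possible way to ask conda this question is a dry run of conda create, which resolves the requested packages and prints the versions it would install without creating anything. The sketch below only builds and prints that command. The helper names and the environment name version-probe are hypothetical, and it assumes conda is installed on the machine where it is run.

```python
import shutil
import subprocess


def conda_dry_run_command(package, python_version, env_name="version-probe"):
    """Build a `conda create --dry-run` command; a dry run installs nothing."""
    return [
        "conda", "create", "--dry-run",
        "--name", env_name,           # hypothetical throwaway environment name
        f"python={python_version}",   # pin the interpreter version of interest
        package,                      # let the solver pick a compatible version
    ]


def run_dry_run(package, python_version):
    """Run the dry run and return conda's text output, or None if conda is missing."""
    if shutil.which("conda") is None:
        return None  # conda is not on PATH
    result = subprocess.run(
        conda_dry_run_command(package, python_version),
        capture_output=True,
        text=True,
    )
    return result.stdout


command = conda_dry_run_command("matplotlib", "3.7.9")
print(" ".join(command))
# To execute it where conda is available: print(run_dry_run("matplotlib", "3.7.9"))
```

The package plan in conda's output lists the matplotlib version the solver selected for that Python version, which can then be pinned in the requirements file.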