How do I connect to HDFS from a Docker container?

My goal is to read a file from HDFS in Airflow and do further manipulations.

After researching, I found that the URL I need to use is as follows:

df = pd.read_parquet('http://localhost:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN'),

where localhost / 172.20.80.1 / computer-name.mshome.net can be used interchangeably,

9870 is the namenode port,

hadoop_files/sample_2022_01.parquet is my folder and file, created in the HDFS root.
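For reference, a minimal reachability check against the same WebHDFS endpoint (same host, port, and path as above) can be run before attempting the pandas read:

import json
import urllib.request

# List the target directory through the WebHDFS REST API; if this call fails,
# the problem is network reachability rather than pandas or Parquet.
url = 'http://localhost:9870/webhdfs/v1/hadoop_files?op=LISTSTATUS'
with urllib.request.urlopen(url, timeout=10) as resp:
    print(json.dumps(json.load(resp), indent=2))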

I can access and read the file locally in PyCharm, but I am unable to get the same result inside Airflow in Docker. I tried both a local HDFS and an HDFS hosted in Docker, and changing the host to host.docker.internal, but I am getting the same error.

Stack trace:

[2022-06-12, 17:52:45 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/urllib/request.py", line 1350, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/usr/local/lib/python3.7/http/client.py", line 948, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/local/lib/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/usr/local/lib/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 207, in execute
    branch = super().execute(context)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/includes/parquet_dag/main.py", line 15, in main
    df_parquet = read('hdfs://localhost:9000/hadoop_files/sample_2022_01.parquet')
  File "/opt/airflow/dags/includes/parquet_dag/utils.py", line 29, in read
    df = pd.read_parquet('http://172.20.80.1:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN')
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 500, in read_parquet
    **kwargs,
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 236, in read
    mode="rb",
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 102, in _get_path_or_handle
    path_or_handle, mode, is_text=False, storage_options=storage_options
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 614, in get_handle
    storage_options=storage_options,
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 312, in _get_filepath_or_buffer
    with urlopen(req_info) as req:
  File "/home/airflow/.local/lib/python3.7/site-packages/pandas/io/common.py", line 212, in urlopen
    return urllib.request.urlopen(*args, **kwargs)
  File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/local/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 1378, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/local/lib/python3.7/urllib/request.py", line 1352, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 113] No route to host>

With host.docker.internal:

urllib.error.URLError: <urlopen error [Errno 99] Cannot assign requested address>

凉城已无爱 2025-02-13 21:41:45

where localhost / 172.20.80.1 / computer-name.mshome.net can be used interchangeably,

They are not interchangeable inside the Docker network.

From Airflow, you should use Docker service names, not IP addresses, and ensure the containers are in the same bridge network (not host mode, which only works on Linux). host.docker.internal isn't correct either, since you're trying to reach another container, not your host.
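For example, a minimal sketch assuming the namenode runs as a Compose service named namenode (a hypothetical name) on the same user-defined bridge network as the Airflow containers:

import pandas as pd

# 'namenode' is a hypothetical Docker Compose service name; containers on the
# same user-defined bridge network resolve each other by service name.
df = pd.read_parquet(
    'http://namenode:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN'
)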

https://docs.docker.com/network/bridge/

I'd also recommend using the Airflow Spark operators and reading the Parquet from HDFS with Spark, rather than pandas over WebHDFS. You can convert Spark DataFrames to pandas if needed.
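A minimal PySpark sketch of that approach, reusing the hypothetical namenode service name and the hdfs://...:9000 path from the question:

from pyspark.sql import SparkSession

# Read the Parquet file with Spark's native HDFS client instead of WebHDFS.
spark = SparkSession.builder.appName('read_hdfs_parquet').getOrCreate()
sdf = spark.read.parquet('hdfs://namenode:9000/hadoop_files/sample_2022_01.parquet')
pdf = sdf.toPandas()  # convert to pandas only if the result fits in driver memory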

盗心人 2025-02-13 21:41:44

You need to use an address that is routable from inside the Airflow Docker container.

If Hadoop is inside a Docker container as well, check its IP address using docker inspect CONTAINER (doc). If Hadoop is on localhost, you can set network_mode: "host" (doc).
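As a sketch, suppose docker inspect reports 172.18.0.2 for the Hadoop container (a hypothetical value; bridge IPs can change across container restarts, so a service name is usually more robust):

import pandas as pd

namenode_ip = '172.18.0.2'  # hypothetical address from: docker inspect CONTAINER
df = pd.read_parquet(
    f'http://{namenode_ip}:9870/webhdfs/v1/hadoop_files/sample_2022_01.parquet?op=OPEN'
)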

Also, there is an important caveat if you are on macOS and use the Docker Desktop app, which is basically a virtual machine. In that case you need some extra settings; check this, for example.
