Python beautifulsoup urllib

urllib is mangling my URLs

Posted 2024-12-16 18:54:53

I'm writing a little scraper. Here's the code so far.

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

soup = BeautifulSoup(
    urlopen('http://www.high-rely.com/HR3/includes/ProductFamily.php').read()
    )

links = soup.findAll('a', 'visible_link')

hrefs = ['www.high-rely.com' + relative for relative in [x['href'] for x in links]]

subpages = map(BeautifulSoup, [urlopen(x).read() for x in hrefs])

When I run it though, I get the following error.

Traceback (most recent call last):
  File "C:/Users/josh.SCL/Desktop/Scraper.py", line 13, in <module>
    subpages = map(BeautifulSoup, [urlopen(x).read() for x in hrefs])
  File "C:\Python27\lib\urllib.py", line 84, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 461, in open_file
    return self.open_local_file(url)
  File "C:\Python27\lib\urllib.py", line 475, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'www.high-rely.com\\HR3\\includes\\products\\5MinOverview.php'

If I loop through hrefs, I get this.

www.high-rely.com/HR3/includes/products/5MinOverview.php
www.high-rely.com/HR3/includes/products/10MinOverview.php
www.high-rely.com/HR3/includes/products/30MinOverview.php
www.high-rely.com/HR3/includes/HighRely/HighRely.php
www.high-rely.com/HR3/includes/HighRely/HighRely.php
www.high-rely.com/HR3/includes/RAIDFrame/RAIDFrame.php
www.high-rely.com/HR3/includes/RAIDFrame/RAIDFrame.php
www.high-rely.com/HR3/includes/MPac/MPac.php
www.high-rely.com/HR3/includes/MPac/MPac.php
www.high-rely.com/HR3/includes/BNAS/BNAS-HRS201.php
www.high-rely.com/HR3/includes/announcements.php

Which is correct. What's going on here?

Comments (1)

季末如歌 2024-12-23 18:54:53

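You forgot to write http://. Without a scheme, urllib treats the string as a local file path, which is why the traceback ends in open_local_file: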

hrefs = ['http://www.high-rely.com' + relative for relative in [x['href'] for x in links]]
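
As an alternative to concatenating the host by hand, the links could be resolved against the page URL with urljoin, which keeps the scheme automatically. The following is only a sketch: the helper name absolutize and the two sample hrefs (taken from the output listed in the question) are illustrative assumptions, and no network request is made. The import falls back to the Python 2.7 module used in the question.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse      # Python 2.7, as in the question

# URL of the page the links were scraped from (from the question).
BASE_URL = 'http://www.high-rely.com/HR3/includes/ProductFamily.php'


def absolutize(base, relative_links):
    """Resolve each relative href against the base page URL (hypothetical helper)."""
    return [urljoin(base, rel) for rel in relative_links]


# Sample relative hrefs, assumed to match what x['href'] returns in the question.
sample_links = [
    '/HR3/includes/products/5MinOverview.php',
    '/HR3/includes/announcements.php',
]

hrefs = absolutize(BASE_URL, sample_links)

# The string built in the question has no scheme, so nothing marks it as HTTP.
broken = 'www.high-rely.com' + sample_links[0]
broken_scheme = urlparse(broken).scheme   # empty string
fixed_scheme = urlparse(hrefs[0]).scheme  # 'http'

for href in hrefs:
    print(href)
```

Checking urlparse(x).scheme before calling urlopen is a cheap way to catch this kind of mistake early.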
