urllib is mangling my URLs

Posted 2024-12-16 18:54:53

I'm writing a little scraper. Here's the code so far.

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

soup = BeautifulSoup(
    urlopen('http://www.high-rely.com/HR3/includes/ProductFamily.php').read()
    )

links = soup.findAll('a', 'visible_link')

hrefs = ['www.high-rely.com' + relative for relative in [x['href'] for x in links]]

subpages = map(BeautifulSoup, [urlopen(x).read() for x in hrefs])

When I run it though, I get the following error.

Traceback (most recent call last):
  File "C:/Users/josh.SCL/Desktop/Scraper.py", line 13, in <module>
    subpages = map(BeautifulSoup, [urlopen(x).read() for x in hrefs])
  File "C:\Python27\lib\urllib.py", line 84, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 461, in open_file
    return self.open_local_file(url)
  File "C:\Python27\lib\urllib.py", line 475, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'www.high-rely.com\\HR3\\includes\\products\\5MinOverview.php'

If I loop through hrefs, I get this.

www.high-rely.com/HR3/includes/products/5MinOverview.php
www.high-rely.com/HR3/includes/products/10MinOverview.php
www.high-rely.com/HR3/includes/products/30MinOverview.php
www.high-rely.com/HR3/includes/HighRely/HighRely.php
www.high-rely.com/HR3/includes/HighRely/HighRely.php
www.high-rely.com/HR3/includes/RAIDFrame/RAIDFrame.php
www.high-rely.com/HR3/includes/RAIDFrame/RAIDFrame.php
www.high-rely.com/HR3/includes/MPac/MPac.php
www.high-rely.com/HR3/includes/MPac/MPac.php
www.high-rely.com/HR3/includes/BNAS/BNAS-HRS201.php
www.high-rely.com/HR3/includes/announcements.php

Which is correct. What's going on here?

Comments (1)

季末如歌 2024-12-23 18:54:53

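You forgot the http:// scheme. Without it, urllib treats the string as a local file path, which is why the traceback ends in open_local_file. Prepend the scheme: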

hrefs = ['http://www.high-rely.com' + relative for relative in [x['href'] for x in links]]
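As an alternative to concatenating the host by hand, the relative links can be resolved against the URL of the page they came from. The following is a minimal sketch, not the poster's code: it assumes Python 3, where urljoin lives in urllib.parse (in the Python 2.7 of the question it would be imported from urlparse), and the names BASE_URL, absolutize, has_scheme and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit

# URL of the index page from the question; relative hrefs are resolved against it.
BASE_URL = 'http://www.high-rely.com/HR3/includes/ProductFamily.php'


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against base so every result carries a scheme and host."""
    return [urljoin(base, href) for href in hrefs]


def has_scheme(url):
    """Return True when the URL names a scheme such as http."""
    return bool(urlsplit(url).scheme)


# Hypothetical sample hrefs, shaped like the root-relative paths in the question.
sample = [
    '/HR3/includes/products/5MinOverview.php',
    '/HR3/includes/announcements.php',
]
urls = absolutize(sample)
```

Checking has_scheme on each entry before fetching would catch a scheme-less string like the ones in the question early, instead of letting it fall through to the local-file handler. urljoin also copes with hrefs that are relative to the current directory or already absolute, which plain string concatenation does not.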