request.urlopen(url) does not return a website response or time out

Posted 2025-02-01 14:10:10

I want to fetch some websites' source for a project. When I try to get a response, the program just gets stuck waiting for it. No matter how long I wait, there is no timeout and no response. Here is my code:

import urllib.request

link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()
writer = open("html.txt", "w")
for line in readedResponse:
    writer.write(str(line))
    writer.write("\n")
writer.close()

When I try other websites, urlopen returns their responses. But when I try to get "eu.mouser.com" and "uk.farnell.com", it does not return their responses. I would even be fine with skipping them, but urlopen does not even raise a timeout. What is the problem there? Is there another way to get a website's source? (Sorry for my bad English)


Comments (2)

我是有多爱你 2025-02-08 14:10:10

The urllib.request.urlopen docs claim that

The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.

without explaining how to find said default. Still, I managed to provoke a timeout by directly passing 5 (seconds) as the timeout:

import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)

gives

socket.timeout: The read operation timed out
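
For reference, the "global default timeout setting" the docs mention is the process-wide socket default, which is None (block forever) unless you change it. Below is a minimal sketch, assuming you want to either set that default once or catch the timeout instead of letting the script hang; the 5-second value is only an illustration:

import socket
import urllib.error
import urllib.request

# The global default consulted by urlopen when no timeout argument is given;
# None means "block indefinitely".
print(socket.getdefaulttimeout())

# Set a process-wide default for newly created sockets.
socket.setdefaulttimeout(5)

url = "https://uk.farnell.com"
try:
    with urllib.request.urlopen(url, timeout=5) as response:
        html = response.read().decode("utf-8", errors="replace")
except socket.timeout:
    print("The read operation timed out")
except urllib.error.URLError as exc:
    # Timeouts during the connection attempt are often wrapped in URLError.
    print(f"Request failed: {exc.reason}")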
我为君王 2025-02-08 14:10:10

There are some sites that protect themselves from automated crawlers by implementing mechanisms that detect such bots. These can be very diverse and also change over time. If you really want to do everything you can to get the page crawled automatically, this usually means that you have to implement steps yourself to circumvent these protective barriers.

One example of this is the header information that is sent with every request. It can be changed before making the request, e.g. by supplying customized headers. But there are probably more adjustments needed beyond that.

If you're interested in starting to develop such a thing (leaving aside the question of whether this is allowed at all), you can take the following as a starting point:

from collections import namedtuple
from contextlib import suppress

import requests
from requests import ReadTimeout

Link = namedtuple("Link", ["url", "filename"])
links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}

for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)

Here, protected sites that lead to ReadTimeout errors are simply ignored, and there is room to go further - e.g. by enhancing requests.get(link.url, timeout=3) with a suitable headers parameter. But as I already mentioned, this is probably not the only customization that would have to be made, and the legal aspects should also be clarified.
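
As a concrete follow-up to the headers hint, here is a hedged sketch of passing browser-like headers via the headers parameter of requests.get. The header values are illustrative assumptions only; whether they are enough to get past any particular site's bot protection is not guaranteed:

import requests
from requests import ReadTimeout

# Browser-like headers; purely illustrative values, not a guaranteed bypass.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-GB,en;q=0.9",
}

try:
    response = requests.get("https://eu.mouser.com/", headers=headers, timeout=3)
    print(response.status_code, len(response.text))
except ReadTimeout:
    print("Still timed out despite the custom headers")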
