Web scraping in Python with urlopen
I am trying to get the data from this website:
http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS
It seems like urlopen doesn't get the HTML code, and I don't understand why. It goes like:

import urllib.request

html = urllib.request.urlopen("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS").read()
print(html)

My code is right: I get the HTML source of other webpages with the same code, but it seems like it doesn't recognise this address.
It prints: b''
Maybe another library is more appropriate? Why doesn't urlopen return the HTML code of the webpage?
Help, thanks!
Comments (3)
Personally, I write:
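The snippet itself did not survive in this copy of the answer; a minimal sketch of the kind of fetch that works here, assuming the problem is the default Python user agent (the header value is illustrative, not from the original):

import urllib.request

# Assumption: the elided snippet fetched the page with an explicit,
# browser-like User-Agent; the exact code was not preserved
url = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read()
print(html[:200])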
And if you speak French... hello on stackoverflow.com!
update 1
In fact, I now prefer to employ the following code, because it is faster:
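The code was elided here as well; a minimal Python 2 sketch of a connection-based httplib fetch, which is what the adaptation note below implies the answer used (the structure is an assumption; only the URL comes from the question):

import httplib  # Python 2 module name; see the note below for Python 3

conn = httplib.HTTPConnection("www.boursorama.com")
conn.request("GET", "/includes/cours/last_transactions.phtml?symbole=1xEURUS")
resp = conn.getresponse()
data = resp.read()  # the page source, as a (byte) string
conn.close()
print(data)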
Changing httplib to http.client in this code should be enough to adapt it to Python 3.
I confirm that, with these two codes, I obtain the source code in which I can see the data you are interested in:
update 2
Adding the following snippet to the above code will allow you to extract the data I suppose you want:
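That snippet is also missing from this copy; a hypothetical sketch of such an extraction, applied to the data variable from the sketch above (the pattern is an illustration, not the answer's actual regex):

import re

# Hypothetical pattern: grab the contents of the table cells
# (time, price and volume of the last transactions) from `data`
transactions = re.findall(r'<td[^>]*>\s*([^<]+?)\s*</td>', data)
print(transactions)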
result (only the end)
I hope you don't plan to "play" at trading on the Forex: it's one of the best ways to lose money rapidly.
update 3
SORRY! I forgot you are on Python 3. So I think you must define the regex like this:
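The corrected pattern was elided too; carrying over the hypothetical pattern from update 2, the Python 3 form would be:

# read() returns bytes under Python 3, so the pattern must be bytes too
transactions = re.findall(rb'<td[^>]*>\s*([^<]+?)\s*</td>', data)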
That is to say, with b before the string; otherwise you'll get an error like in this question.
What I suspect is happening is that the server is sending compressed data without telling you that it's doing so. Python's standard HTTP library can't handle compressed formats.
I suggest getting httplib2, which can handle compressed formats (and is generally much better than urllib).
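The answer's code is missing from this copy; a minimal call consistent with the response, content and .cache references that follow (the cache directory name matches the edit at the end):

import httplib2

# Passing a directory name enables httplib2's on-disk cache;
# this is what creates the .cache folder mentioned in the edit below
h = httplib2.Http('.cache')
response, content = h.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")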
print(response)
shows us the response from the server:

{'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', 'content-location': 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'}
While this doesn't confirm that it was zipped (we're now telling the server that we can handle compression, after all), it does lend some weight to the theory.
The actual content lives in, you guessed it, content. Looking at it briefly shows us that it's working (I'm just gonna paste a wee bit):

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n\t"http://
Edit: yes, this does create a folder named .cache; I've found that it's always better to work with folders when it comes to httplib2, and you can always delete the folder afterwards.
I have tested your URL with httplib2 and on the terminal with curl. Both work fine:
So to me, either there is a bug in urllib.request or there is really weird client-server interaction happening.
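One way to probe that interaction from plain urllib.request (a sketch, not part of the original answers; the headers are assumptions) is to fetch the raw bytes and gunzip them yourself if they arrive compressed:

import gzip
import io
import urllib.request

# Assumption: the server may gzip the body no matter what the client
# advertises, so check for the gzip magic bytes and decompress manually
url = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip",
                                           "User-Agent": "Mozilla/5.0"})
raw = urllib.request.urlopen(req).read()
if raw[:2] == b'\x1f\x8b':  # gzip magic number
    raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
print(raw[:200])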