Web scraping in Python with urlopen

Posted on 2024-12-01 02:12:15


I am trying to get the data from this website:
http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

It seems like urlopen doesn't get the HTML code, and I don't understand why.
It goes like this:

html = urllib.request.urlopen("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")
print (html)

My code is right; I get the HTML source of other webpages with the same code, but it seems like it doesn't recognise this address.

It prints: b''

Maybe another library is more appropriate? Why doesn't urlopen return the HTML code of the webpage?
Help, thanks!
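
For reference, printing the return value of urlopen() directly would show an HTTPResponse object rather than b'', so presumably .read() was called on it. A minimal Python 3 sketch of that workflow (same URL as above):

# Python 3: urlopen() returns an HTTPResponse object; .read() gives the body bytes
import urllib.request

url = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"

with urllib.request.urlopen(url) as response:
    print(response.status)   # e.g. 200 even when the body comes back empty
    html = response.read()   # bytes; this is where b'' shows up

print(html[:200])            # inspect the first 200 bytes of the body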


Comments (3)

够钟 2024-12-08 02:12:15


Personally, I write:

# Python 2.7

import urllib

url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
sock = urllib.urlopen(url)
content = sock.read() 
sock.close()

print content

And if you speak French... hello on stackoverflow.com!

update 1

In fact, I now prefer to use the following code, because it is faster:

# Python 2.7

import httplib

conn = httplib.HTTPConnection(host='www.boursorama.com',timeout=30)

req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

try:
    conn.request('GET',req)
except:
    print 'echec de connexion'  # French for "connection failed"

content = conn.getresponse().read()

print content

Changing httplib to http.client, and the print statements to print() calls, should be enough to adapt this code to Python 3.
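
For the record, that adaptation would look roughly like this (a sketch of mine, not part of the original answer, and untested against the site):

# Python 3 port of the httplib snippet above
import http.client

conn = http.client.HTTPConnection(host='www.boursorama.com', timeout=30)

req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

try:
    conn.request('GET', req)
except OSError:
    # catch socket-level failures rather than using a bare except
    print('connection failed')

content = conn.getresponse().read()  # bytes in Python 3

print(content)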


I confirm that with both of these snippets I obtain the source code, in which I see the data you are interested in:

        <td class="L20" width="33%" align="center">11:57:44</td>

        <td class="L20" width="33%" align="center">1.4486</td>

        <td class="L20" width="33%" align="center">0</td>

</tr>

                                        <tr>

        <td  width="33%" align="center">11:57:43</td>

        <td  width="33%" align="center">1.4486</td>

        <td  width="33%" align="center">0</td>

</tr>

update 2

Adding the following snippet to the above code will allow you to extract the data I suppose you want:

for i,line in enumerate(content.splitlines(True)):
    print str(i)+' '+repr(line)

print '\n\n'


import re

regx = re.compile('\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>\r\n')

print regx.findall(content)

result (only the end)

.......................................
.......................................
.......................................
.......................................
98 'window.config.graphics = {};\n'
99 'window.config.accordions = {};\n'
100 '\n'
101 "window.addEvent('domready', function(){\n"
102 '});\n'
103 '</script>\n'
104 '<script type="text/javascript">\n'
105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
107 '\t\t\t\tvar sas_formatids = "8968";\n'
108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
111 "\twindow.addEvent('domready', function(){\r\n"
112 'sas_move(1,8968);\t});\r\n'
113 '</script>\n'
114 '<script type="text/javascript">\n'
115 'var _gaq = _gaq || [];\n'
116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
120 "_gaq.push(['_trackPageLoadTime']);\n"
121 "_gaq.push(['_trackPageview']);\n"
122 '(function() {\n'
123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
126 '})();\n'
127 '</script>\n'
128 '</body>\n'
129 '</html>'



[('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]

I hope you don't plan to "play" at trading on the Forex: it's one of the best ways to lose money rapidly.

update 3

Sorry! I forgot you are using Python 3. So I think you must define the regex like this:

regx = re.compile(b'\t\t\t\t\t......)

that is to say, with a b prefix before the string; otherwise you'll get an error like the one in this question.
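
Concretely, a Python 3 version of the extraction step might look like this (my sketch, assuming content is the bytes body returned by .read() in the code above; \\d escapes the backslash so the regex engine sees \d):

# Python 3: both the pattern and the searched content must be bytes
import re

regx = re.compile(b'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\\d\\d:\\d\\d:\\d\\d)</td>\r\n'
                  b'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\\d.]+)</td>\r\n'
                  b'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\\d+)</td>\r\n')

# findall() on bytes yields tuples of bytes; decode them for display
for t, price, volume in regx.findall(content):
    print(t.decode(), price.decode(), volume.decode())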

双手揣兜 2024-12-08 02:12:15


What I suspect is happening is that the server is sending compressed data without telling you that it's doing so. Python's standard HTTP library can't handle compressed formats.
I suggest getting httplib2, which can handle compressed formats (and is generally much better than urllib).

import httplib2
folder = httplib2.Http('.cache')
response, content = folder.request("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS")

print(response) shows us the response from the server:
{'status': '200', 'content-length': '7787', 'x-sid': '26,E', 'content-language': 'fr', 'set-cookie': 'PHPSESSIONID=ed45f761542752317963ab4762ec604f; path=/; domain=.www.boursorama.com', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding,User-Agent', 'server': 'nginx', 'connection': 'keep-alive', '-content-encoding': 'gzip', 'pragma': 'no-cache', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'date': 'Tue, 23 Aug 2011 10:26:46 GMT', 'content-type': 'text/html; charset=ISO-8859-1', 'content-location': 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'}

While this doesn't confirm that it was zipped (we're now telling the server that we can handle compression, after all), it does lend some weight to the theory.

The actual content lives in, you guessed it, content. Looking at it briefly shows that it's working (I'll just paste a wee bit):
b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n\t"http://

Edit: yes, this does create a folder named .cache; I've found that it's always better to work with folders when it comes to httplib2, and you can always delete the folder afterwards.
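
If you would rather stay with the standard library, one workaround (my suggestion, not from the original answer) is to decompress the body yourself when the server marks it as gzip:

# Python 3 sketch: handle a gzip-compressed response with only the stdlib
import gzip
import urllib.request

url = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"

with urllib.request.urlopen(url) as response:
    body = response.read()
    # urllib does not transparently decompress; check the header and do it ourselves
    if response.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)

print(body[:200])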

慈悲佛祖 2024-12-08 02:12:15


I have tested your URL with httplib2 and with curl on the terminal. Both work fine:

URL = "http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS"
h = httplib2.Http()
resp, content = h.request(URL, "GET")
print(content)

So to me, either there is a bug in urllib.request or some really weird client-server interaction is happening.
