Python 3 urlopen adds a \xef\xbb\xbf prefix to downloaded text files
I am working on a script to download and process historical stock prices.
When I used urllib.request.urlopen, every file started with a strange prefix (b'\xef\xbb\xbf) that was not present when I used urllib.request.urlretrieve, nor when I typed the URL into a browser (Firefox).
So I have a working solution, but I don't know why the problem occurred in the first place. I suspect it is because I forced the response into a string, but I don't know why that would matter or how I would work around it (other than using urlretrieve instead).
The code is below. The relevant line is the urlretrieve call. The commented-out code after it is what I had when I was using urlopen.
#download a bunch of historical stock quotes from google finance
import urllib.request
symbolarray = []
symbolfile = open("symbols.txt")
for line in symbolfile:
    symbolarray.append(line.strip())
symbolfile.close()
for symbol in symbolarray:
    page = urllib.request.urlretrieve("http://www.google.com/finance/historical?q=NYSE:"+symbol+"&output=csv",symbol+".csv")
    #datafile = open(symbol+".csv","w")
    #datafile.write(str(page.read()))
    #datafile.close()
Comments (1)
0xEF, 0xBB, 0xBF is the byte order mark (BOM) for UTF-8. It signals that the data is UTF-8 encoded. I'm guessing that if you use Wireshark you'll see that it was there all along. It's just that most programs ignore it.
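To see why the prefix shows up as visible text, here is a minimal sketch that uses a made-up payload in place of a real page.read() result (the CSV header is an assumption, not actual Google Finance output). Calling str() on a bytes object without an encoding does not decode it. It returns the repr, so the resulting string literally begins with the characters b'\xef\xbb\xbf.

```python
import codecs

# Made-up payload standing in for page.read(): a UTF-8 BOM plus a CSV header.
raw = codecs.BOM_UTF8 + b"Date,Open,Close\n"

# The first three bytes are the BOM described above.
has_bom = raw.startswith(b"\xef\xbb\xbf")

# str() on bytes with no encoding returns the repr of the object, so the
# b'...' wrapper and the escaped BOM bytes become ordinary characters.
as_text = str(raw)

print(has_bom)
print(as_text)
```

Writing as_text to a file therefore stores the repr, including the escaped BOM and an escaped \n instead of a real line break, which matches the prefix seen in the question.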
Instead of
    str(page.read())
you should try
    page.read().decode('utf-8-sig')
if you want to remove the BOM. If you want to keep it, you can decode with just 'utf-8'.
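Applied to the urlopen version of the script, the suggestion could look roughly like the sketch below. The helper name download_text is hypothetical, and the code assumes the server really does send UTF-8. If it sends a different encoding, the decode call would need to change.

```python
import urllib.request


def download_text(url, dest_path):
    """Fetch url and save the body as UTF-8 text without a leading BOM.

    Hypothetical helper for illustration; assumes the response is UTF-8.
    """
    with urllib.request.urlopen(url) as page:
        # decode() converts the raw bytes to text; 'utf-8-sig' also drops
        # a leading BOM if one is present.
        text = page.read().decode("utf-8-sig")
    # newline="" writes line endings exactly as they were received.
    with open(dest_path, "w", encoding="utf-8", newline="") as datafile:
        datafile.write(text)
    return text
```

Inside the loop from the question this would be called as download_text("http://www.google.com/finance/historical?q=NYSE:" + symbol + "&output=csv", symbol + ".csv"). Another option is to skip decoding altogether and write page.read() to a file opened in "wb" mode, which keeps the bytes, BOM included, exactly as sent.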