Python: unable to convert the result of scraping an XML file (from sec.gov) from raw bytes
I am trying to scrape an XML file from sec.gov and convert it to one long string, but the request returns a byte string that looks like a bunch of addresses. I don't know how to get it back as a string, or as an object that I can convert to a string.
For example, I just want a string form of this:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
<author>
<email>[email protected]</email>
<name>Webmaster</name>
</author>
<company-info>
<addresses>
<address type="mailing">
<city>CHATHAM</city>
<state>NJ</state>
<street1>26 MAIN STREET, SUITE 101</street1>
<zip>07928</zip>
</address>
<address type="business">
<city>CHATHAM</city>
Here is my code:
#!/usr/bin/python3
from lxml import html
from lxml import etree
import requests
from time import sleep
import json
import argparse
from random import randint
import sys
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from pprint import pprint
import traceback
from xml.etree import ElementTree

def parse_finance_page(urlAddress):
    headers = {
        "Accept-Encoding":"gzip, deflate",
        "Accept-Language":"en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7",
        "Connection":"keep-alive",
        "Cache-Control":"no-store, no-cache, must-revalidate, max-age=0'",
        "Cache-Control":"post-check=0, pre-check=0",
        "Pragma":"no-cache",
        "Host":"www.sec.gov",
        "Referer":"https://www.sec.gov",
        "Upgrade-Insecure-Requests":"1",
        "User-Agent":"[email protected]"
    }
    for retries in range(5):
        try:
            request = Request("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001430306&type=&dateb=&owner=include&start=0&count=40&output=atom", headers=headers)
            html = urlopen(request).read()
            print(html)
            xmlString = ElementTree.tostring(html, encoding='unicode')
            print(xmlString)
            quit()
The first print outputs a byte string that looks like a bunch of memory addresses, for example:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9dko\xdb8\x16\x86\xbf\xef\xaf \xf2a0\x83\xadc\x92\xba\xab\xb1g\xbdi\x8af\xda\xa4\x9d&3\xbb\x8b\xc5b\xa0\xd8t"\xd4\x96\x0cIn\x92\xf9\xf5\xcbC\xc9\x97X\x8adMMZq\x05\x14\xa9M\xd32ux\xce\xfbH\xbc\x1c\x9d\xfc\xfc0\x9d\xa0\xaf,\x8a\xfd0\xe8\x1d\x91c|\x84X0\x0cG~p\xdb;:\xbf\xfa\xd8\xb1m\xc3\xe9\x90#\xf4s\xffo\x08\x9d\x8c\x19\x1b!\xfe\x95 \xee\x1d\xdd%\xc9\xcc\xedv\xef\xef\xef\x8f\xef\xb5\xe30\xba\xedR\x8c\x8d\xee \t\xa7GP\x99W\xf7\xe6\xc9]\x18\xa5o\xf8[6\xf5\xfcI\xff\x9e\xddL\xbd8a\xd1?b6<\xbe\r\xbf\x9et\xd3\x0f\x16\xd5\x02o\xca\xfa\xffZ\xd4:\xe9\x8a\xf7\xe9\x01\xbb\xebG<\x19\x86\xd3\x99\x17<v\xfc\x1c.\xbf\xed\x8dF\x11\x8bc\x16/JVe(y\x9c\xb1\xde\x11\xfc\x18?\xbf\xa3U\x058\x96\x9f<\xf6O\xdf\r\xae\xdf\r.N\xba\xe2\xdd\xfa\xc7q\xe2%\xac\x7f\xf9\xcbI7}\xf5\xf4\xb3\x88\xb1\x84\xf4\xa9\x89.\x06\xe7\x97\xe8\xea\xfa\xf3\xd9\xd9\xf5+t\xf5\xdb\xf9\xf5\x19"\x98\xc0\x97\xd2*\xeb_\xfb\xd3\x9f\xf5\xb1\xe5P\xfb\xa4\x0b/W\xad\xedf\xcd}\xf6\x04n\xe6\xb1\x1f\xf0\xb7\xb5\xce
v\x17\x06\xacO\t\xed86\xee8\xc40N\xbaiYS\xcesY\xb0\xea\xbb\x13/\x8e\xfd\xdb\x80\x8d:\xb1?\xecS[\xd3y\xa5\xf5\xa2\xa2z\x9d\x11\x8b\x87\xfdO\xef\x06\x9f/\x06\xa7g\xbf]\x9f\x9f\x0e>\xa0O\x9f\xcf>\r>\x0f\xae\xcf?
And it goes on and on and on...
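A possible reading of the output above: the bytes are not memory addresses. They begin with \x1f\x8b, the gzip magic number, which suggests the server honoured the "Accept-Encoding": "gzip, deflate" request header and sent a compressed body. urllib.request does not decompress response bodies on its own. The following is a minimal sketch of one way to handle that, assuming the body really is gzip and the text is ISO-8859-1 as the XML declaration in the sample states. The names decode_body, GZIP_MAGIC and sample are hypothetical and used only for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw: bytes, encoding: str = "ISO-8859-1") -> str:
    """Return a response body as text, decompressing it first if it is gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Offline stand-in for `urlopen(request).read()`: a gzipped XML fragment.
sample = gzip.compress(b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n<feed/>')
print(sample[:2])           # the gzip magic bytes
print(decode_body(sample))  # readable XML text
```

In the code from the question this would be applied as decode_body(urlopen(request).read()). Dropping the Accept-Encoding header from the request, or fetching with requests.get (which decodes gzip responses transparently), may be simpler alternatives.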
The lines:
    xmlString = ElementTree.tostring(html, encoding='unicode')
    print(xmlString)
just produce:
Failed to process the request, Exception:'bytes' object has no attribute 'iter'
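The error message suggests that ElementTree.tostring was handed a bytes object. It serializes an already parsed Element, so raw bytes need to go through ElementTree.fromstring first. Below is a minimal sketch, assuming the input is uncompressed XML bytes. The helper name xml_bytes_to_string and the sample document are hypothetical, and registering the Atom namespace as the default prefix is an optional choice made here so the output keeps plain tag names.

```python
from xml.etree import ElementTree

ATOM_NS = "http://www.w3.org/2005/Atom"  # namespace used by the feed in the question


def xml_bytes_to_string(xml_bytes: bytes) -> str:
    """Parse XML bytes into an Element, then serialize that Element to a str."""
    # Optional: serialize Atom elements without an auto-generated "ns0:" prefix.
    ElementTree.register_namespace("", ATOM_NS)
    root = ElementTree.fromstring(xml_bytes)  # bytes -> Element
    return ElementTree.tostring(root, encoding="unicode")  # Element -> str


# Hypothetical, shortened stand-in for the decompressed response body.
sample_xml = (
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b"<author><name>Webmaster</name></author>"
    b"</feed>"
)
print(xml_bytes_to_string(sample_xml))
```

If only the text is needed and no parsing, decoding the bytes with the encoding named in the XML declaration is enough and skips ElementTree entirely.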
Any help is greatly appreciated.