Python: cannot convert the raw bytes from scraping an XML file (from sec.gov) into a string

Posted on 2025-02-07 04:56:00


I am trying to scrape an XML file from sec.gov and convert it to one long string, but what comes back is a byte string that looks like a bunch of addresses. I don't know how to get it back as a string, or as an object that I can convert to a string.

For example, I just want this in string form:

<?xml version="1.0" encoding="ISO-8859-1" ?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <author>
      <email>[email protected]</email>
      <name>Webmaster</name>
    </author>
    <company-info>
      <addresses>
        <address type="mailing">
          <city>CHATHAM</city>
          <state>NJ</state>
          <street1>26 MAIN STREET, SUITE 101</street1>
          <zip>07928</zip>
        </address>
        <address type="business">
          <city>CHATHAM</city>

Here is my code:

#!/usr/bin/python3

from lxml import html
from lxml import etree
import requests
from time import sleep
import json
import argparse
from random import randint
import sys
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from pprint import pprint
import traceback
from xml.etree import ElementTree


def parse_finance_page(urlAddress):

  headers = {
          "Accept-Encoding":"gzip, deflate",
          "Accept-Language":"en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7",
          "Connection":"keep-alive",
          # A dict keeps only the last value for a repeated key, so both
          # Cache-Control directive lists are merged into a single entry.
          "Cache-Control":"no-store, no-cache, must-revalidate, max-age=0, post-check=0, pre-check=0",
          "Pragma":"no-cache",
          "Host":"www.sec.gov",
          "Referer":"https://www.sec.gov",
          "Upgrade-Insecure-Requests":"1",
          "User-Agent":"[email protected]"
  }

  for retries in range(5):
    try:

      request = Request("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001430306&type=&dateb=&owner=include&start=0&count=40&output=atom", headers=headers)
      html = urlopen(request).read()

      print(html)
  
      xmlString = ElementTree.tostring(html, encoding='unicode')
      print(xmlString)

      quit()

The first print just prints out what looks like a byte string of a bunch of memory addresses, for example:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9dko\xdb8\x16\x86\xbf\xef\xaf \xf2a0\x83\xadc\x92\xba\xab\xb1g\xbdi\x8af\xda\xa4\x9d&3\xbb\x8b\xc5b\xa0\xd8t"\xd4\x96\x0cIn\x92\xf9\xf5\xcbC\xc9\x97X\x8adMMZq\x05\x14\xa9M\xd32ux\xce\xfbH\xbc\x1c\x9d\xfc\xfc0\x9d\xa0\xaf,\x8a\xfd0\xe8\x1d\x91c|\x84X0\x0cG~p\xdb;:\xbf\xfa\xd8\xb1m\xc3\xe9\x90#\xf4s\xffo\x08\x9d\x8c\x19\x1b!\xfe\x95 \xee\x1d\xdd%\xc9\xcc\xedv\xef\xef\xef\x8f\xef\xb5\xe30\xba\xedR\x8c\x8d\xee \t\xa7GP\x99W\xf7\xe6\xc9]\x18\xa5o\xf8[6\xf5\xfcI\xff\x9e\xddL\xbd8a\xd1?b6<\xbe\r\xbf\x9et\xd3\x0f\x16\xd5\x02o\xca\xfa\xffZ\xd4:\xe9\x8a\xf7\xe9\x01\xbb\xebG<\x19\x86\xd3\x99\x17<v\xfc\x1c.\xbf\xed\x8dF\x11\x8bc\x16/JVe(y\x9c\xb1\xde\x11\xfc\x18?\xbf\xa3U\x058\x96\x9f<\xf6O\xdf\r\xae\xdf\r.N\xba\xe2\xdd\xfa\xc7q\xe2%\xac\x7f\xf9\xcbI7}\xf5\xf4\xb3\x88\xb1\x84\xf4\xa9\x89.\x06\xe7\x97\xe8\xea\xfa\xf3\xd9\xd9\xf5+t\xf5\xdb\xf9\xf5\x19"\x98\xc0\x97\xd2*\xeb_\xfb\xd3\x9f\xf5\xb1\xe5P\xfb\xa4\x0b/W\xad\xedf\xcd}\xf6\x04n\xe6\xb1\x1f\xf0\xb7\xb5\xcev\x17\x06\xacO\t\xed86\xee8\xc40N\xbaiYS\xcesY\xb0\xea\xbb\x13/\x8e\xfd\xdb\x80\x8d:\xb1?\xecS[\xd3y\xa5\xf5\xa2\xa2z\x9d\x11\x8b\x87\xfdO\xef\x06\x9f/\x06\xa7g\xbf]\x9f\x9f\x0e>\xa0O\x9f\xcf>\r>\x0f\xae\xcf?

And it goes on and on and on...
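
A note on that output: the bytes are not memory addresses. The leading \x1f\x8b\x08 is the signature of a gzip stream, which is consistent with the request advertising "Accept-Encoding: gzip, deflate"; urllib.request does not decompress response bodies on its own. What follows is a minimal sketch, not a definitive fix, of turning such a body into text. The helper name body_to_text and the sample payload are hypothetical stand-ins so the example runs offline, and the ISO-8859-1 default is an assumption taken from the XML declaration shown earlier.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def body_to_text(raw: bytes, encoding: str = "ISO-8859-1") -> str:
    """Decompress `raw` if it looks like gzip data, then decode it to str.

    Hypothetical helper: the default encoding is assumed from the feed's
    XML declaration, not taken from the HTTP response headers.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Offline stand-in for `urlopen(request).read()`: a small gzip-compressed feed.
sample_xml = (
    '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    '<feed xmlns="http://www.w3.org/2005/Atom"><city>CHATHAM</city></feed>'
)
compressed = gzip.compress(sample_xml.encode("ISO-8859-1"))

text = body_to_text(compressed)
print(text)
```

In the code from the question, the same helper would be applied to the result of urlopen(request).read(). Dropping the Accept-Encoding header, or using the requests library (which decodes gzip bodies transparently), are alternatives that may also be worth trying.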

The lines:

xmlString = ElementTree.tostring(html, encoding='unicode')
print(xmlString)

Just produce:

Failed to process the request, Exception:'bytes' object has no attribute 'iter'

Any help is greatly appreciated.
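
On the error itself: ElementTree.tostring serializes an Element object, so handing it raw bytes fails as soon as it tries to walk the tree. Here is a hedged sketch of the round trip, assuming the bytes are already uncompressed XML. The xml_bytes sample is made up from the snippet in the question. It parses with ElementTree.fromstring first and only then serializes with tostring. If plain text is all that is needed, decoding the bytes is enough and no parsing is required.

```python
from xml.etree import ElementTree

# Made-up, already-decompressed sample modeled on the feed in the question.
xml_bytes = (
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>'
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b"<author><name>Webmaster</name></author>"
    b"</feed>"
)

# fromstring() turns bytes into an Element; the parser honors the
# encoding named in the XML declaration.
root = ElementTree.fromstring(xml_bytes)

# tostring() accepts that Element; encoding="unicode" returns a str
# instead of bytes.
xml_string = ElementTree.tostring(root, encoding="unicode")
print(xml_string)

# Elements in the Atom namespace are addressed with the namespace URI.
ns = {"atom": "http://www.w3.org/2005/Atom"}
author_name = root.find("atom:author/atom:name", ns).text
print(author_name)
```

Note that the serialized string is not byte-for-byte identical to the input: the standard library rewrites namespace prefixes when it serializes, so decoding the bytes directly is the better choice when the original text must be preserved.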
