xml.dom.minidom.parse() fails when an XML attribute contains unicode

Posted on 2024-11-05 12:03:49

I'm querying a web service using urllib2.request and receiving XML. If I violate the web service's rate limit (1 call/second), I receive HTML back saying I've violated the rate limit.

Even though I can time.sleep() for 2-3 seconds after each call, I still, for whatever reason, violate the rate limit.

To test whether my response is XML or HTML, I'm using xml.dom.minidom.parseString() and then testing for the presence of an html element:

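import xml.dom.minidom
import xml.parsers.expat

def is_xml_response(response_text):  # hypothetical name; only the body was posted
    try:
        dom = xml.dom.minidom.parseString(response_text)
    except xml.parsers.expat.ExpatError:
        # Not well-formed at all, so not usable XML.
        return False

    # The rate-limit page is well-formed XHTML, so also check for an <html> element.
    return len(dom.getElementsByTagName('html')) == 0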

This gets the job done, but I've run into a case where one of the XML attributes contains XML. In that case, the parseString() call fails with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 3125

In this case, column 3125 is part of some attribute value text that contains &#x9; (ampersand-pound-x-9, spelled out because Stack Overflow hides the character reference).

Should xml.dom.minidom be able to handle this? Could there be another issue with the XML besides this that's causing the parsing to fail?

Additionally, I'm open to other ways of handling this type of situation if the community has one.

If it helps, here is what the web service returns when I've violated their rate limit:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="eng">
    <head>
        <title>Service Temporarily Unavailable - Rate Limited</title>
    </head> 
    <body style="text-align:center;background-color:white;"> 
        <h1>Service Temporarily Unavailable</h1>
        <hr />
        <div>
            You have used this service too often in a short time.  Please wait before using this service again.
            <br/><br/>
            Please visit the <a href="http://wiki.xxxx.com/index.php?title=API_Usage">wiki</a> for more details.
        </div> 
    </body> 
</html>

Comments (2)

把梦留给海 2024-11-12 12:03:49

I think &#x9; is a tab. You could try htmlentitydefs (http://docs.python.org/library/htmllib.html#module-htmlentitydefs) to convert special HTML entities back to the characters they stand for. (That may cause problems with &lt; and similar entities.) Or you could do a string substitution that replaces &#x9; with a space.

Just as a suggestion, when you're parsing stuff, and the parser runs into a problem, such as not fitting your pattern, instead of stopping the operation, you should allow the parser to continue, but spit out a warning. This way you can see what the problem is, and potentially correct it, or at least see that there's a problem.

Also, as for your problem with the rate limit, why not cache the requested HTML once so you can perform processing locally?
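
To illustrate the "warn instead of stopping" suggestion, here is a minimal sketch, not something from the original posts. The helper name parse_or_warn and the context parameter are made up for illustration, and it assumes the response is a text string. It relies on the lineno and offset attributes that ExpatError carries, so the warning shows the text around the failing column (such as column 3125 in the question).

```python
import warnings
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parse_or_warn(text, context=20):
    """Return a DOM, or None after warning about where the parse failed.

    Hypothetical helper: the name and the `context` window are assumptions.
    """
    try:
        return xml.dom.minidom.parseString(text)
    except ExpatError as err:
        # err.lineno is 1-based, err.offset is the 0-based column.
        lines = text.splitlines()
        line = lines[err.lineno - 1] if 0 < err.lineno <= len(lines) else ''
        start = max(err.offset - context, 0)
        snippet = line[start:err.offset + context]
        warnings.warn('XML parse error at line %d, column %d near %r: %s'
                      % (err.lineno, err.offset, snippet, err))
        return None
```

The caller can then treat None as "could not parse" and carry on with the next response, while the warning records which part of the document the parser rejected.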

年少掌心 2024-11-12 12:03:49

You could also test the string for HTML before attempting to parse the result:

if response_text.lstrip().startswith('<!DOCTYPE html'):
    # we received an html response, sleep again
...
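
One way the "sleep again" branch could be fleshed out is sketched below. This is only an illustration: fetch is a hypothetical callable that performs the request and returns the response text, and the doubling delay is an assumption of this sketch, not something the web service documents.

```python
import time


def looks_like_html(response_text):
    # Same check as above: the rate-limit page starts with an HTML doctype.
    return response_text.lstrip().startswith('<!DOCTYPE html')


def fetch_xml(fetch, max_tries=5, delay=2.0, sleep=time.sleep):
    """Call `fetch` until it returns something other than the rate-limit page.

    `fetch`, `max_tries`, `delay` and `sleep` are hypothetical parameters.
    """
    for attempt in range(max_tries):
        response_text = fetch()
        if not looks_like_html(response_text):
            return response_text
        # Rate limited: wait longer after each failed attempt.
        sleep(delay * (2 ** attempt))
    raise RuntimeError('still rate limited after %d tries' % max_tries)
```

Passing sleep in as a parameter keeps the function easy to test without actually waiting.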

I also couldn't get minidom to blow up on an attribute containing a tab entity. Perhaps it is an improperly terminated entity sequence, like &#x9 without the ending semicolon? Minidom seems okay with properly-escaped entities inside attributes:

from xml.dom import minidom

text = '<root><a href="&#x9;foo&lt;">link</a></root>'
tree = minidom.parseString(text)
print(tree.toxml())

u'<?xml version="1.0" ?>\n<root><a href="\tfoo&lt;">link</a></root>'
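
The unterminated-reference guess can be checked with a small sketch like the following. The helper name parses and the sample strings are made up for illustration; the only difference between the two inputs is the missing semicolon.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses(text):
    """Return True if minidom accepts `text`, False on an ExpatError."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


good = '<root><a href="&#x9;foo">link</a></root>'  # properly terminated
bad = '<root><a href="&#x9 foo">link</a></root>'   # semicolon missing

print(parses(good))  # True
print(parses(bad))   # False

# The terminated reference comes back as a real tab character.
href = minidom.parseString(good).documentElement.firstChild.getAttribute('href')
print(repr(href))
```

If the service really emits an unterminated reference, the XML itself is not well-formed, so any conforming parser would be expected to reject it, not just minidom.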