xml.dom.minidom.parse() fails when an XML attribute contains unicode
I'm querying a web service using urllib2.request and receiving XML. If I violate the web service's rate limit (1 call/second), I receive HTML back saying I've violated the rate limit.
Even though I can time.sleep() for 2-3 seconds after each call, I still, for whatever reason, violate the rate limit.
To test whether my response is XML or HTML, I'm using xml.dom.minidom.parseString() and then testing for the presence of an html element:
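    import xml.dom.minidom
    import xml.parsers.expat

    def is_xml_response(response_text):  # function name is illustrative
        try:
            dom = xml.dom.minidom.parseString(response_text)
        except xml.parsers.expat.ExpatError:
            # Not well-formed, so treat the response as "not XML"
            return False
        # The rate-limit page is well-formed XHTML, so also check for <html>
        return len(dom.getElementsByTagName('html')) == 0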
This gets the job done, but I've run into a case where an attribute value in one of the XML responses contains an escaped character. In that case, the parseString() command fails with:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/python/default-2.6/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
        return expatbuilder.parse(file)
      File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 924, in parse
        result = builder.parseFile(fp)
      File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
        parser.Parse(buffer, 0)
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 3125
In this case, column 3125 is part of some attribute value text that contains the character reference &#x9; (ampersand-pound-x-9; Stack Overflow is hiding my unicode).
Should xml.dom.minidom be able to handle this? Could there be another issue with the XML besides this that's causing the parsing to fail?
Additionally, I'm open to other ways of handling this type of situation if the community has one.
If it helps, here is what the web service returns when I've violated their rate limit:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="eng">
<head>
<title>Service Temporarily Unavailable - Rate Limited</title>
</head>
<body style="text-align:center;background-color:white;">
<h1>Service Temporarily Unavailable</h1>
<hr />
<div>
You have used this service too often in a short time. Please wait before using this service again.
<br/><br/>
Please visit the <a href="http://wiki.xxxx.com/index.php?title=API_Usage">wiki</a> for more details.
</div>
</body>
</html>
Comments (2)
I think that &#x9; is a tab. You could try http://docs.python.org/library/htmllib.html#module-htmlentitydefs to convert special HTML entities back to the characters they stand for. (That may run into trouble with &lt; and similar entities, which must stay escaped.) Or you could do a string substitution that replaces &#x9; with a space.
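The string substitution mentioned above could look roughly like the following. This is a minimal sketch, not a tested fix for the asker's data: the helper name `replace_tab_reference` and the sample element are made up for illustration, and it assumes the response is already a text string.

```python
import xml.dom.minidom


def replace_tab_reference(xml_text):
    """Swap the tab character reference for a plain space before parsing.

    Hypothetical helper: it only handles the exact spelling "&#x9;".
    """
    return xml_text.replace("&#x9;", " ")


# Made-up sample standing in for the real web service response.
sample = '<item note="a&#x9;b"/>'
cleaned = replace_tab_reference(sample)

dom = xml.dom.minidom.parseString(cleaned)
note = dom.documentElement.getAttribute("note")
```

Note that this changes the data (the tab becomes a space), so it is only appropriate if the exact whitespace in the attribute does not matter.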
Just as a suggestion, when you're parsing stuff, and the parser runs into a problem, such as not fitting your pattern, instead of stopping the operation, you should allow the parser to continue, but spit out a warning. This way you can see what the problem is, and potentially correct it, or at least see that there's a problem.
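One way to approximate the "warn and carry on" idea is sketched below. expat itself cannot resume after a well-formedness error, so this sketch assumes that "continue" means moving on to the next response rather than finishing the broken one. The function name `parse_or_warn`, the logger name, and the `context` parameter are illustrative assumptions; `lineno` and `offset` are attributes that `ExpatError` provides.

```python
import logging
import xml.dom.minidom
from xml.parsers.expat import ExpatError

log = logging.getLogger("xml_check")  # logger name is arbitrary


def parse_or_warn(text, context=20):
    """Return a DOM, or log where parsing failed and return None."""
    try:
        return xml.dom.minidom.parseString(text)
    except ExpatError as err:
        # Pull out the text around the reported position so the
        # offending characters show up in the log.
        lines = text.splitlines()
        bad_line = lines[err.lineno - 1] if 0 < err.lineno <= len(lines) else ""
        start = max(err.offset - context, 0)
        snippet = bad_line[start:err.offset + context]
        log.warning("XML parse failed at line %d, column %d: %r",
                    err.lineno, err.offset, snippet)
        return None
```

Logging the snippet around the reported column makes it easier to see what is actually sitting at a position like column 3125, instead of guessing from the error message alone.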
Also as to your problem with the rate limit, why not cache the requested HTML once so you can perform processing locally.
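The caching suggestion could be sketched as a small on-disk cache keyed by URL. Everything here is an assumption made for illustration: the name `cached_fetch`, the `cache_dir` layout, and the `fetch` callable, which stands in for whatever code actually performs the HTTP request and returns the body as bytes.

```python
import hashlib
import os


def cached_fetch(url, fetch, cache_dir="cache"):
    """Return the body for url, calling fetch(url) only on a cache miss.

    fetch is a caller-supplied function (hypothetical here) that
    performs the real request and returns the response body as bytes.
    """
    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)
    # Hash the URL so it is safe to use as a file name.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".xml"
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return handle.read()
    body = fetch(url)
    with open(path, "wb") as handle:
        handle.write(body)
    return body
```

In practice you would probably want to check the body before writing it, so that a rate-limit HTML page is never stored as if it were a real result.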
You could also test the string for HTML before attempting to parse the result:
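The answer's own snippet is not preserved here, so the following is only a guess at the general shape of such a check. It assumes the response is a text string and that the rate-limit page always starts with a DOCTYPE or an html tag, as in the sample from the question. The function name `looks_like_html` is made up.

```python
def looks_like_html(response_text):
    """Cheap check on the start of the response, without parsing it."""
    # Only the first few characters matter; lower-case them so that
    # "<!DOCTYPE html" and "<HTML" are both recognised.
    head = response_text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")
```

Because this never calls the XML parser, a malformed attribute elsewhere in a genuine XML response cannot cause it to be misclassified as a rate-limit page.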
I also couldn't get minidom to blow up on an attribute containing a tab entity. Perhaps it is an improperly terminated entity sequence, like &#x9 without the ending semicolon? Minidom seems okay with properly escaped entities inside attributes:
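The original demonstration is missing from this copy, so here is a small sketch of the comparison being described. The element and attribute names are invented, and `attribute_after_parse` is a hypothetical helper.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def attribute_after_parse(xml_text, name):
    """Parse xml_text and return the named attribute of the root element."""
    dom = xml.dom.minidom.parseString(xml_text)
    return dom.documentElement.getAttribute(name)


# Properly terminated reference: parses, and the attribute holds a real tab.
terminated = attribute_after_parse('<item note="a&#x9;b"/>', "note")

# Missing semicolon: expat rejects the document as not well-formed.
try:
    attribute_after_parse('<item note="a&#x9 b"/>', "note")
    unterminated_failed = False
except ExpatError:
    unterminated_failed = True
```

If the real response behaves like the second case, the problem lies in how the web service escapes the attribute rather than in minidom.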