How to find element text in an XHTML document using lxml
I've been bashing my head against this for ages; I must be doing something stupid.
I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias.
Here is my python code so far, which is simply trying to retrieve one of the tables:
import httplib
from lxml import etree

def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET", "/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())  # parse the response as XML
    table = root.xpath('//table')
    print table

main()
On my machine this only prints an empty list. To increase speed I cached the page locally and used:
wikipage = open("wikipage.html")
root = etree.parse(wikipage)
but this makes no impact whatsoever (other than the obvious speedup). I have also tried
root.find('table')
and:
for element in root.iter():
    print("%s - %s" % (element.tag, element.text))
which successfully prints out all of the elements, so I know the tree is being created.
What am I doing wrong?
Any help would be appreciated.
Thanks.
3 Answers
Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that involve such element names is the most frequently asked question about XPath, and it has numerous good answers under the SO xpath tag. Just search for them.

Here is a complete solution: use an XPath expression in which every element name is qualified with a prefix, say "x", that you have registered as bound to the XHTML namespace ("http://www.w3.org/1999/xhtml").

When I evaluated such an XPath expression against the document obtained from http://s23.org/wikistats/wikipedias_html, I needed to add a few entity definitions at the start of the document, because I was working locally and didn't have the DTD for XHTML -- maybe you will not need these.

Do note: in the result of applying the expression, every second selected node is a white-space-only text node. If you don't want these selected, add a predicate that excludes white-space-only text nodes.
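A minimal lxml sketch of this approach (the file name and the //x:table expression are illustrative assumptions, not this answer's original expression):

from lxml import etree

# Bind the prefix "x" to the XHTML namespace and use it to qualify
# every element name in the XPath expression.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

tree = etree.parse("wikipage.html")  # assumed local copy of the page
tables = tree.xpath("//x:table", namespaces=XHTML_NS)
print(len(tables))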
Parse it as HTML instead. lxml's HTML parser does not put the elements into the XHTML namespace, so a plain //table expression matches, and the output is a non-empty list of table elements.
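A minimal sketch of this approach, assuming the page has been cached locally as wikipage.html:

from lxml import html

# The HTML parser produces namespace-free tags, so no prefix
# registration is needed.
tree = html.parse("wikipage.html")
tables = tree.xpath("//table")
print(len(tables))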
XPath requires namespaces. The page you have downloaded starts with a root element that declares a default namespace, along the lines of:

<html xmlns="http://www.w3.org/1999/xhtml">

So you actually want something like //html:table, where html is a prefix bound to "http://www.w3.org/1999/xhtml". You will have to find out how to bind namespaces in lxml - I am not a python expert.

If this is your problem I sympathize - it has caught me and many others out!
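Since this answer leaves the lxml side open, here is a hedged sketch of two standard ways to bind the namespace in lxml (file name assumed; either line works on its own):

from lxml import etree

XHTML = "http://www.w3.org/1999/xhtml"
tree = etree.parse("wikipage.html")  # assumed local copy of the page

# XPath style: bind the "html" prefix via the namespaces argument.
tables = tree.xpath("//html:table", namespaces={"html": XHTML})

# ElementTree style: spell the namespace URI out in Clark notation,
# so no prefix registration is needed.
tables = tree.findall(".//{%s}table" % XHTML)
print(len(tables))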