Avoiding looping when testing for elements in lxml
I am processing some tables using lxml. The original source files are Excel files saved in MHTML format. I need to find the rows that contain header ('th') elements. I want to use the header elements, but I also need the rows they came from to make sure I process everything in order.
So what I have been doing is finding all of the th elements and then calling e.getparent() on each one to get its row (since a th is a child of a row). But I end up having to pull the th elements twice: once to find them and get the rows, and then again to take them out of the rows to parse the data I am looking for.
This can't be the best way to do this, so I am wondering if there is something I am missing.
Here is my code:
from lxml import html
theString=unicode(open('c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls').read(),'UTF-8','replace')
theTree=html.fromstring(theString)
tables=[e for e in theTree.iter() if e.tag=='table']
for table in tables :
    headerCells=[e for e in table.iter() if e.tag=='th']
    headerRows=[]
    for headerCell in headerCells:
        if headerCell.getparent().tag=='tr':
            if headerCell.getparent() not in headerRows:
                headerRows.append(headerCell.getparent())
    for headerRow in headerRows:
        newHeaderCells=[e for e in headerRow.iter() if e.tag=='th']
        #Now I will extract some data and attributes from the th elements
Comments (2)
Iterate over all tr tags, and just move on to the next one when you find no th inside.
EDIT: This is how:
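A minimal sketch of that approach, not necessarily the answerer's exact code. It assumes the th cells are direct children of their tr; the helper name header_rows and the sample table are hypothetical and only there for illustration.

```python
from lxml import html

def header_rows(table):
    """Yield (row, header_cells) for every tr that directly contains th cells."""
    for row in table.iter('tr'):
        cells = row.findall('th')   # direct th children of this row only
        if not cells:
            continue                # no th inside: move on to the next tr
        yield row, cells

# Hypothetical sample data standing in for one table from the source file.
sample = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2010</td><td>100</td></tr>
  <tr><th>Total</th><td>100</td></tr>
</table>
"""

tree = html.fromstring(sample)
for table in tree.iter('table'):
    for row, cells in header_rows(table):
        # Each th is fetched once, together with the row it belongs to.
        print([cell.text_content() for cell in cells])
```

Because the rows are visited in document order, the header cells come out in order as well, and no second lookup from cell back to row is needed.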
To avoid doing it twice, you could use a dictionary keyed by row element and accumulate all the header cells from a given row into an associated list, which can be done in a single pass through the table's elements. To keep rows ordered by when they were seen, you can use an OrderedDict from the standard library's collections module. This would allow something along these lines to be written:
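The idea could be sketched roughly as follows. The function name header_cells_by_row and the sample table are assumptions made for illustration. The sketch also relies on lxml handing back the same Python object for a given tr while a reference to it is held, which is what lets row elements act as dictionary keys.

```python
from collections import OrderedDict
from lxml import html

def header_cells_by_row(table):
    """Map each tr (in the order first seen) to the list of th cells inside it."""
    rows = OrderedDict()
    for cell in table.iter('th'):       # single pass over the table's th elements
        row = cell.getparent()
        if row is not None and row.tag == 'tr':
            # Create the list the first time a row is seen, then append to it.
            rows.setdefault(row, []).append(cell)
    return rows

# Hypothetical sample data standing in for one table from the source file.
sample = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2010</td><td>100</td></tr>
  <tr><th>Total</th><td>100</td></tr>
</table>
"""

tree = html.fromstring(sample)
for table in tree.iter('table'):
    for row, cells in header_cells_by_row(table).items():
        # Extract data and attributes from the th elements here.
        print([cell.text_content() for cell in cells])
```

On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict is mainly needed on older versions.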