Python, lxml and XPath - HTML table parsing

Posted on 2024-08-07 11:54:18 · 832 words · 1 view · 0 comments

I'm new to lxml, quite new to Python, and could not find a solution to the following:

I need to import a few tables with 3 columns and an undefined number of rows starting at row 3.

When the second column of any row is empty, this row is discarded and the processing of the table is aborted.

The following code prints the table's data fine (but I'm unable to reuse the data afterwards):

from lxml.html import parse

def process_row(row):  
    for cell in row.xpath('./td'):  
        print cell.text_content()  
        yield cell.text_content()  

def process_table(table):  
    return [process_row(row) for row in table.xpath('./tr')]

doc = parse(url).getroot()  
tbl = doc.xpath("/html//table[2]")[0]  
data = process_table(tbl)

This only prints the first column :(

for i in data:  
    print i.next()

The following only imports the third row, and not the subsequent ones:

tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

Does anyone know a neat way to get all the data from row 3 onward into tbl and copy it into an array, so that it can be processed in a module with no lxml dependency?

Thanks in advance for your help, Alex


一萌ing 2024-08-14 11:54:18

This is a generator:

def process_row(row):  
     for cell in row.xpath('./td'):  
         print cell.text_content()  
         yield cell.text_content()

You're using it as though it returned a list. It doesn't. There are contexts in which it behaves like a list:

print [r for r in process_row(row)]

but that's only because generators and lists are both iterable, so a for loop accepts either. Using it where nothing iterates over it, e.g.:

return [process_row(row) for row in table.xpath('./tr')]

just creates a new generator object for each value of row without running any of them. The list holds unconsumed generators, so a single next() call on each one returns only the first cell it yields.
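To make the difference concrete, here is a minimal sketch in Python 3 syntax. It assumes plain lists of strings as stand-ins for the lxml rows (the `table` data is made up for illustration) and uses the built-in `next()` rather than the Python 2 `.next()` method from the question.

```python
def process_row(row):
    # Stand-in for the generator in the question: yields each cell of a row.
    for cell in row:
        yield cell

# Hypothetical rows; plain lists replace the lxml <tr> elements.
table = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]

# A list of generator objects: no cell has been produced yet.
lazy = [process_row(row) for row in table]

# One next() per generator returns only the first cell of each row.
first_cells = [next(gen) for gen in lazy]

# Draining the same generators gives the remaining cells; after this they are exhausted.
rest = [list(gen) for gen in lazy]

# Wrapping each call in list() materializes complete, reusable rows.
eager = [list(process_row(row)) for row in table]

print(first_cells)
print(rest)
print(eager)
```

The `eager` form is the one to pass on to other code, since a list can be indexed, sliced and iterated as often as needed.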

So that's your first problem. Your second one is that you're expecting:

tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

to give you the third and all subsequent rows, but it only sets tbl to the third row. The call to xpath does return the third and all subsequent rows; it's the [0] at the end that throws away everything except the first of them.
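As a rough illustration, the sketch below drops the trailing `[0]` and turns every matched row into a plain list of strings. The `PAGE` markup is invented for the example, and `lxml.html.fromstring` is used on an inline string instead of `parse(url)` so that the snippet is self-contained; it is written in Python 3 syntax.

```python
from lxml import html

# Hypothetical sample page: the second table has two header rows, then data rows.
PAGE = """
<html><body>
<table><tr><td>ignored</td></tr></table>
<table>
  <tr><td>title</td><td></td><td></td></tr>
  <tr><td>A</td><td>B</td><td>C</td></tr>
  <tr><td>1</td><td>x</td><td>10</td></tr>
  <tr><td>2</td><td>y</td><td>20</td></tr>
</table>
</body></html>
"""

doc = html.fromstring(PAGE)

# No trailing [0]: keep the whole list of matching <tr> elements.
rows = doc.xpath("//body/table[2]//tr[position()>2]")

# Convert each row to a list of cell strings so nothing downstream needs lxml.
data = [[cell.text_content().strip() for cell in row.xpath("./td")] for row in rows]

print(data)
```

Because `data` contains only built-in lists and strings, it can be handed to a module that has no lxml dependency.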


打小就很酷 2024-08-14 11:54:18

You need to use a loop to access the row's data, like this:

for row in data:  
    for col in row:
        print col

Calling next() once as you did will access only the first item, which is why you see one column.

Note that due to the nature of generators, you can only iterate over them once. If you changed the call process_row(row) into list(process_row(row)), each generator would be converted to a list, which can be reused.

Update: If you need just the 3rd row and on, use data[2:].
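Once the rows are plain lists, the rule from the question (stop at the first row whose second column is empty, and discard that row) can be applied without lxml. The following is only a sketch in Python 3 syntax: the function name `rows_until_empty_second_column` and the sample `data` are hypothetical, and treating a row with fewer than two cells as "empty" is an assumption.

```python
def rows_until_empty_second_column(rows):
    """Return rows up to, but not including, the first row whose second cell is empty."""
    kept = []
    for row in rows:
        # Assumption: a row with fewer than two cells counts as having an empty second column.
        if len(row) < 2 or not row[1].strip():
            break  # discard this row and abort processing of the table
        kept.append(row)
    return kept

# Hypothetical table contents: two header rows followed by data rows.
data = [
    ["head", "head", "head"],
    ["sub", "sub", "sub"],
    ["1", "x", "10"],
    ["2", "y", "20"],
    ["3", "", "30"],
    ["4", "z", "40"],
]

# data[2:] skips the first two rows, as suggested in the update above.
result = rows_until_empty_second_column(data[2:])
print(result)
```

Rows after the one with the empty second column are never examined, which matches the "processing is aborted" requirement.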
