Using XPath to extract the text content of table cells, row by row?

Posted on 2025-01-07 23:49:55 | Length: 1722 | Views: 0 | Comments: 0

I have HTML along the following lines. I would like to extract the contents of the table cells, but I discovered that some cells occasionally contain embedded divs, and perhaps other oddities that I'm not aware of yet:

<p align="center">
    <img src="some_image.gif" alt="Some Title">
</p>
<TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
<TR>
<TD colspan=4 ALIGN=center><b>Title</b></TD>
</TR>
<TR>
<TD ALIGN=center>Title</TD>
<TD ALIGN=center>date</TD>
<TD ALIGN=center>value</TD>
<TD ALIGN=center>value</TD>
</TR><TR>
  <TD ALIGN=center>Title2</TD>
  <TD ALIGN=center></TD>
  <TD ALIGN=center><div class=redtext>----</div></TD>
  <TD>&nbsp;</TD>
</TR><TR>
  <TD ALIGN=center>Title3</TD>
  <TD ALIGN=center><div class=yellowtext>value</div></TD>
  <TD ALIGN=center><div class=redtext>value</div></TD>
  <TD ALIGN=center>value<SUP>6</SUP></TD>
</TR><TR>
  <TD ALIGN=center>Title4</TD>
  <TD ALIGN=center><div class=bluetext>value</div></TD>
  <TD ALIGN=center><div class=redtext>value</div></TD>
  <TD>&nbsp;</TD>
</TR></TABLE>

<blockquote>
    <p class="textstyle">
        Text.
    </p>
</blockquote>

My first impulse was to extract ALL element texts and slice them up programmatically. I would watch for Title1, Title2, etc. to know when a row starts, and if a "----" is found (meaning no value), skip that row and move on. However, I realized that there is probably a better way of handling this with XPath directly.

How could this be solved with XPath so that it returns each cell's final text content, without having to walk into each div if one exists? Or is there a more XPath-like way to approach this?

Obviously I'm looking for the most flexible solution, one that will not be brittle if other unexpected elements crop up, even though they are unlikely.

Comments (3)

夜血缘 2025-01-14 23:49:55

The provided text isn't a well-formed XML document, therefore XPath isn't applicable to it as it stands.

If you correct it and convert it to a well-formed XML document such as the one below, an expression like this might be useful:

/*/TABLE//TD//text()

or even:

//TABLE//TD//text()

Here is a well-formed XML document, constructed from the provided HTML:

<html>
    <p align="center">
        <img src="some_image.gif" alt="Some Title"/>
    </p>
    <TABLE WIDTH="500" BORDER="1" class="textwhite" ALIGN="center" CELLPADDING="0" CELLSPACING="0">
        <TR>
            <TD colspan="4" ALIGN="center">
                <b>Title</b>
            </TD>
        </TR>
        <TR>
            <TD ALIGN="center">Title</TD>
            <TD ALIGN="center">date</TD>
            <TD ALIGN="center">value</TD>
            <TD ALIGN="center">value</TD>
        </TR>
        <TR>
            <TD ALIGN="center">Title2</TD>
            <TD ALIGN="center"></TD>
            <TD ALIGN="center">
                <div class="redtext">----</div>
            </TD>
            <TD> </TD>
        </TR>
        <TR>
            <TD ALIGN="center">Title3</TD>
            <TD ALIGN="center">
                <div class="yellowtext">value</div>
            </TD>
            <TD ALIGN="center">
                <div class="redtext">value</div>
            </TD>
            <TD ALIGN="center">value
                <SUP>6</SUP>
            </TD>
        </TR>
        <TR>
            <TD ALIGN="center">Title4</TD>
            <TD ALIGN="center">
                <div class="bluetext">value</div>
            </TD>
            <TD ALIGN="center">
                <div class="redtext">value</div>
            </TD>
            <TD> </TD>
        </TR>
    </TABLE>
    <blockquote>
        <p class="textstyle">         Text.     </p>
    </blockquote>
</html>
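
As a complement to the hand-corrected document above, here is a minimal sketch that assumes lxml is available and lets its HTML parser repair the markup instead. The sample string is a shortened stand-in for the question's table, and `cell_texts` is a hypothetical helper name. Because the HTML parser lowercases element names, the expression becomes `//table//td`, and `normalize-space(.)` is evaluated per cell so each TD yields exactly one string:

```python
from lxml import etree

# Shortened, deliberately not well-formed sample in the style of the question.
RAW_HTML = """
<TABLE WIDTH=500 BORDER=1>
<TR>
  <TD ALIGN=center>Title3</TD>
  <TD ALIGN=center><div class=yellowtext>value</div></TD>
  <TD ALIGN=center>value<SUP>6</SUP></TD>
  <TD>&nbsp;</TD>
</TR></TABLE>
"""


def cell_texts(markup):
    """Return one whitespace-normalized string per table cell."""
    root = etree.fromstring(markup, etree.HTMLParser())
    texts = []
    # The HTML parser lowercases element names, hence //table//td.
    for td in root.xpath("//table//td"):
        text = td.xpath("normalize-space(.)")
        # normalize-space() leaves U+00A0 (&nbsp;) alone, so handle it here.
        texts.append(text.replace("\u00a0", " ").strip())
    return texts


print(cell_texts(RAW_HTML))
```

Evaluating the string function once per cell, rather than selecting `//text()` for the whole table, keeps the cell boundaries and avoids the whitespace-only text nodes that pretty-printed markup produces.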
我早已燃尽 2025-01-14 23:49:55

So maybe you don't want to walk the divs, but here is my solution using lxml, which I highly recommend:

import re
from io import StringIO
from lxml import etree

def getTable(html, table_xpath, rows_xpath, cells_xpath):
    """Get a table on a webpage"""
    parser = etree.HTMLParser()
    # Build document tree and get table
    root = etree.parse(StringIO(html), parser)
    table = root.find(table_xpath)
    if table is None:
        print('No table.')
        return []
    rows = table.findall(rows_xpath)
    document = []
    def cleanText(text):
        """Clean up text by removing line breaks and tabs."""
        return re.sub(r'[\r\n\t]+', '', str(text).strip())
    # Iterate over the table rows and collect text from each cell.
    for r in rows:
        cells = r.findall(cells_xpath)
        rowdata = []
        for c in cells:
            text = ''
            # itertext() yields the cell's own text and that of nested elements.
            for i in c.itertext():
                text += cleanText(i) + ' '
            rowdata.append(text)
        document.append(rowdata)
    return document


html = """
<html><head><title></title></head><body>
<p align="center">
    <img src="some_image.gif" alt="Some Title">
    </p>
    <TABLE WIDTH=500 BORDER=1 class=textwhite ALIGN=center CELLPADDING=0 CELLSPACING=0>
    <TR>
    <TD colspan=4 ALIGN=center><b>Title</b></TD>
    </TR>
    <TR>
    <TD ALIGN=center>Title</TD>
    <TD ALIGN=center>date</TD>
    <TD ALIGN=center>value</TD>
    <TD ALIGN=center>value</TD>
    </TR><TR>
    <TD ALIGN=center>Title2</TD>
    <TD ALIGN=center></TD>
    <TD ALIGN=center><div class=redtext>----</div></TD>
    <TD> </TD>
    </TR><TR>
    <TD ALIGN=center>Title3</TD>
    <TD ALIGN=center><div class=yellowtext>value</div></TD>
    <TD ALIGN=center><div class=redtext>value</div></TD>
    <TD ALIGN=center>value<SUP>6</SUP></TD>
    </TR><TR>
    <TD ALIGN=center>Title4</TD>
    <TD ALIGN=center><div class=bluetext>value</div></TD>
    <TD ALIGN=center><div class=redtext>value</div></TD>
    <TD> </TD>
</TR></TABLE>   
</body>
</html>
"""
# Relative path: find() on a parsed tree warns about paths that start with "/".
tp = ".//table[@width='500']"
rt = "tr"
# Note: this only matches cells that carry align="center".
cp = "td[@align='center']"

doc = getTable(html, tp, rt, cp)
print(repr(doc))
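
The question also mentions skipping rows whose value is "----". A minimal sketch of that post-processing step, assuming rows arrive as nested lists of strings like the ones `getTable` builds; `drop_placeholder_rows` and the sample data are hypothetical and only illustrate the idea:

```python
PLACEHOLDER = "----"  # marker the question uses for "no value"


def drop_placeholder_rows(rows, placeholder=PLACEHOLDER):
    """Strip every cell and skip rows in which any cell is the placeholder."""
    kept = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if placeholder in cells:
            continue
        kept.append(cells)
    return kept


# Hypothetical sample rows, shaped like the nested lists built above.
sample = [
    ["Title2 ", "", "---- "],
    ["Title3 ", "value ", "value ", "value 6 "],
]
print(drop_placeholder_rows(sample))
```

Keeping this filter separate from the extraction means the XPath expressions stay generic, and the "no value" convention can change without touching the parser code.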
时光与爱终年不遇 2025-01-14 23:49:55

I suspect your program will run into many problems as the input data changes: what if the case of 'Title' changes, or there is a typo?

It's not really possible to write a rigorous solution for scraping someone else's website, as they can change everything completely without notice. It is normally better to write tolerant and flexible code that at least tries to verify that its output is sane. In this case it's probably best to iterate over the results of '//table/tr' and then, inside that loop, process the td elements:

import lxml.etree
tree = lxml.etree.fromstring("<table><tr><td>test</td></tr><tr><td><div>test2</div></td></tr></table>")

def stringify(x):
    # Join every text node below the element into a single string.
    return "".join(x.xpath(".//text()"))

for x in tree.xpath("//table/tr"):
    print("New row")
    for y in x.xpath("td"):
        print(stringify(y))

Output:

New row
test
New row
test2

The following code will, however, get the list you ask for:

print(list(map(stringify, tree.xpath("//table/tr/td"))))

Output:

['test', 'test2']

This collects all text nodes descended, at any depth, from a td that is a direct child of a tr, which is in turn a direct child of a table, and joins them into one string per td.

(Simply asking for all text() nodes will create some funny bugs when run on HTML that contains "<td>Foo <b>bar</b></td>" or similar, because the text of a single cell comes back as several separate nodes.)
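
To make that pitfall concrete, here is a small sketch, assuming lxml as in the code above. It compares selecting raw `text()` nodes with evaluating the XPath `string(.)` function once per cell; the variable names are only illustrative:

```python
import lxml.etree

tree = lxml.etree.fromstring("<table><tr><td>Foo <b>bar</b></td></tr></table>")

# One result per text node: the single cell is split at the <b> boundary.
text_nodes = tree.xpath("//td//text()")

# One result per cell: string(.) concatenates all descendant text of the td.
cell_strings = [td.xpath("string(.)") for td in tree.xpath("//table/tr/td")]

print(text_nodes)
print(cell_strings)
```

The first list has two entries for one cell, while the second has exactly one entry per td, which is usually what a row-oriented consumer expects.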
