Python BeautifulSoup: parsing specific text

Posted 2024-12-26 00:55:16

I am parsing an HTML file and I want to find the part of the file where it says "Smaller Reporting Company" and check whether it has an "X" or checkbox next to it. The checkbox is typically done with the Wingdings font or an ASCII code. In the HTML below you'll see it has a þ in Wingdings next to it.

I have no problem showing the results of a regular expression search for the text, but I'm having trouble going the next step and looking for a check box.

I will be using this to parse a number of different HTML files that won't all follow the same format, but most of them will use a table and ASCII text like this example.

Here is the HTML code:

<HTML>
<HEAD><TITLE></TITLE></HEAD>
<BODY>
<DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of “large accelerated filer,” “accelerated filer” and “smaller reporting company”. (Check one):
</DIV>

<DIV align="center">
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<!-- Begin Table Head -->
<TR valign="bottom">
    <TD width="22%"> </TD>
    <TD width="3%"> </TD>
    <TD width="22%"> </TD>
    <TD width="3%"> </TD>
    <TD width="22%"> </TD>
    <TD width="3%"> </TD>
    <TD width="22%"> </TD>
</TR>
<TR></TR>
<!-- End Table Head -->
<!-- Begin Table Body -->
<TR valign="bottom">
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
    </TD>
    <TD> </TD>
    <TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
    </TD>
    <TD> </TD>
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
    <FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT>
    </TD>
    <TD> </TD>
    <TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR>
<!-- End Table Body -->
</TABLE>
</DIV></BODY></HTML>

Here is my Python code:

import os, sys, string, re
from BeautifulSoup import BeautifulSoup

rawDataFile = "testfile1.html"
f = open(rawDataFile)
soup = BeautifulSoup(f)
f.close()

search = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
print search

Question:
How could I set this up to have a second search that is dependent upon the first search? So when I find "smaller reporting company" I can search the next few lines to see if there is an ascii code? I've been going through the soup docs. I tried to do find and findNext but I haven't been able to get it to work.

Comments (3)

蓝礼 2025-01-02 00:55:16

If you know the position of the wingding character won't change, you can use .next.

>>> nodes = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
>>> nodes[-1].next.next  # last item in list is the only good one... kinda crap
u'þ'

Or you can go up, and then find from there:

>>> nodes[-1].parent.find('font',style="font-family: Wingdings").next
u'þ'

Or you could do it the other way round:

>>> soup.findAll(text='þ')[0].previous.previous
u' Smaller reporting company '

This assumes that you know the Wingdings characters you're looking for.

The last strategy has the added bonus of filtering out other crap that your regex is catching, which I suppose you don't really want; you can then just cycle through the results knowing that you're only working on the right list, so you can peruse it to your liking.
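For anyone on Python 3, here is a rough sketch of the "go up, then find from there" approach using the `bs4` package (BeautifulSoup 4), where `findAll` is spelled `find_all` and the `text=` argument is spelled `string=`. The sample markup is a trimmed copy of the table from the question, and `find_marks` is just a name made up for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Trimmed sample from the question; &#254; is the character that
# shows up as a checked box when rendered in Wingdings.
HTML = """
<table><tr>
<td><font style="white-space: nowrap"> Non-accelerated filer <font style="font-family: Wingdings">o</font> </font>
<font style="white-space: nowrap">(Do not check if a smaller reporting company)</font></td>
<td><font style="white-space: nowrap"> Smaller reporting company <font style="font-family: Wingdings">&#254;</font></font></td>
</tr></table>
"""

LABEL = re.compile(r"smaller\s+reporting\s+company", re.I)
WINGDINGS = re.compile(r"wingdings", re.I)


def find_marks(html):
    """Return (label, glyph) for each matching label that has a Wingdings font next to it."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for node in soup.find_all(string=LABEL):
        # Step up to the tag holding the text, then look inside it
        # for a <font> whose style mentions Wingdings.
        font = node.parent.find("font", style=WINGDINGS)
        if font is not None:
            results.append((node.strip(), font.get_text()))
    return results


marks = find_marks(HTML)
print(marks)
```

Matches without a Wingdings font inside their parent, such as the "(Do not check if a smaller reporting company)" note, are skipped, which takes care of the "last item in the list is the only good one" problem without indexing by position.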

风轻花落早 2025-01-02 00:55:16

You may try iterating through the structure and checking for values inside the inner tags or checking for values in the outer tags. I can't remember offhand how to do it and I ended up using lxml for this, but I think bsoup may be able to do this.

If you can't get bsoup to do it, check out lxml. It is potentially faster depending upon what you are doing. It also has hooks for using bsoup with lxml.
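To make the "iterate through the structure" idea concrete, here is a minimal sketch that uses only the standard library's `html.parser` (Python 3), so it is neither bsoup nor lxml. It remembers the last plain text it saw and, when it meets text inside a Wingdings `<font>`, records that text as the mark for the remembered label. The `CheckboxScanner` class and the `CHECKED` / `UNCHECKED` sets are assumptions made up for this example; the glyph meanings in particular should be verified against your own files.

```python
from html.parser import HTMLParser

# Assumed Wingdings meanings (check these against real filings):
CHECKED = {"\u00fe", "\u00fd", "x"}   # ticked box, crossed box, boxed x
UNCHECKED = {"o", "\u00a8"}           # empty boxes

SAMPLE = """
<td><font style="white-space: nowrap"> Large accelerated filer <font style="font-family: Wingdings">o</font></font></td>
<td><font style="white-space: nowrap"> Smaller reporting company <font style="font-family: Wingdings">&#254;</font></font></td>
"""


class CheckboxScanner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.font_stack = []   # one bool per open <font>: is it Wingdings?
        self.last_label = ""   # most recent text seen outside a Wingdings font
        self.found = []        # collected (label, state) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            style = dict(attrs).get("style") or ""
            self.font_stack.append("wingdings" in style.lower())

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.font_stack and self.font_stack[-1]:
            # Text inside a Wingdings font: treat it as the mark.
            if text in CHECKED:
                state = "checked"
            elif text in UNCHECKED:
                state = "unchecked"
            else:
                state = "unknown"
            self.found.append((self.last_label, state))
        else:
            self.last_label = text


scanner = CheckboxScanner()
scanner.feed(SAMPLE)
scanner.close()
print(scanner.found)
```

Because it walks the document in order rather than relying on a fixed table layout, this may cope a little better with files that don't all follow the same format, though it still assumes the mark comes right after its label.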

莳間冲淡了誓言ζ 2025-01-02 00:55:16

lxml has a tolerant HTML parser. You don't need bsoup (which is now deprecated by its author) and you should avoid regexes for parsing HTML.

Here is a first rough cut at what you are looking for:

guff = """\
<HTML>
<HEAD><TITLE></TITLE></HEAD>
[snip]
</DIV></BODY></HTML>
"""
from lxml.html import fromstring
doc = fromstring(guff)
for td_el in doc.iter('td'):
    font_els = list(td_el.iter('font'))
    if not font_els: continue
    print
    for el in font_els:
        print (el.text, el.attrib)

This produces:

(' Large accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})

('Accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})

(' Non-accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('(Do not check if a smaller reporting company)', {'style': 'white-space: nowrap'})

(' Smaller reporting company ', {'style': 'white-space: nowrap'})
(u'\xfe', {'style': 'font-family: Wingdings'})
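
Taking the rough cut one step further, the following Python 3 sketch pairs each Wingdings glyph with the text of the `<font>` element that encloses it. It assumes, as in the sample, that the glyph's `<font>` is nested directly inside the label's `<font>`; `label_glyph_pairs` is a hypothetical helper name, and files with a different layout would need a different rule for locating the label.

```python
from lxml.html import fromstring

# Trimmed sample from the question.
SAMPLE = """
<table><tr>
<td><font style="white-space: nowrap"> Large accelerated filer <font style="font-family: Wingdings">o</font></font></td>
<td><font style="white-space: nowrap"> Smaller reporting company <font style="font-family: Wingdings">&#254;</font></font></td>
</tr></table>
"""


def label_glyph_pairs(html):
    """Return (label, glyph) for every <font> whose style mentions Wingdings."""
    doc = fromstring(html)
    pairs = []
    for font in doc.iter("font"):
        style = (font.get("style") or "").lower()
        if "wingdings" not in style:
            continue
        outer = font.getparent()
        # The label is the text of the enclosing element that precedes the glyph.
        label = (outer.text or "").strip() if outer is not None else ""
        pairs.append((label, (font.text or "").strip()))
    return pairs


pairs = label_glyph_pairs(SAMPLE)
print(pairs)
```

From there it is a dictionary lookup to decide whether a given glyph counts as checked, once you know which characters your files use.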