Python BeautifulSoup: parsing specific text
I am parsing an HTML file and I want to find the part of the file where it says "Smaller Reporting Company" and determine whether or not it has an "X" or checkbox next to it. The checkbox is typically done with the Wingdings font or an ASCII code. In the HTML below you'll see it has a þ in Wingdings next to it.
I have no problem showing the results of a regular expression search for the text, but I'm having trouble going the next step and looking for a check box.
I will be using this to parse a number of different html files that won't all follow the same format, but most of them will use a table and ascii text like this example.
Here is the HTML code:
<HTML>
<HEAD><TITLE></TITLE></HEAD>
<BODY>
<DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of large accelerated filer, accelerated filer and smaller reporting company. (Check one):
</DIV>
<DIV align="center">
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<!-- Begin Table Head -->
<TR valign="bottom">
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
</TR>
<TR></TR>
<!-- End Table Head -->
<!-- Begin Table Body -->
<TR valign="bottom">
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR>
<!-- End Table Body -->
</TABLE>
</DIV></BODY></HTML>
Here is my Python code:
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

rawDataFile = "testfile1.html"
f = open(rawDataFile)
soup = BeautifulSoup(f)
f.close()
# Collect every text node that mentions "smaller reporting company"
search = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
print search
Question:
How could I set this up to have a second search that is dependent upon the first search? So when I find "smaller reporting company" I can search the next few lines to see if there is an ascii code? I've been going through the soup docs. I tried to do find and findNext but I haven't been able to get it to work.
Comments (3)
If you know the position of the Wingdings character won't change, you can use .next on the matched text node. Or you can go up to the parent and then find from there. Or you could do it the other way round and search for the Wingdings character first.
This assumes that you know the Wingdings characters you're looking for.
The last strategy has the added bonus of filtering out the other noise that your regex is catching, which I suppose you don't really want; you can then loop over the results knowing that you're only working on the right list, and apply whatever if checks you like.
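The three strategies above might look roughly like the following. This is only a sketch, not the answerer's original code. It assumes Beautiful Soup 4 (the bs4 package) rather than the BeautifulSoup 3 import used in the question, so it uses next_element and find_all. The trimmed HTML sample, the helper names, and the set of glyphs treated as "checked" are all assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup, Tag

# Trimmed version of the table row from the question (assumed sample input).
HTML = """<TABLE><TR>
<TD><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT></TD>
<TD><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT></TD>
<TD><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR></TABLE>"""

LABEL_RE = re.compile(r"smaller\s+reporting\s+company", re.I)
WINGDINGS_RE = re.compile(r"wingdings", re.I)
# Assumption: these glyphs mean "checked"; adjust the set for your own files.
CHECKED_MARKS = {"þ", "ý", "x", "X"}

soup = BeautifulSoup(HTML, "html.parser")


def is_wingdings(node):
    """True if node is a <font> tag whose style mentions Wingdings."""
    return (isinstance(node, Tag) and node.name == "font"
            and bool(WINGDINGS_RE.search(node.get("style", ""))))


def mark_via_next(text_node):
    """Strategy 1: look at the element that directly follows the label text."""
    nxt = text_node.next_element
    return nxt.get_text(strip=True) if is_wingdings(nxt) else None


def mark_via_parent(text_node):
    """Strategy 2: go up to the enclosing tag, then search down from there."""
    font = text_node.parent.find("font", style=WINGDINGS_RE)
    return font.get_text(strip=True) if font else None


def checked_labels(tree):
    """Strategy 3: start from the Wingdings tags and read the text before each."""
    labels = []
    for font in tree.find_all("font", style=WINGDINGS_RE):
        if font.get_text(strip=True) in CHECKED_MARKS:
            prev = font.previous_sibling
            labels.append(prev.strip() if isinstance(prev, str) else "")
    return labels


# The regex also hits the "(Do not check if ...)" note, which has no checkbox.
matches = soup.find_all(string=LABEL_RE)
for node in matches:
    print(repr(node.strip()),
          mark_via_next(node) in CHECKED_MARKS,
          mark_via_parent(node) in CHECKED_MARKS)
print(checked_labels(soup))
```

Starting from the Wingdings tags (the third function) sidesteps the extra regex match entirely, which is the filtering benefit described above.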
You may try iterating through the structure and checking for values inside the inner tags or checking for values in the outer tags. I can't remember off hand how to do it and I ended up using lxml for this, but I think bsoup may be able to do this.
If you can't get bsoup to do it check out lxml. It is potentially faster depending upon what you are doing. It also has hooks for using bsoup with lxml.
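As a dependency-free illustration of "iterating through the structure and checking the inner tags", here is a minimal sketch that uses Python's standard html.parser module instead of bsoup or lxml. That substitution, the CheckboxScanner class name, the trimmed sample HTML, and the set of "checked" glyphs are all assumptions for the sake of the example.

```python
from html.parser import HTMLParser

# Trimmed version of the table row from the question (assumed sample input).
HTML = """<TABLE><TR>
<TD><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT></TD>
<TD><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT></TD>
<TD><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR></TABLE>"""

# Assumption: these glyphs mean "checked"; adjust the set for your own files.
CHECKED_MARKS = {"þ", "ý", "x", "X"}


class CheckboxScanner(HTMLParser):
    """Walk the document and pair each Wingdings glyph with the text before it."""

    def __init__(self):
        super().__init__()
        self.font_stack = []  # one bool per open <font>: is it Wingdings?
        self.last_text = ""   # most recent non-blank text outside Wingdings
        self.pairs = []       # (label, glyph) tuples in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "font":
            style = dict(attrs).get("style") or ""
            self.font_stack.append("wingdings" in style.lower())

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.font_stack and self.font_stack[-1]:
            # Text inside a Wingdings <font>: treat it as the checkbox glyph.
            self.pairs.append((self.last_text, text))
        else:
            self.last_text = text


scanner = CheckboxScanner()
scanner.feed(HTML)
scanner.close()

# Map each label to whether its glyph counts as checked.
status = {label: glyph in CHECKED_MARKS for label, glyph in scanner.pairs}
for label, checked in status.items():
    print(label, checked)
```

Because the scanner only records text that sits immediately before a Wingdings glyph, the "(Do not check if a smaller reporting company)" note never ends up as a label.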
lxml has a tolerant HTML parser. You don't need bsoup (which is now deprecated by its author) and you should avoid regexes for parsing HTML.
Here is a first rough cut at what you are looking for:
This produces:
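The answer's original snippet and its output are not shown above. As a stand-in, here is a minimal sketch of how an lxml-based version might look; it is not the answerer's code. The trimmed sample HTML, the variable names, and the set of "checked" glyphs are assumptions. It finds every Wingdings font element and pairs it with the nearest non-blank text that precedes it, without using regular expressions.

```python
from lxml import html

# Trimmed version of the document from the question (assumed sample input).
HTML = """<html><body><table><tr>
<td><font style="white-space: nowrap">Accelerated filer <font style="font-family: Wingdings">o</font></font></td>
<td><font style="white-space: nowrap"> Non-accelerated filer <font style="font-family: Wingdings">o</font> </font>
<font style="white-space: nowrap">(Do not check if a smaller reporting company)</font></td>
<td><font style="white-space: nowrap"> Smaller reporting company <font style="font-family: Wingdings">þ</font></font></td>
</tr></table></body></html>"""

# Assumption: these glyphs mean "checked"; adjust the set for your own files.
CHECKED_MARKS = {"þ", "ý", "x", "X"}

doc = html.fromstring(HTML)

results = {}
for font in doc.iter("font"):
    # Only <font> elements whose style mentions Wingdings are checkboxes.
    if "wingdings" not in (font.get("style") or "").lower():
        continue
    glyph = (font.text or "").strip()
    # Nearest non-blank text node before this element, in document order.
    preceding = font.xpath("preceding::text()[normalize-space()][1]")
    label = preceding[0].strip() if preceding else ""
    results[label] = glyph in CHECKED_MARKS

for label, checked in results.items():
    print(label, checked)
```

Looking up the label through the preceding axis, rather than relying on a fixed parent/child layout, is a deliberate choice here: the question says the files will not all follow the same format.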